[00:01:58] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 93 % full
[00:03:49] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 58 % full
[00:05:36] !log mwscript deleteEqualMessages.php --wiki maiwiki
[00:05:38] Reedy: i'd like to reinstall bast1001 soonish, i noticed your home dir is 53G, it's no problem to copy it back but all needed?
[00:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:07:44] RoanKattouw: still need /home/catrope/oldlogs ? ^ for the same reason
[00:07:57] On bast1001?
[00:08:01] yes
[00:09:07] Hah, I didn't realize I had 3 gigs of old log files from the wtp* servers lying around from back in 2013
[00:09:21] gwicke: ^ same wtp* files there for you i think
[00:09:39] RoanKattouw: :) not a big deal to copy them all but i thought it'd be old, yea
[00:10:28] thanks
[00:10:28] That may have been from that outage caused by a bug in the rewritten logging backend that caused disk space exhaustion on all wtp* servers around the same time
[00:11:11] *nod* and in the other case it's the videos from devsummit or wikimania
[00:11:28] Yeah those can be deleted too, they've probably been imported to Commons already
[00:12:58] i'm not sure that actually happened
[00:13:16] keep seeing ancient tickets about that
[00:13:38] like we had a ticket for the new videos before we managed to upload the ones from the year before
[00:14:18] yea 2014 is still open https://phabricator.wikimedia.org/T84465
[00:14:38] this too https://phabricator.wikimedia.org/T106038
[00:14:46] not even surprised
[00:15:31] "the output of the conversion is larger than swift can handle, and as a result can't be uploaded to commons. resolving this task. " of course that resolves it
[00:25:59] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 90 % full
[00:27:58] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 54 % full
[00:35:08] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 65.52% of data above the critical threshold [5000000.0]
[00:38:38] !log mwscript deleteEqualMessages.php --wiki newiki
[00:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:06:42] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 390.36 seconds
[01:18:58] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: puppet fail
[01:19:58] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[01:31:58] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 73.33% of data above the critical threshold [5000000.0]
[01:37:56] anyone doing anything on db1048 (m3)
[01:38:59] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 90 % full
[01:40:49] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 27 % full
[01:41:13] volans: nope...
[01:42:00] because the replica is just stopped
[01:42:21] hmm
[01:42:32] * YuviPanda doesn't know what to do / help
[01:42:33] and I can see a connection with user root from dbstore1001 that is doing basically a dump
[01:42:50] on phabricator_repository
[01:42:51] volans: ah, I think you've maybe discovered the strange ways of the eventlogging replication
[01:42:54] oh
[01:43:01] not that then.
[01:43:22] you could kill it and see if it recovers.
[01:43:33] * YuviPanda doesn't know what he's saying or doing
[01:43:42] I'm just here for the company :)
[01:43:57] I can see a peak every day at this time in bytes sent, so looks like a daily job
[01:44:28] * volans checking icinga for past alerts
[01:46:10] and now it was started again
[01:46:39] so yes we have a script that stops the replica and does a dump... it would be so kind if it put a downtime on the icinga alert too :)
[01:47:09] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[01:47:52] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.12 seconds
[01:50:18] PROBLEM - MariaDB Slave Lag: x1 on db2008 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 321.20 seconds
[01:51:42] PROBLEM - MariaDB Slave Lag: x1 on db1031 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 398.22 seconds
[01:51:44] PROBLEM - MariaDB Slave Lag: x1 on db2009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 403.12 seconds
[01:51:53] yep, x1 turn
[01:52:02] is this just dumps?
[01:52:06] yes
[01:52:18] PROBLEM - puppet last run on mw1080 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[01:52:32] ok!
[01:52:55] not really, they shouldn't alarm IMHO
[01:53:01] volans: you're in the sf evening dead zone, where usually I'm the only one around. I don't know much about databases, but happy to help in whatever way I can if help is needed.
[01:53:20] nothing to do really
[01:53:28] thanks for the offer
[01:53:55] ok!
[01:54:09] RECOVERY - MariaDB Slave Lag: x1 on db2008 is OK: OK slave_sql_lag Replication lag: 0.23 seconds
[01:55:32] RECOVERY - MariaDB Slave Lag: x1 on db1031 is OK: OK slave_sql_lag Replication lag: 0.28 seconds
[01:55:34] RECOVERY - MariaDB Slave Lag: x1 on db2009 is OK: OK slave_sql_lag Replication lag: 0.41 seconds
[01:55:58] crontab completed, last one was x1
[01:58:15] checked with irc logs, it's pretty regular each week
[02:04:02] opened a phab task for tracking
[02:12:44] (03PS1) 10BryanDavis: Labs: Preserve env for vagrant commands [puppet] - 10https://gerrit.wikimedia.org/r/283118 (https://phabricator.wikimedia.org/T120186)
[02:13:09] bd808: let me know if you want a merge
[02:13:30] YuviPanda: I'm going to test it out and see if it actually works :)
[02:13:41] but hopefully I'll poke you soon
[02:14:41] bd808: ok!
[02:14:47] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 91 % full
[02:16:28] (03CR) 10BryanDavis: "Tested via cherry-pick. Works as expected." [puppet] - 10https://gerrit.wikimedia.org/r/283118 (https://phabricator.wikimedia.org/T120186) (owner: 10BryanDavis)
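The weekly dump job discussed above stops replication on db1048 and trips the MariaDB slave-lag alerts; volans notes it would be friendlier if the job scheduled an Icinga downtime first. A minimal sketch of how a dump wrapper could do that through Icinga's external command file, using the standard SCHEDULE_SVC_DOWNTIME command; the command-file path and the author/comment strings are assumptions, not the actual job:

    import time

    # Hypothetical path; the real location depends on the Icinga installation.
    ICINGA_CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"

    def schedule_downtime(host, service, minutes, author, comment):
        """Schedule a fixed service downtime before a noisy maintenance job."""
        now = int(time.time())
        end = now + minutes * 60
        # SCHEDULE_SVC_DOWNTIME;<host>;<service>;<start>;<end>;<fixed>;<trigger_id>;<duration>;<author>;<comment>
        cmd = (f"[{now}] SCHEDULE_SVC_DOWNTIME;{host};{service};"
               f"{now};{end};1;0;{minutes * 60};{author};{comment}\n")
        with open(ICINGA_CMD_FILE, "a") as f:
            f.write(cmd)

    # Example: silence the slave-lag check on db1048 for the duration of the weekly dump.
    schedule_downtime("db1048", "MariaDB Slave Lag: m3", 120,
                      "dump-cron", "weekly dump stops replication")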
[02:16:36] YuviPanda: ^ looks good
[02:16:37] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 42 % full
[02:16:58] (03CR) 10Yuvipanda: [C: 032] Labs: Preserve env for vagrant commands [puppet] - 10https://gerrit.wikimedia.org/r/283118 (https://phabricator.wikimedia.org/T120186) (owner: 10BryanDavis)
[02:17:17] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[02:27:48] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.20) (duration: 11m 56s)
[02:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:37:48] PROBLEM - Hadoop NodeManager on analytics1046 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[02:39:37] RECOVERY - Hadoop NodeManager on analytics1046 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[02:53:05] (03PS3) 10Yuvipanda: tools: Don't track mediawiki's font list for precise [puppet] - 10https://gerrit.wikimedia.org/r/282958 (https://phabricator.wikimedia.org/T132282)
[02:53:12] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Don't track mediawiki's font list for precise [puppet] - 10https://gerrit.wikimedia.org/r/282958 (https://phabricator.wikimedia.org/T132282) (owner: 10Yuvipanda)
[03:03:18] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 91 % full
[03:05:08] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 24 % full
[03:14:25] (03PS1) 10Yuvipanda: tools: Remove non-precise fonts from precise hosts [puppet] - 10https://gerrit.wikimedia.org/r/283121 (https://phabricator.wikimedia.org/T132282)
[03:14:49] (03CR) 10jenkins-bot: [V: 04-1] tools: Remove non-precise fonts from precise hosts [puppet] - 10https://gerrit.wikimedia.org/r/283121 (https://phabricator.wikimedia.org/T132282) (owner: 10Yuvipanda)
[03:27:18] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 90 % full
[03:29:08] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 30 % full
[03:29:16] 06Operations, 06Commons: Unable to restore file that has a very large file size - https://phabricator.wikimedia.org/T131832#2202028 (10Pokefan95) p:05Triage>03High
[03:38:38] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 94 % full
[03:40:19] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 62 % full
[03:46:57] (03PS2) 10Yuvipanda: tools: Remove non-precise fonts from precise hosts [puppet] - 10https://gerrit.wikimedia.org/r/283121 (https://phabricator.wikimedia.org/T132282)
[03:51:27] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 94 % full
[03:53:18] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 28 % full
[04:02:47] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 96 % full
[04:04:37] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 57 % full
[04:05:41] (03CR) 10Yuvipanda: [C: 032] tools: Remove non-precise fonts from precise hosts [puppet] - 10https://gerrit.wikimedia.org/r/283121 (https://phabricator.wikimedia.org/T132282) (owner: 10Yuvipanda)
[04:30:29] PROBLEM - Hadoop NodeManager on analytics1057 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[04:41:59] 06Operations, 10Traffic, 07HTTPS: HTTPS error on status.wikimedia.org (watchmouse certificate mismatch) - https://phabricator.wikimedia.org/T131017#2202093 (10Pokefan95) p:05Triage>03Normal
[04:43:37] RECOVERY - Hadoop NodeManager on analytics1057 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[04:50:48] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 93 % full
[04:52:39] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 57 % full
[04:58:28] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: puppet fail
[05:09:04] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202098 (10Dzahn)
[05:11:22] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202114 (10Pokefan95) p:05Triage>03Normal
[05:12:06] 06Operations, 10Traffic, 07HTTPS: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2202115 (10Pokefan95) p:05Triage>03Normal
[05:14:08] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202116 (10Dzahn) looking at the config i already see: 13 Header always set Strict-Transport-Security "max-age=604800" isn't it already enabled?
[05:14:59] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 93 % full
[05:15:30] ori: sorry, I was sleeping
[05:15:58] btw bblack and ori, my bot went back editing at 1:38 CEST
[05:16:49] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 54 % full
[05:16:50] actually I can now reach everything
[05:20:37] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202117 (10Dzahn) 05Open>03Invalid already resolved/invalid it's enabled and *.planet. uses use standard cache cluster termination, it's misc-web, besides having a separate wildcard cert,...
[05:22:26] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202119 (10Dzahn) @Pokefan do me a favor and update https://wikitech.wikimedia.org/wiki/HTTPS/domains ? can't login on wikitech due to lack of second factor
[05:26:19] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[05:29:06] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202121 (10Pokefan95) @Dhann: Doing...
[05:29:23] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202123 (10Dzahn)
[05:29:25] 06Operations, 10Traffic, 07HTTPS: Enable HSTS on Wikimedia sites - https://phabricator.wikimedia.org/T40516#2202122 (10Dzahn)
[05:31:00] 06Operations, 10Traffic, 07HTTPS: Enable HSTS on Wikimedia sites - https://phabricator.wikimedia.org/T40516#1146945 (10Dzahn)
[05:31:02] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202098 (10Dzahn) 05Invalid>03Resolved @Pokefan95 thank you , then there was actually something to resolve, heh
[05:35:13] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202129 (10Dzahn) the change that enabled this was https://gerrit.wikimedia.org/r/#/c/253758/ on 2015-11-18
[05:36:41] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202130 (10Pokefan95) @Dhann: For now, I just changed it from "No" to "Yes". What is the duration of the HSTS?
[05:39:10] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202131 (10Dzahn) it's max-age=31536000 (from https://www.ssllabs.com/ssltest/analyze.html?d=es.planet.wikimedia.org&s=208.80.153.248) so that means [[ https://duckduckgo.com/?q=31536000+seco...
[05:39:17] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 97 % full
[05:39:30] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202132 (10Pokefan95) Ah, ok, thanks
[05:41:07] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 51 % full
[06:03:28] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 91 % full
[06:04:53] analytics1057 and 1046 (node manager down) were probably having issues for stuff like "CPU 14 THERMAL EVENT TSC 6c7bed1096c482"
[06:05:01] 06Operations, 07Puppet, 06Labs: Implement role based hiera lookups for labs - https://phabricator.wikimedia.org/T120165#1847021 (10mmodell) Is this really difficult to do? I'm very interested in fixing this but not at all sure where to start.
[06:05:02] that is not the first analytics* host
[06:05:04] sigh
[06:05:18] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 39 % full
[06:07:01] ---^ tons of TIME_WAITs (a lot towards rdb hosts)
[06:07:49] also, [Wed Apr 13 05:38:08 2016] nf_conntrack: table full, dropping packet in the dmesg, might want to check to avoid problems with job runners
[06:08:06] moritzm: --^ (morning :)
[06:08:40] (brb)
[06:21:28] RECOVERY - Updater process on wdqs1002 is OK: PROCS OK: 1 process with UID = 998 (blazegraph), regex args ^java .* org.wikidata.query.rdf.tool.Update
[06:27:28] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 92 % full
[06:29:19] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 62 % full
[06:30:10] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: puppet fail
[06:32:08] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:28] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:29] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:35:07] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 99 % full
[06:37:14] <_joe_> why only that server
[06:37:18] <_joe_> goddamnit
[06:38:48] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 60 % full
[06:46:54] 06Operations, 10Traffic, 07HTTPS: Preload HSTS for select hostnames within wikimedia.org - https://phabricator.wikimedia.org/T111967#2202184 (10BBlack) Yeah, I'm in the process of enumerating those....
[06:47:58] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 99 % full
[06:48:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Inline comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott)
[06:51:37] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 32 % full
[06:56:18] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:57:28] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:57] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:59] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:58] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 96 % full
[07:02:49] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 55 % full
[07:11:59] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 97 % full
[07:15:38] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 28 % full
[07:21:52] 06Operations, 10ops-eqiad, 06Analytics-Kanban: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2202237 (10elukey)
[07:22:59] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 99 % full
[07:26:22] 06Operations, 10Traffic, 07HTTPS: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2202245 (10BBlack) ========= //Audit Data// ========= Methodology: ----------------- The starting point is our raw D...
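The HSTS audit comment above and the *.planet.wikimedia.org thread earlier both come down to what Strict-Transport-Security header a site actually serves: the apache config showed max-age=604800 (one week) while ssllabs reported max-age=31536000 (one year; 31536000 seconds is 365 days). A small sketch for checking the header directly, assuming the third-party requests library is available:

    import re
    import requests

    def hsts_max_age_days(url):
        """Return the HSTS max-age served by a site, in days, or None if absent."""
        resp = requests.get(url, timeout=10)
        hsts = resp.headers.get("Strict-Transport-Security")
        if not hsts:
            return None
        match = re.search(r"max-age=(\d+)", hsts)
        return int(match.group(1)) / 86400 if match else None

    # e.g. 365.0 for max-age=31536000, 7.0 for max-age=604800
    print(hsts_max_age_days("https://es.planet.wikimedia.org/"))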
[07:26:38] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 33 % full
[07:26:48] !log temporarily bumped connection tracking table size on mw1163 to 512k (randomly spiking)
[07:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:27:40] elukey: I'll add the diamond collectors we applied to the kafka brokers to the job runners so that we have better data
[07:27:50] (03CR) 10DCausse: [C: 031] Convert mwgrep to use regexp by default (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/283107 (owner: 10EBernhardson)
[07:30:20] moritzm: I was about to ask the same :)
[07:34:44] 06Operations, 10Traffic, 07HTTPS: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2202250 (10BBlack) While we should fix all of these issues in the long term (they should all be 301->https on the same...
[07:43:16] 06Operations, 10Traffic, 07HTTPS: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2202254 (10BBlack) As for the rest of the work, IMHO we should re-purpose the wiki tracking page at https://wikitech.w...
[07:48:23] 06Operations, 10ops-eqiad, 10DBA: db1070, db1071 and db1065 overheating problems - https://phabricator.wikimedia.org/T132515#2202271 (10Volans) For reference I found it looking at a small spike in connections errors from here: https://logstash.wikimedia.org/#/dashboard/elasticsearch/wfLogDBError
[07:56:37] 06Operations: rsync module doesnt work on trusty - https://phabricator.wikimedia.org/T132532#2201690 (10MoritzMuehlenhoff) There are plenty of trusty systems with rsync::server, e.g. the swift-be systems.
[08:01:44] (03CR) 10Stats: [C: 031] varnishmedia: ignore transactions without status code [puppet] - 10https://gerrit.wikimedia.org/r/282976 (https://phabricator.wikimedia.org/T132430) (owner: 10Ema)
[08:02:35] (03CR) 10Elukey: [C: 031] "The Stats review was me logged in Gerrit with a different user, sorry :)" [puppet] - 10https://gerrit.wikimedia.org/r/282976 (https://phabricator.wikimedia.org/T132430) (owner: 10Ema)
[08:05:02] volans: cmjonhson has said previously there has been a batch that has been overheating and needing new thermal paste
[08:05:44] (03CR) 10BBlack: [C: 031] varnishmedia: ignore transactions without status code [puppet] - 10https://gerrit.wikimedia.org/r/282976 (https://phabricator.wikimedia.org/T132430) (owner: 10Ema)
[08:06:37] p858snake|L_: thanks, you mean a bunch of servers from the same order?
[08:06:58] no idea
[08:07:44] There was the phab host recently (iridium?) and I have seen passing reports about others
[08:07:51] but haven't paid attention
[08:08:35] yes, I'm aware of them, I was chatting with chris yesterday, he will take a look today if it's an hot spot on the rack or not
[08:11:00] (03PS1) 10Elukey: Add nfconntrack and TCP States diamond collectors to Jobrunners. [puppet] - 10https://gerrit.wikimedia.org/r/283129
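The recurring mw1163 alerts above compare the live conntrack entry count against the table limit; the fix applied at 07:26 was to raise nf_conntrack_max to 512k. A minimal sketch of the same fullness calculation using the standard /proc/sys/net/netfilter counters (the 90% threshold mirrors the alert level seen here, not the exact Icinga plugin; writing the new limit needs root):

    def read_int(path):
        with open(path) as f:
            return int(f.read().strip())

    count = read_int("/proc/sys/net/netfilter/nf_conntrack_count")
    limit = read_int("/proc/sys/net/netfilter/nf_conntrack_max")
    pct = 100.0 * count / limit
    print(f"nf_conntrack is {pct:.0f} % full ({count}/{limit})")

    if pct > 90:
        # Temporary bump, equivalent to `sysctl -w net.netfilter.nf_conntrack_max=524288`;
        # a permanent change would belong in sysctl.d / puppet instead.
        with open("/proc/sys/net/netfilter/nf_conntrack_max", "w") as f:
            f.write("524288\n")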
[08:12:31] mutante: errm
[08:13:24] mutante: 7.0M now
[08:13:24] :P
[08:13:58] (03PS1) 10Muehlenhoff: Update to Linux 4.4.7 [debs/linux44] - 10https://gerrit.wikimedia.org/r/283130
[08:17:11] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update to Linux 4.4.7 [debs/linux44] - 10https://gerrit.wikimedia.org/r/283130 (owner: 10Muehlenhoff)
[08:17:58] (03PS3) 10Ema: varnishmedia: ignore transactions without status code [puppet] - 10https://gerrit.wikimedia.org/r/282976 (https://phabricator.wikimedia.org/T132430)
[08:18:22] (03CR) 10Ema: [C: 032 V: 032] varnishmedia: ignore transactions without status code [puppet] - 10https://gerrit.wikimedia.org/r/282976 (https://phabricator.wikimedia.org/T132430) (owner: 10Ema)
[08:19:38] PROBLEM - Hadoop NodeManager on analytics1054 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[08:19:45] boooooo
[08:20:19] --^ checking
[08:23:48] sad_trombone.wav
[08:23:54] java.lang.OutOfMemoryError: GC overhead limit exceeded
[08:24:19] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 65.52% of data above the critical threshold [5000000.0]
[08:25:44] (03PS5) 10Filippo Giunchedi: write cassandra instance yaml descriptors [puppet] - 10https://gerrit.wikimedia.org/r/277865 (owner: 10Eevans)
[08:25:51] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] write cassandra instance yaml descriptors [puppet] - 10https://gerrit.wikimedia.org/r/277865 (owner: 10Eevans)
[08:26:59] RECOVERY - Hadoop NodeManager on analytics1054 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[08:27:16] godog: .ogg >.>
[08:28:32] p858snake|L_: hehe wav felt like a touch of old
[08:28:35] Duplicate declaration: File[/etc/cassandra-instances.d] is already declared in file /etc/puppet/modules/cassandra/manifests/instance.pp:214; cannot redeclare at /etc/puppet/modules/cassandra/manifests/instance.pp:214
[08:28:39] thanks puppet
[08:30:57] (03CR) 10Muehlenhoff: [C: 031] Add nfconntrack and TCP States diamond collectors to Jobrunners. [puppet] - 10https://gerrit.wikimedia.org/r/283129 (owner: 10Elukey)
[08:30:57] PROBLEM - puppet last run on restbase1009 is CRITICAL: CRITICAL: puppet fail
[08:36:36] (03PS2) 10Elukey: Add nfconntrack and TCP States diamond collectors to Jobrunners. [puppet] - 10https://gerrit.wikimedia.org/r/283129
[08:38:42] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/2414/mw1163.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/283129 (owner: 10Elukey)
[08:45:54] (03PS1) 10Filippo Giunchedi: cassandra: don't require cassandra Package for cassandra-instances.d [puppet] - 10https://gerrit.wikimedia.org/r/283132
[08:48:21] (03PS2) 10Filippo Giunchedi: cassandra: don't require cassandra Package for cassandra-instances.d [puppet] - 10https://gerrit.wikimedia.org/r/283132
[08:52:27] (03PS3) 10Filippo Giunchedi: cassandra: don't require cassandra Package for cassandra-instances.d [puppet] - 10https://gerrit.wikimedia.org/r/283132
[08:52:28] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:52:57] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: don't require cassandra Package for cassandra-instances.d [puppet] - 10https://gerrit.wikimedia.org/r/283132 (owner: 10Filippo Giunchedi)
[08:53:04] is it the analytics joyful morning today? sigh.. checking aqs..
[08:53:31] 06Operations, 10Traffic, 07HTTPS: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2202415 (10Chmarkine) >>! In T132521#2202254, @BBlack wrote: > As for the rest of the work, IMHO we should re-purpose...
[08:53:53] (03PS1) 10Muehlenhoff: Drop Debian-specific logging fix (clashes with 4.4.7 point update). [debs/linux44] - 10https://gerrit.wikimedia.org/r/283133
[08:54:09] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy
[08:54:49] (03PS1) 10Volans: MariaDB: use Puppet certs for TLS [puppet] - 10https://gerrit.wikimedia.org/r/283134 (https://phabricator.wikimedia.org/T111654)
[08:55:13] (03CR) 10Muehlenhoff: [C: 032 V: 032] Drop Debian-specific logging fix (clashes with 4.4.7 point update). [debs/linux44] - 10https://gerrit.wikimedia.org/r/283133 (owner: 10Muehlenhoff)
[08:55:17] RECOVERY - puppet last run on restbase1009 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[08:55:28] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:57:19] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy
[08:57:48] (03PS1) 10Volans: Depool db1050 and db1041 for TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283135 (https://phabricator.wikimedia.org/T111654)
[09:02:58] PROBLEM - puppet last run on aqs1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:04:38] (03CR) 10Muehlenhoff: [C: 04-1] "See comment for proposed fix" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/282761 (owner: 10Chad)
[09:07:08] (03CR) 10Volans: "All changes looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/283134 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[09:17:38] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[09:19:51] (03CR) 10Gehel: [C: 031] "Looks good and there should be no issues now that T128813 is fixed. I'd like to wait for the reinstall of wdqs1002 to be done and tested b" [puppet] - 10https://gerrit.wikimedia.org/r/274864 (https://phabricator.wikimedia.org/T126730) (owner: 10Smalyshev)
[09:21:58] !log start upgrading TLS for cross-dc replica on shards s6 and s7 - T111654
[09:21:59] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654
[09:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:22:43] (03CR) 10Volans: [C: 032] Depool db1050 and db1041 for TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283135 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[09:22:47] (03CR) 10Gehel: Simplification of Cassandra Logstash filtering (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861) (owner: 10Jstenval)
[09:23:07] (03Merged) 10jenkins-bot: Depool db1050 and db1041 for TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283135 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[09:25:42] (03PS1) 10Volans: Repool db1050 and db1041 for TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283139 (https://phabricator.wikimedia.org/T111654)
[09:29:03] I had some issues with scap regarding some mw hosts anyone can give me an hand? _joe_ godog ?
[09:29:59] volans: sure, what problems?
[09:30:03] https://phabricator.wikimedia.org/P2891
[09:30:15] seems that there is an mw with a RO filesystem :/
[09:30:56] <_joe_> volans: so we need to depool it
[09:31:19] but which one is? the command that failed has a bunch of them
[09:31:45] <_joe_> yeah I was trying to understand that
[09:31:59] mw1080, https://phabricator.wikimedia.org/T132529
[09:32:11] salt "touch /src/__test" ? :D
[09:32:19] s/src/srv/
[09:33:09] oh yes godog it says "on mw1080.eqiad.wmnet returned" at the end of the line
[09:33:17] <_joe_> so just remove it from mediawiki-installation
[09:33:32] <_joe_> and from conftool-data
[09:33:34] never done...
[09:33:55] <_joe_> find puppet -name mediawiki-installation
[09:34:36] remove or comment?
[09:35:07] <_joe_> remove
[09:35:34] <_joe_> it's also already depooled
[09:36:54] (03PS1) 10Volans: MW: remove mw1080, has RO filesystem [puppet] - 10https://gerrit.wikimedia.org/r/283142 (https://phabricator.wikimedia.org/T111654)
[09:37:28] (03CR) 10Giuseppe Lavagetto: [C: 031] MW: remove mw1080, has RO filesystem [puppet] - 10https://gerrit.wikimedia.org/r/283142 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[09:37:37] (03PS2) 10Volans: MW: remove mw1080, has RO filesystem [puppet] - 10https://gerrit.wikimedia.org/r/283142 (https://phabricator.wikimedia.org/T111654)
[09:37:47] <_joe_> volans: check if it's a scap proxy by any chance
[09:38:30] I've grepped mw1080 in puppet repo and was only there and linux-host-entries of course
[09:38:35] <_joe_> ok
[09:38:38] anywhere else I should check?
[09:38:41] <_joe_> nope
[09:38:52] <_joe_> remember to run puppet-merge with sudo -i
[09:39:04] <_joe_> as this is going to do act on conftool
[09:39:06] 06Operations, 10Phabricator, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2202510 (10mmodell) Phabricator's support for "High Availability" is making progress recently, see [[ https://secure.phabricator.com/T10751 | upstream task (T10751) ]] for...
[09:39:37] ok, thanks, didn't know
[09:39:54] (03CR) 10Volans: [C: 032] MW: remove mw1080, has RO filesystem [puppet] - 10https://gerrit.wikimedia.org/r/283142 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[09:40:34] 06Operations, 10Traffic, 13Patch-For-Review, 07Varnish: Add icinga monitoring for varnish statistics daemons - https://phabricator.wikimedia.org/T131760#2202515 (10ema) 05Open>03Resolved
[09:41:02] _joe_: should I force puppet on tin then?
[09:41:17] <_joe_> volans: if you need to deploy again, yes
[09:41:18] (03PS2) 10Volans: MariaDB: use Puppet certs for TLS [puppet] - 10https://gerrit.wikimedia.org/r/283134 (https://phabricator.wikimedia.org/T111654)
[09:41:22] <_joe_> if not, it's going to run
[09:41:53] I don't know if scap finished properly
[09:44:47] 06Operations, 10Traffic, 07Graphite, 07HTTPS: HTTPS redirects for graphite.wikimedia.org - https://phabricator.wikimedia.org/T132461#2202522 (10fgiunchedi) the most critical I can think of is `check_graphite` which already supports https (not sure about following redirects) ``` $ /usr/lib/nagios/plugins/c...
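godog's suggestion above (salt "touch /srv/__test") is a quick way to find which app server has the read-only filesystem that broke scap. A per-host sketch of that probe, catching the EROFS error a read-only /srv would raise; the path and any orchestration around it (salt, cumin, a for loop over ssh) are assumptions:

    import errno
    import os

    def is_writable(path="/srv"):
        """Try to create and remove a scratch file; False means the mount is read-only."""
        probe = os.path.join(path, "__rw_probe")
        try:
            with open(probe, "w") as f:
                f.write("ok\n")
            os.unlink(probe)
            return True
        except OSError as e:
            if e.errno == errno.EROFS:
                return False
            raise

    if __name__ == "__main__":
        print("writable" if is_writable() else "read-only filesystem, depool me")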
[09:45:50] 06Operations, 10ops-eqiad: mw1080 - readonly fs - depooled - please check sda - https://phabricator.wikimedia.org/T132529#2201642 (10Volans) @Dzahn FYI scap failed on that host, I've merged this: https://gerrit.wikimedia.org/r/#/c/283142/
[09:47:28] (03CR) 10Volans: [C: 032] MariaDB: use Puppet certs for TLS [puppet] - 10https://gerrit.wikimedia.org/r/283134 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[09:47:51] <_joe_> so, puppet is failing on aqs1002
[09:48:00] <_joe_> was it converted to multi-instance?
[09:48:15] <_joe_> godog mobrovac ?
[09:48:30] nope
[09:48:44] but that's me, checking
[09:48:52] <_joe_> it laments there is no /etc/cassandra-instances.d directory
[09:49:03] yeah my lame mistake, I'll fix it
[09:49:06] <_joe_> (the error is more convoluted of course)
[09:49:16] _joe_ I was about to check, aqs100[23] had some troubles because of some cassandra timeouts (we are waiting for SSDs)
[09:49:37] ah ok thanks godog :)
[09:52:13] (03PS1) 10Filippo Giunchedi: cassandra: create /etc/cassandra-instances.d [puppet] - 10https://gerrit.wikimedia.org/r/283144
[09:53:26] still baffled that I can't eyeball a puppet change and tell whether it is going to fail to compile or not
[09:54:40] (03PS2) 10Muehlenhoff: Enable base::firewall for all rdb* servers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/282979
[09:56:37] PROBLEM - Hadoop NodeManager on analytics1043 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[09:57:18] PROBLEM - Hadoop NodeManager on analytics1049 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[10:01:36] (03PS2) 10Filippo Giunchedi: cassandra: create /etc/cassandra-instances.d [puppet] - 10https://gerrit.wikimedia.org/r/283144
[10:02:13] checking analytics hosts
[10:05:06] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: create /etc/cassandra-instances.d [puppet] - 10https://gerrit.wikimedia.org/r/283144 (owner: 10Filippo Giunchedi)
[10:05:17] RECOVERY - Hadoop NodeManager on analytics1049 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[10:05:26] they all went down for out of memory, sigh, opening a phab yask
[10:05:28] *task
[10:05:33] (03PS3) 10Muehlenhoff: Enable base::firewall for all rdb* servers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/282979
[10:05:39] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall for all rdb* servers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/282979 (owner: 10Muehlenhoff)
[10:06:16] moritzm: going to merge that too ^
[10:06:36] ok, I was just about to ask you whether I should merge your's along :-)
[10:08:27] RECOVERY - Hadoop NodeManager on analytics1043 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[10:09:38] 06Operations, 06Analytics-Kanban: Out of memory errors causing Yarn nodemanager to shutdown on analytics hosts - https://phabricator.wikimedia.org/T132559#2202590 (10elukey)
[10:14:16] 06Operations, 06Analytics-Kanban: Out of memory errors causing Yarn nodemanager to shutdown on analytics hosts - https://phabricator.wikimedia.org/T132559#2202603 (10elukey)
[10:17:15] 06Operations, 10Analytics, 10Traffic, 13Patch-For-Review, 07Varnish: varnishstatsd crashes with ValueError in vsl_callback without being restarted by systemd - https://phabricator.wikimedia.org/T132430#2202604 (10ema) One more: Feb 11 09:17:47 cp4010 varnishstatsd[2820]: Traceback (most recent call la...
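The varnishstatsd tracebacks quoted in T132430 above are what motivates the "log ValueError exceptions" patch merged later in the day (r/283179): instead of letting one malformed VSL record kill the daemon, the callback catches the error, logs it, and keeps consuming. A generic sketch of that pattern only; the decorator, callback name, and transaction shape are illustrative, not varnishstatsd's actual code:

    import functools
    import logging

    log = logging.getLogger("varnishstatsd")

    def log_value_errors(callback):
        """Wrap a VSL callback so a single bad record is logged instead of being fatal."""
        @functools.wraps(callback)
        def wrapper(*args, **kwargs):
            try:
                return callback(*args, **kwargs)
            except ValueError:
                log.exception("skipping malformed transaction")
        return wrapper

    @log_value_errors
    def vsl_callback(transaction):
        # Illustrative parsing step: this is the kind of place where the original
        # code raised ValueError on transactions missing a status code.
        return int(transaction["status"])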
[10:23:57] PROBLEM - mediawiki-installation DSH group on mw1080 is CRITICAL: Host mw1080 is not in mediawiki-installation dsh group
[10:24:46] 06Operations, 10Analytics: kafkatee cronspam from oxygen - https://phabricator.wikimedia.org/T132322#2202623 (10elukey) p:05Triage>03Low
[10:27:34] 06Operations, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: varnishkafka logrotate cronspam - https://phabricator.wikimedia.org/T129344#2102820 (10elukey) Installed on maps hosts by @ema, we will rollout the new version everywhere along wiht the Varnish 4 upgrade.
[10:29:23] 06Operations, 10ops-eqiad: ms-be1001.eqiad.wmnet: slot=8 dev=sdi failed - https://phabricator.wikimedia.org/T132142#2202643 (10fgiunchedi) 05Open>03Resolved thanks @Cmjohnson ! the disk came back in the right order, now rebuiling ``` /dev/sdi1 1.9T 2.2G 1.9T 1% /srv/swift-storage/sdi1 ```
[10:42:06] 06Operations, 10Analytics, 10Traffic, 13Patch-For-Review, 07Varnish: varnishstatsd crashes with ValueError in vsl_callback without being restarted by systemd - https://phabricator.wikimedia.org/T132430#2202664 (10ema) And this one: Mar 18 12:41:35 cp4010 varnishstatsd[10396]: Traceback (most recent ca...
[10:42:34] (03PS1) 10Giuseppe Lavagetto: puppet: add a function for performing conftool lookups [puppet] - 10https://gerrit.wikimedia.org/r/283151
[10:50:56] (03CR) 10Dereckson: "Sure, but the ProofreadPage code to use extension registration is in wmf/1.27.0-wmf.21." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281976 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson)
[10:52:25] (03CR) 10Volans: [C: 032] Repool db1050 and db1041 for TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283139 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[10:52:48] (03Merged) 10jenkins-bot: Repool db1050 and db1041 for TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283139 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[10:56:10] !log volans@tin Synchronized wmf-config/db-eqiad.php: Repool db1050 and db1041 after TLS upgrade - T111654 (duration: 00m 42s)
[10:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:04:14] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0
[11:04:22] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:04:54] ---^ checking
[11:07:44] (03CR) 10Filippo Giunchedi: [C: 04-1] "to be merged once the full switchover is in place (cfr. https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Media_storage.2FSwift )" [puppet] - 10https://gerrit.wikimedia.org/r/282893 (https://phabricator.wikimedia.org/T129089) (owner: 10Filippo Giunchedi)
[11:08:08] the problem on stat1003 seems to be libcairo2-dev install
[11:08:20] * elukey wonders what libcairo2-dev is
[11:08:31] elukey: transient or an error?
[11:08:43] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:09:56] volans: error, I am checking now
[11:10:40] 2016-04-13 10:55:21 remove libcairo2-dev:amd64 1.13.0~20140204-0ubuntu1.1
[11:14:33] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:15:12] (03PS3) 10Alexandros Kosiaris: scap: A basic workaround for the git clone issue [puppet] - 10https://gerrit.wikimedia.org/r/282992 (https://phabricator.wikimedia.org/T132267) (owner: 10Ladsgroup)
[11:15:18] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] scap: A basic workaround for the git clone issue [puppet] - 10https://gerrit.wikimedia.org/r/282992 (https://phabricator.wikimedia.org/T132267) (owner: 10Ladsgroup)
[11:16:43] PROBLEM - Hadoop NodeManager on analytics1053 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[11:18:42] RECOVERY - Hadoop NodeManager on analytics1053 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[11:18:51] --^ /me doing sudo service hadoop-yarn-nodemanager restart
[11:19:06] there is a phab task for this
[11:20:59] (03PS1) 10Faidon Liambotis: Remove allocation for old eqiad-ulsfo GTT link [dns] - 10https://gerrit.wikimedia.org/r/283152
[11:21:48] (03CR) 10Faidon Liambotis: [C: 032] Remove allocation for old eqiad-ulsfo GTT link [dns] - 10https://gerrit.wikimedia.org/r/283152 (owner: 10Faidon Liambotis)
[11:22:42] !log completed upgrading TLS for cross-dc replica on shards s6 and s7 - T111654
[11:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:22:50] cleanup!
[11:25:22] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:26:39] 07Blocked-on-Operations, 06Operations, 10RESTBase, 10hardware-requests: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#2202721 (10fgiunchedi)
[11:26:41] 07Blocked-on-Operations, 06Operations, 10RESTBase-Cassandra: expand raid0 in restbase200[1-6] - https://phabricator.wikimedia.org/T127951#2202719 (10fgiunchedi) 05Open>03Resolved done ``` restbase2001.codfw.wmnet: /dev/mapper/restbase2001--vg-srv 4.5T 2.5T 2.0T 57% /srv restbase2002.codfw.wmnet: /de...
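The Hadoop NodeManager alerts that keep firing in this log are a check_procs-style check: zero java processes with org.apache.hadoop.yarn.server.nodemanager.NodeManager in their arguments means the daemon OOMed and exited, and elukey is restarting it by hand (sudo service hadoop-yarn-nodemanager restart). A small sketch of the same check done locally by scanning /proc, so a wrapper could restart the service automatically; the auto-restart policy itself is an assumption, not what was running here:

    import os
    import subprocess

    NM_CLASS = "org.apache.hadoop.yarn.server.nodemanager.NodeManager"

    def nodemanager_running():
        """Return True if a java process with the NodeManager main class exists."""
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open(f"/proc/{pid}/cmdline", "rb") as f:
                    argv = f.read().split(b"\0")
            except OSError:
                continue  # process exited while we were scanning
            if argv and b"java" in argv[0] and NM_CLASS.encode() in argv:
                return True
        return False

    if not nodemanager_running():
        # Same command used by hand in the log above.
        subprocess.run(["service", "hadoop-yarn-nodemanager", "restart"], check=True)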
[11:27:42] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:33:23] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:43:30] we lost grrrit-wm
[11:46:06] !log volans@tin Synchronized wmf-config/db-eqiad.php: Reduce db1050 weight - T111654 (duration: 00m 30s)
[11:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:46:49] volans: it should restart itself iirc, if not the following people can give it a little push https://wikitech.wikimedia.org/wiki/Grrrit-wm#Access
[11:47:41] ok, let's see, thanks
[11:48:44] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: puppet fail
[11:58:29] (03PS2) 10Muehlenhoff: Enable base::firewall for rdb* servers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/282980
[12:00:23] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall for rdb* servers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/282980 (owner: 10Muehlenhoff)
[12:07:44] 06Operations, 06Commons, 10media-storage: Unable to restore file that has a very large file size - https://phabricator.wikimedia.org/T131832#2202787 (10Aklapper)
[12:16:03] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 63.33% of data above the critical threshold [5000000.0]
[12:18:24] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:18:30] (03PS1) 10Muehlenhoff: Disable base::firewall on rdb1* again, needs more tweaks [puppet] - 10https://gerrit.wikimedia.org/r/283160
[12:22:58] :(
[12:23:16] we have some issue with DPKG
[12:23:32] at least icinga started complaining on a lot of DBs, from puppet logs:
[12:23:35] W: Failed to fetch http://ubuntu.wikimedia.org/ubuntu/dists/trusty-updates/universe/binary-amd64/Packages Hash Sum mismatch
[12:24:21] <_joe_> moritzm: as I feared...
[12:24:25] <_joe_> sigh
[12:24:33] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[12:25:55] _joe_: icinga alarms just cleared now
[12:26:53] but I didn't check what the check checks (clear concept right :-P)
[12:28:33] RECOVERY - DPKG on labmon1001 is OK: All packages OK
[12:29:44] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: puppet fail
[12:30:32] 06Operations, 10media-storage, 13Patch-For-Review: swift capacity planning - https://phabricator.wikimedia.org/T1268#2202894 (10fgiunchedi) over the last year we're still averaging ~140GB/day or ~51TB/year (not including 3x replication) : [[ https://graphite.wikimedia.org/render/?width=816&height=329&_salt=1...
[12:30:47] 06Operations, 10Traffic, 10Wiki-Loves-Monuments-General, 07HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2202895 (10SindyM3) @Akoopal Thank you! I will contact server admin.
[12:32:46] volans: yeah, I noticed this on a few other hosts as well, this was also the reason for the earlier icinga alerts on stat100[23], since it failed to install a binary package which was partly unavailable due to the failing "apt-get update"
[12:34:10] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[12:38:52] (03CR) 10Elukey: "After a chat with Joe I believe that the picture is:" [puppet] - 10https://gerrit.wikimedia.org/r/282936 (https://phabricator.wikimedia.org/T131877) (owner: 10Elukey)
[12:40:05] jzerebecki: Dear anthropoid, the time has come. Please deploy MediaWiki deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160413T1240).
[12:47:30] PROBLEM - Hadoop NodeManager on analytics1044 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[12:50:44] ---^ taking care of it
[12:51:19] RECOVERY - Hadoop NodeManager on analytics1044 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[12:54:31] (03PS1) 10Muehlenhoff: Disable connection tracking for redis/jobqueue [puppet] - 10https://gerrit.wikimedia.org/r/283167
[12:55:11] (03CR) 10Muehlenhoff: [C: 032 V: 032] Disable base::firewall on rdb1* again, needs more tweaks [puppet] - 10https://gerrit.wikimedia.org/r/283160 (owner: 10Muehlenhoff)
[13:07:48] (03PS1) 10Volans: MariaDB: Use Puppet certs for s2 [puppet] - 10https://gerrit.wikimedia.org/r/283170 (https://phabricator.wikimedia.org/T111654)
[13:09:50] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[13:12:03] 06Operations, 06Analytics-Kanban: Out of memory errors causing Yarn nodemanager to shutdown on analytics hosts - https://phabricator.wikimedia.org/T132559#2203044 (10elukey) From https://grafana.wikimedia.org/dashboard/db/server-board I don't see major memory problems at host level, a lot of GBs are simply cac...
[13:14:33] (03CR) 10Volans: "changes looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/283170 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[13:14:46] !log start upgrading TLS for cross-dc replica on shards s2 - T111654
[13:14:51] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:17:20] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:18:41] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[13:22:18] !log jzerebecki@tin Started scap: php-1.27.0-wmf.21: Update Wikidata to wmf/1.27.0-wmf.21
[13:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:24:40] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:25:42] 06Operations, 10Beta-Cluster-Infrastructure, 07Performance: Need a way to simulate replication lag to test replag issues - https://phabricator.wikimedia.org/T40945#2203091 (10Nikerabbit)
[13:26:16] 06Operations, 06Analytics-Kanban: Out of memory errors causing Yarn nodemanager to shutdown on analytics hosts - https://phabricator.wikimedia.org/T132559#2203092 (10elukey) Another ticket was opened for the same thing https://phabricator.wikimedia.org/T102954
[13:26:25] (03PS2) 10Volans: MariaDB: Use Puppet certs for s2 [puppet] - 10https://gerrit.wikimedia.org/r/283170 (https://phabricator.wikimedia.org/T111654)
[13:26:50] 06Operations, 06Analytics-Kanban: Out of memory errors causing Yarn nodemanager to shutdown on analytics hosts - https://phabricator.wikimedia.org/T132559#2203094 (10elukey) 05Open>03Resolved p:05Triage>03Normal
[13:28:21] (03PS1) 10Muehlenhoff: Ferm rules for puppetmaster frontend [puppet] - 10https://gerrit.wikimedia.org/r/283174
[13:30:06] (03CR) 10Volans: [C: 032] MariaDB: Use Puppet certs for s2 [puppet] - 10https://gerrit.wikimedia.org/r/283170 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[13:30:42] (03CR) 10Andrew Bogott: Use half-baked ldap auth for librenms (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott)
[13:32:02] 06Operations, 10ops-eqiad, 06Analytics-Kanban: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2203112 (10elukey) ``` elukey@neodymium:~$ sudo -i salt -t 120 analytics10* cmd.run 'grep "Hardware event" /var/log/mcelog | uniq -c' analytics1041.eqiad.wmnet: analytics1...
[13:32:38] (03PS5) 10Andrew Bogott: Use half-baked ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702)
[13:32:51] (03PS1) 10Muehlenhoff: Add ferm rules for pybal_conf / http [puppet] - 10https://gerrit.wikimedia.org/r/283175
[13:33:29] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[13:34:05] (03CR) 10jenkins-bot: [V: 04-1] Add ferm rules for pybal_conf / http [puppet] - 10https://gerrit.wikimedia.org/r/283175 (owner: 10Muehlenhoff)
[13:34:37] elukey: any idea of what's up with an1045?
[13:34:50] (03PS2) 10Muehlenhoff: Add ferm rules for pybal_conf / http [puppet] - 10https://gerrit.wikimedia.org/r/283175
[13:36:31] paravoid: there's some discussion how to fix this in #wikimedia-analytics
[13:36:49] PROBLEM - Hadoop NodeManager on analytics1045 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[13:37:25] ok
[13:39:33] (03PS1) 10Volans: Remove require on resource not managed by Puppet [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/283176 (https://phabricator.wikimedia.org/T111654)
[13:39:40] PROBLEM - Hadoop NodeManager on analytics1056 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[13:39:58] Hmph!
[13:40:03] paravoid: hi! So the hadoop nodemanager registers a Java OOM and shuts down (no upstart script with respawn). ottomata saw this issue a while ago but very sporadically, and puppet was basically fixing the problem for us. We will investigate what's happening..
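The T132256 update above surveys the analytics hosts for thermal problems with a salt one-liner that greps /var/log/mcelog for "Hardware event" lines (the events suspected of taking out analytics1046/1057 earlier). The same count done locally, as a rough sketch of what that grep | uniq -c produces on one host:

    from collections import Counter

    def mce_summary(path="/var/log/mcelog"):
        """Count mcelog lines mentioning hardware events, roughly like `grep | uniq -c`."""
        counts = Counter()
        try:
            with open(path, errors="replace") as f:
                for line in f:
                    if "Hardware event" in line:
                        counts[line.strip()] += 1
        except FileNotFoundError:
            pass  # mcelog not enabled on this host
        return counts

    for line, n in mce_summary().most_common():
        print(n, line)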
[13:40:14] (03CR) 10Volans: [C: 032] Remove require on resource not managed by Puppet [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/283176 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[13:40:27] a while ago meaning over a year ago
[13:41:04] ottomata: just restarted yarn on 1045
[13:41:18] k
[13:41:30] am looking at logs there and on 56
[13:42:40] RECOVERY - Hadoop NodeManager on analytics1045 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[13:42:43] restarted also 1056
[13:42:45] (03PS1) 10Volans: MariaDB: Update submodule reference [puppet] - 10https://gerrit.wikimedia.org/r/283178 (https://phabricator.wikimedia.org/T111654)
[13:43:39] RECOVERY - Hadoop NodeManager on analytics1056 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[13:45:01] (03CR) 10Volans: [C: 032] MariaDB: Update submodule reference [puppet] - 10https://gerrit.wikimedia.org/r/283178 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[13:46:08] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[13:47:20] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:47:23] (03PS1) 10Ema: varnishstatsd: log ValueError exceptions [puppet] - 10https://gerrit.wikimedia.org/r/283179 (https://phabricator.wikimedia.org/T132430)
[13:51:10] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[13:52:00] (03CR) 10BBlack: [C: 031] varnishstatsd: log ValueError exceptions [puppet] - 10https://gerrit.wikimedia.org/r/283179 (https://phabricator.wikimedia.org/T132430) (owner: 10Ema)
[13:52:15] !log jzerebecki@tin Finished scap: php-1.27.0-wmf.21: Update Wikidata to wmf/1.27.0-wmf.21 (duration: 29m 56s)
[13:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:52:36] 06Operations, 10ops-eqiad: mw1080 - readonly fs - depooled - please check sda - https://phabricator.wikimedia.org/T132529#2203158 (10Dzahn) Ok, thanks. So that means depooling is 3 steps now? Running conftool, changing conftool config AND still changing the old dsh files?
[13:52:54] (03PS2) 10Ema: varnishstatsd: log ValueError exceptions [puppet] - 10https://gerrit.wikimedia.org/r/283179 (https://phabricator.wikimedia.org/T132430)
[13:53:04] (03CR) 10Ema: [C: 032 V: 032] varnishstatsd: log ValueError exceptions [puppet] - 10https://gerrit.wikimedia.org/r/283179 (https://phabricator.wikimedia.org/T132430) (owner: 10Ema)
[13:57:21] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2203163 (10BBlack) Notable: there's an ongoing report of 1.9.14 causing an HTTP/2 proto error in Chrome. We may need to be wary and stick with .13 or wait for .15: http://mailman.ngin...
[13:57:49] akosiaris: starttls fails immediately; any suggestions on what to check or try?
[13:58:44] 06Operations, 10Analytics, 10hardware-requests, 13Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#2203165 (10Ottomata) @Robh, over in T132067 it looks like these nodes were ordered, is this correct? If so, any idea on ETA?
[13:59:27] andrewbogott: fails ? interesting. Error message ?
[13:59:28] (03CR) 10Faidon Liambotis: "What does half-baked mean here? I'm okay with this in concept and, if tested/baby-sat on, in implementation as well." [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott)
[14:00:16] (03PS1) 10Urbanecm: Fix typo in newikibooks namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283183
[14:00:54] 06Operations: rsync module doesnt work on trusty - https://phabricator.wikimedia.org/T132532#2203182 (10Dzahn) For some reason ms-be systems have a **/etc/init.d/rsync ** but when i put an "include rsync::server" on osmium we do not get that init script and it just didnt exist. Putting the identical class on a...
[14:00:59] andrewbogott: the php call 'ldap_start_tls' returns 'false.' Very informative.
[14:01:37] https://www.irccloud.com/pastebin/CnEW4ryi/
[14:01:40] akosiaris: ^
[14:01:47] (03PS2) 10Urbanecm: Fix typo in newikibooks namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283183 (https://phabricator.wikimedia.org/T131754)
[14:01:54] (03CR) 10Faidon Liambotis: [C: 031] Enable ChallengeResponseAuthentication [puppet] - 10https://gerrit.wikimedia.org/r/281629 (owner: 10Muehlenhoff)
[14:02:07] andrewbogott: https://secure.php.net/manual/en/function.ldap-error.php
[14:02:12] that should help explain what happens
[14:02:25] ah
[14:02:27] it's already there
[14:02:30] (03CR) 10Andrew Bogott: "@faidon, I'm just pissed at the shitty librenmns ldap code that I had to dig through to get here. I will fix the commit message :)" [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott)
[14:02:45] and it does not return an error ?
[14:03:01] "Fatal error: LDAP TLS required but not successfully negotiated:Connect error"
[14:03:02] wth ?
[14:03:08] hmm
[14:03:13] ok lemme check this a bit then
[14:03:35] thanks. I'm watching the slapd logs but there's nothing of interest there that I can see
[14:05:00] akosiaris: could this be a missing cert on netmon1001?
[14:05:35] meaning the CA cert ? maybe ...
[14:05:44] gimme 10 mins and I 'll tell you what it is
[14:05:47] well at least I hope
[14:05:52] damned LDAP
[14:05:58] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[14:05:58] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[14:06:18] akosiaris: works for me :)
[14:06:48] Um, I mean, having you debug it works for me, not starttls. Starttls does not work for me, as established
[14:07:31] (03PS1) 10Ottomata: Use @yarn_heapsize, not @hadoop_heapsize when setting $yarn_heapsize [puppet/cdh] - 10https://gerrit.wikimedia.org/r/283185 (https://phabricator.wikimedia.org/T102954)
[14:08:09] (03CR) 10Ottomata: [C: 032 V: 032] Use @yarn_heapsize, not @hadoop_heapsize when setting $yarn_heapsize [puppet/cdh] - 10https://gerrit.wikimedia.org/r/283185 (https://phabricator.wikimedia.org/T102954) (owner: 10Ottomata)
[14:09:09] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:09:42] (03PS1) 10Ottomata: Update cdh submodule with yarn_heapsize fix, set yarn_heapsize to 2048m [puppet] - 10https://gerrit.wikimedia.org/r/283186
[14:09:48] PROBLEM - Hadoop NodeManager on analytics1047 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[14:10:41] (03CR) 10jenkins-bot: [V: 04-1] Update cdh submodule with yarn_heapsize fix, set yarn_heapsize to 2048m [puppet] - 10https://gerrit.wikimedia.org/r/283186 (owner: 10Ottomata)
[14:10:59] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[14:11:34] (03CR) 10Ottomata: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/283186 (owner: 10Ottomata)
[14:13:49] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:13:49] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:13:54] (03CR) 10Ottomata: [C: 032] Update cdh submodule with yarn_heapsize fix, set yarn_heapsize to 2048m [puppet] - 10https://gerrit.wikimedia.org/r/283186 (owner: 10Ottomata)
[14:13:58] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:13:58] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:15:48] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[14:16:02] (03CR) 10Alexandros Kosiaris: [C: 04-1] Use half-baked ldap auth for librenms (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott)
[14:16:19] andrewbogott: found it. it's 'ldap://ldap-labs.eqiad.wikimedia.org ldap://ldap-labs.codfw.wikimedia.org' otherwise cert names don't match DNS names
[14:16:22] commented on the change
[14:17:14] and I needed 10 mins and 10seconds (by some accounts)... missed my mark by 10 seconds
[14:17:16] grrr
[14:17:35] still pretty good
[14:17:48] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[14:19:39] hm, akosiaris, can you log in now?
[14:20:19] in librenms ? no
[14:20:24] ah silly me
[14:21:00] I was trying with full name instead of username, but that does not work either
[14:21:35] yeah, it gets further now but still doesn't work...
[14:21:40] let me figure out which change broke it
[14:21:55] (03CR) 10Ottomata: [C: 032] Add stat1004 configuration to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/282936 (https://phabricator.wikimedia.org/T131877) (owner: 10Elukey)
[14:22:35] 06Operations, 10ops-eqiad: mw1080 - readonly fs - depooled - please check sda - https://phabricator.wikimedia.org/T132529#2203236 (10Volans) @Dzahn I got help from @Joe and @fgiunchedi on IRC, I didn't know the whole thing. Looks like it is needed, but double check with them please.
[14:22:59] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:23:09] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:23:18] (03PS2) 10Elukey: Add stat1004 configuration to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/282936 (https://phabricator.wikimedia.org/T131877)
[14:23:34] mobrovac: I am betting that that gov site does not work again ^
[14:23:48] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
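The librenms login problem above was resolved once the LDAP servers were referenced by the DNS names that match their certificates (ldap-labs.eqiad.wikimedia.org / ldap-labs.codfw.wikimedia.org); STARTTLS fails up front when the name in the URI doesn't match the cert. The failure is easy to reproduce outside PHP with the python-ldap library, as a rough sketch; the CA bundle path is an assumption about the host's trust store:

    import ldap

    URI = "ldap://ldap-labs.eqiad.wikimedia.org:389"  # must match the cert's names

    conn = ldap.initialize(URI)
    # Assumed CA bundle location; adjust for the host in question.
    conn.set_option(ldap.OPT_X_TLS_CACERTFILE, "/etc/ssl/certs/ca-certificates.crt")
    conn.set_option(ldap.OPT_X_TLS_NEWCTX, 0)  # apply the TLS options to a fresh context
    try:
        conn.start_tls_s()
        print("STARTTLS negotiated")
    except ldap.LDAPError as e:
        # A hostname/cert mismatch or a missing CA shows up here as a connect error.
        print("STARTTLS failed:", e)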
[14:23:48] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:23:52] akosiaris: i think you have won that lottery [14:23:54] <_joe_> akosiaris: last time I checked, that was the case [14:24:16] yeah, like stealing candy from a baby [14:24:22] <_joe_> mobrovac: we should really add smarter timeouts to service_checker [14:24:40] (03PS1) 10ArielGlenn: don't allow en wiki dump jobs to overlap (yet) [puppet] - 10https://gerrit.wikimedia.org/r/283188 [14:24:48] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2203239 (10Milimetric) I'd say we should wait for @ezachte to figure out what we should do with those 2T. Pinging. [14:25:13] akosiaris: i'm quite sure you wouldn't be able to steal candy from my niece [14:25:20] _joe_: as in configurable? [14:25:29] RECOVERY - Hadoop NodeManager on analytics1047 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [14:25:36] !log completed upgrading TLS for cross-dc replica on shards s2 - T111654 [14:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:26:11] _joe_: also notice that it's not the service_checker script that is failing now, but check_nrpe [14:26:12] \ [14:26:16] (03CR) 10ArielGlenn: [C: 032] don't allow en wiki dump jobs to overlap (yet) [puppet] - 10https://gerrit.wikimedia.org/r/283188 (owner: 10ArielGlenn) [14:28:26] Morning anomie ostriches thcipriani MarkTraceur Krenair, how's it going? Just wondering who'll do the SWAT deploy this morning, and if it's possible to add a CN patch that needs to sync first to mw1017 for a small amount of extra testing? [14:29:04] 06Operations, 10ops-eqiad: mw1080 - readonly fs - depooled - please check sda - https://phabricator.wikimedia.org/T132529#2201642 (10Cmjohnson) it this something we want to fix or just decommission? [14:31:49] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [14:31:58] heya, anybody know what happened to the salt role grains? [14:32:07] they stopped working a LONG time ago, and I never investigated [14:32:10] https://wikitech.wikimedia.org/wiki/Salt#List.2Fping_all_nodes_with_a_puppet_role [14:32:14] is this not true? [14:32:20] 06Operations, 10ops-eqiad: mw1080 - readonly fs - depooled - please check sda - https://phabricator.wikimedia.org/T132529#2203275 (10Joe) decommission it, it would've been decommissioned after next week anyways. [14:32:36] (03PS6) 10Andrew Bogott: Use ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) [14:32:52] not sure about those, but you can match server groups via the debdeploy salt grains [14:33:00] (03PS2) 10BBlack: Use https://config-master.wm.o for rolematcher T132459 [puppet] - 10https://gerrit.wikimedia.org/r/283089 [14:33:08] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [14:33:10] (03CR) 10BBlack: [C: 032 V: 032] Use https://config-master.wm.o for rolematcher T132459 [puppet] - 10https://gerrit.wikimedia.org/r/283089 (owner: 10BBlack) [14:33:14] 06Operations, 10ops-eqiad: mw1080 - readonly fs - depooled - please check sda - https://phabricator.wikimedia.org/T132529#2203276 (10Joe) @Dzahn no, the immediate depooling is done via conftool; I asked to remove it from conftool-data since it's going to be decommissioned. 
[14:33:16] HMMM [14:33:17] ja [14:33:17] debdeploy-hadoop-worker [14:33:18] ok [14:33:19] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [14:33:29] those aren't very automated though, but i guess its cool [14:33:36] wondering if i should add it back into system::role or something [14:33:49] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [14:33:57] (03CR) 10jenkins-bot: [V: 04-1] Use ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott) [14:34:23] they're mostly based on the roles (except the canary ones) [14:34:41] right, but they all have to be added manually, i'm sure there are many roles that don't have debdeploy grains, no? [14:35:15] !log start cleanup on restbase100[569] - T128107 [14:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:35:59] (03PS7) 10Andrew Bogott: Use ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) [14:36:16] \o/ [14:36:24] 06Operations, 10ops-eqiad: mw1080 - readonly fs - depooled - please check sda - https://phabricator.wikimedia.org/T132529#2203301 (10Dzahn) @Volans @joe gotcha, thank you [14:37:24] 06Operations, 10Pybal, 10Traffic, 07HTTPS, 13Patch-For-Review: HTTPS redirects for config-master.wikimedia.org - https://phabricator.wikimedia.org/T132459#2203303 (10BBlack) Re: rolematcher - the only real host I could trace it to in puppetization was fluorine. However, post-merge the update did not get... [14:37:25] ottomata: all roles should have debdeploy grains (except a few corner cases), if I spot some systems not matched by existing grains I add them [14:37:37] so unless they're very fresh they should be in there [14:37:44] (03CR) 10Andrew Bogott: "Tested, looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott) [14:38:17] !log restarting hadoop-yarn-nodemanager on all hadoop worker nodes one by one to apply increase in heap size [14:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:39:01] (03PS1) 10BBlack: config-master.wm.o HTTPS redirect T132459 [puppet] - 10https://gerrit.wikimedia.org/r/283190 [14:39:08] AndyRussG: I'll SWAT this morning, Go ahead and add your patch and we'll do it at the end of the window. [14:39:50] thcipriani: cool thanks much! [14:39:59] just preparing the patch itself [14:40:09] okie doke. 
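A rough sketch of the grain matching mentioned above, run from the salt master; the exact name/value layout of the debdeploy grains is an assumption here, so it is worth inspecting one host before relying on a pattern:
```
# Inspect what grains a known Hadoop worker actually carries...
salt 'analytics1047*' grains.items
# ...then target the whole group by grain (value glob, since the value is not shown above).
salt -G 'debdeploy-hadoop-worker:*' test.ping
```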
[14:41:41] (03CR) 10BBlack: [C: 032] config-master.wm.o HTTPS redirect T132459 [puppet] - 10https://gerrit.wikimedia.org/r/283190 (owner: 10BBlack) [14:42:37] 06Operations, 10Traffic, 07HTTPS: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#2203332 (10BBlack) [14:42:39] 06Operations, 10Pybal, 10Traffic, 07HTTPS, 13Patch-For-Review: HTTPS redirects for config-master.wikimedia.org - https://phabricator.wikimedia.org/T132459#2203330 (10BBlack) 05Open>03Resolved a:03BBlack [14:42:41] (03PS1) 10Volans: MariaDB: use Puppet cert for s1 cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/283192 (https://phabricator.wikimedia.org/T111654) [14:42:48] (03PS1) 10Volans: Depool db1057 for TLS upgrade for s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283193 (https://phabricator.wikimedia.org/T111654) [14:44:11] (03PS1) 10Ema: Install test version of ${vcl}.inc.vcl.erb [puppet] - 10https://gerrit.wikimedia.org/r/283194 [14:44:22] (03PS2) 10Volans: MariaDB: use Puppet cert for s1 cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/283192 (https://phabricator.wikimedia.org/T111654) [14:45:10] (03CR) 10BBlack: [C: 031] Install test version of ${vcl}.inc.vcl.erb [puppet] - 10https://gerrit.wikimedia.org/r/283194 (owner: 10Ema) [14:47:08] (03CR) 10Alexandros Kosiaris: [C: 031] Use ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott) [14:47:16] 06Operations, 10Traffic, 07HTTPS: HTTPS redirects for transparency.wikimedia.org - https://phabricator.wikimedia.org/T132464#2203360 (10BBlack) I don't see any mixed content in simple checks, and it seems to not use proto-absolute URLs in general. Since this site is clearly for human consumption, I'll proba... [14:47:17] (03PS3) 10Elukey: Add stat1004 configuration to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/282936 (https://phabricator.wikimedia.org/T131877) [14:48:17] (03PS8) 10Andrew Bogott: Use ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) [14:48:23] !log Yarn nodemanager Xmx size bumped up from 1000m to 2048 on all the analytics* hosts to overcome the OutOfMemory errors. [14:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:48:30] (03PS1) 10Ema: misc-backend.inc.vcl: set do_stream=false when testing [puppet] - 10https://gerrit.wikimedia.org/r/283195 [14:49:00] 06Operations, 10Analytics-Cluster, 10Traffic, 07HTTPS: HTTPS redirects for stats.wikimedia.org - https://phabricator.wikimedia.org/T132465#2203363 (10BBlack) I don't see any mixed content in simple checks (and we checked/fixed that in a much earlier ticket: (T93702). Since this site is clearly for human c... [14:49:12] (03CR) 10Elukey: [C: 032] Add stat1004 configuration to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/282936 (https://phabricator.wikimedia.org/T131877) (owner: 10Elukey) [14:50:00] (03PS3) 10Volans: MariaDB: use Puppet cert for s1 cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/283192 (https://phabricator.wikimedia.org/T111654) [14:51:27] 06Operations, 06Release-Engineering-Team, 10Traffic, 05Gitblit-Deprecate, 07HTTPS: HTTPS redirects for git.wikimedia.org - https://phabricator.wikimedia.org/T132460#2203375 (10BBlack) As gitblit is on the chopping block for deprecation anyways, my inclination is to go ahead and enable HTTPS for this soon... 
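For reference, the NodeManager heap bump logged above amounts to something like the following on each worker; whether the cdh Puppet module applies it exactly this way (via YARN_NODEMANAGER_HEAPSIZE in yarn-env.sh) is an assumption:
```
# yarn-env.sh: NodeManager JVM heap in MB (stock default is 1000)
export YARN_NODEMANAGER_HEAPSIZE=2048
# then restart the daemon, one worker at a time as in the !log above
service hadoop-yarn-nodemanager restart
```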
[14:51:45] (03CR) 10Volans: [C: 032] Depool db1057 for TLS upgrade for s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283193 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [14:52:10] (03Merged) 10jenkins-bot: Depool db1057 for TLS upgrade for s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283193 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [14:52:25] (03CR) 10BBlack: [C: 031] "+1 Works for now! In the long term, we'll want to figure out how to test multiple layers/tiers..." [puppet] - 10https://gerrit.wikimedia.org/r/283195 (owner: 10Ema) [14:52:51] (03PS16) 10Mobrovac: Kafka config: Add config functions [puppet] - 10https://gerrit.wikimedia.org/r/279280 (https://phabricator.wikimedia.org/T130371) [14:53:01] 06Operations, 10Gitblit, 06Release-Engineering-Team, 10Traffic, 07HTTPS: HTTPS redirects for git.wikimedia.org - https://phabricator.wikimedia.org/T132460#2203384 (10greg) [14:53:25] !log volans@tin Synchronized wmf-config/db-eqiad.php: Depool db1057 to upgrade TLS on s1 - T111654 (duration: 00m 26s) [14:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:53:49] PROBLEM - Hadoop NodeManager on analytics1048 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [14:53:55] !log start upgrading TLS for cross-dc replica on shards s1 - T111654 [14:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:54:09] (03PS2) 10Ema: Install test version of ${vcl}.inc.vcl.erb [puppet] - 10https://gerrit.wikimedia.org/r/283194 [14:54:12] (03CR) 10Mobrovac: "@Ottomata {{done}}" [puppet] - 10https://gerrit.wikimedia.org/r/279280 (https://phabricator.wikimedia.org/T130371) (owner: 10Mobrovac) [14:54:18] (03CR) 10Ema: [C: 032 V: 032] Install test version of ${vcl}.inc.vcl.erb [puppet] - 10https://gerrit.wikimedia.org/r/283194 (owner: 10Ema) [14:54:29] (03PS2) 10Ema: misc-backend.inc.vcl: set do_stream=false when testing [puppet] - 10https://gerrit.wikimedia.org/r/283195 [14:54:40] (03CR) 10Ema: [C: 032 V: 032] misc-backend.inc.vcl: set do_stream=false when testing [puppet] - 10https://gerrit.wikimedia.org/r/283195 (owner: 10Ema) [14:55:30] 06Operations, 10Datasets-General-or-Unknown, 10Traffic, 07HTTPS, 13Patch-For-Review: HTTPS redirects for datasets.wikimedia.org - https://phabricator.wikimedia.org/T132463#2203387 (10Ottomata) datasets.wikimedia.or is hosted on stat1001, not a dataset100x host. I think HTTPS only is fine, but maybe @mil... [14:55:49] RECOVERY - Hadoop NodeManager on analytics1048 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [14:56:14] (03PS3) 10Ema: New misc VTC test: 09-chunked-response-add-cl.vtc [puppet] - 10https://gerrit.wikimedia.org/r/282895 (https://phabricator.wikimedia.org/T128813) [14:56:22] (03CR) 10Ema: [C: 032 V: 032] New misc VTC test: 09-chunked-response-add-cl.vtc [puppet] - 10https://gerrit.wikimedia.org/r/282895 (https://phabricator.wikimedia.org/T128813) (owner: 10Ema) [14:58:33] andrewbogott: so any luck ? did you manage to get librenms working ? 
[14:58:36] (03PS9) 10Andrew Bogott: Use ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) [14:58:50] oh still not merged, sorry [14:58:52] akosiaris: yep, all good, I going to merge just as soon as I can catch up with the git head [14:58:57] ahaha [14:58:57] ok [14:59:51] (03CR) 10Andrew Bogott: [C: 032] Use ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott) [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160413T1500). [15:00:04] jdlrobson Urbanecm: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:11] \o [15:00:25] I'll SWAT today. [15:02:17] 06Operations: librsvg path patch needs to be applied for jessie - https://phabricator.wikimedia.org/T132584#2203430 (10MoritzMuehlenhoff) [15:02:26] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review: setup stat1004/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2203442 (10elukey) stat1004 is ready for a test! @Ottomata would you be the first one to try it? :P [15:02:28] Urbanecm: around for SWAT? I can get your patch out while I'm waiting on Jenkins for the others. [15:02:38] Yes. [15:02:56] 06Operations, 13Patch-For-Review: Configure librenms to use LDAP for authentication - https://phabricator.wikimedia.org/T107702#2203445 (10Andrew) 05Open>03Resolved a:03Andrew [15:03:08] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283183 (https://phabricator.wikimedia.org/T131754) (owner: 10Urbanecm) [15:03:27] 06Operations: librsvg path patch needs to be applied for jessie - https://phabricator.wikimedia.org/T132584#2203463 (10MoritzMuehlenhoff) [15:03:29] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM: Make all role::mediawiki::* classes compatible with debian jessie - https://phabricator.wikimedia.org/T131749#2203462 (10MoritzMuehlenhoff) [15:03:54] akosiaris: works for you now? [15:04:31] (03PS3) 10Thcipriani: Fix typo in newikibooks namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283183 (https://phabricator.wikimedia.org/T131754) (owner: 10Urbanecm) [15:04:33] (03CR) 10Volans: "change looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/283192 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [15:06:18] Thcipriani: I can see that you're rebasing my changes everytime when you SWAT my patches. So should I rebase them on master everytime when I upload it to Gerrit? [15:06:54] 06Operations, 10media-storage, 07Tracking: [tracking] refresh swift hardware in codfw/eqiad - https://phabricator.wikimedia.org/T130012#2203496 (10fgiunchedi) for the current refresh we're replacing 12x 2TB machines in eqiad/codfw, each machine has 12x 1.9TB = 22.8TB usable, so 273.6TB and 144 disks per data... [15:07:15] Urbanecm: you can only merge the patch on the very tippy-top. So there's almost always a rebase step before a merge if the patch is more than a few minutes old. [15:07:25] Urbanecm: nah, I'm just making sure that I don't end up with merge commits. [15:07:34] yeah, what andrewbogott said. [15:07:34] thcipriani: ping me when you need me to test. I've got some examples ready [15:07:45] jdlrobson: kk, just waiting on Jenkins for now. 
[15:08:02] (03CR) 10Thcipriani: Fix typo in newikibooks namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283183 (https://phabricator.wikimedia.org/T131754) (owner: 10Urbanecm) [15:08:13] (03CR) 10Thcipriani: [C: 032] "SWAT again" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283183 (https://phabricator.wikimedia.org/T131754) (owner: 10Urbanecm) [15:08:38] (03Merged) 10jenkins-bot: Fix typo in newikibooks namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283183 (https://phabricator.wikimedia.org/T131754) (owner: 10Urbanecm) [15:08:44] ^ also in this instance this patch needed a rebase before it could merge (which I forgot to do before I +2'd) :\ [15:09:59] Andrewbogott and Thcipriani: So instead of git review -R I should run git review? So shouldn't we update https://www.mediawiki.org/wiki/Gerrit/Tutorial ? (If I'm asking in wrong time because we're SWATing, please tell it to me and I'll ask after SWAT) [15:10:25] andrewbogott: thank you so much for that librenms ldap change [15:10:36] (03PS2) 10BBlack: use https://parsoid-tests in testreduce T132462 [puppet] - 10https://gerrit.wikimedia.org/r/283088 [15:10:45] (03CR) 10BBlack: [C: 032 V: 032] use https://parsoid-tests in testreduce T132462 [puppet] - 10https://gerrit.wikimedia.org/r/283088 (owner: 10BBlack) [15:10:45] oh, I lost my dashboard [15:10:47] oh well [15:10:53] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Fix typo in newikibooks namespaces [[gerrit:283183]] (duration: 00m 30s) [15:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:10] ^ Urbanecm check please [15:11:38] Urbanecm: FWIW all your patches have seemed fine to me in terms of how they are submitted. [15:11:40] 06Operations, 10Analytics-Cluster, 10Traffic, 07HTTPS: HTTPS redirects for stats.wikimedia.org - https://phabricator.wikimedia.org/T132465#2203508 (10Ottomata) +1 should be fine to do. [15:11:52] 06Operations, 10Parsoid, 10Traffic, 07HTTPS, 13Patch-For-Review: HTTPS redirects for parsoid-tests.wikimedia.org - https://phabricator.wikimedia.org/T132462#2203509 (10BBlack) Brief discussion on mediawiki-parsoid IRC channel seems to indicate this is low risk, so going for it. [15:12:50] It seems ok. Thanks. [15:13:56] Urbanecm: thanks for checking! [15:15:35] (03CR) 10Mobrovac: "Still looking good - https://puppet-compiler.wmflabs.org/2426/" [puppet] - 10https://gerrit.wikimedia.org/r/279280 (https://phabricator.wikimedia.org/T130371) (owner: 10Mobrovac) [15:15:43] Thcipriani: So the pushing. I have one commit in branch in my local repo and I worked on it for one hour. Should I run git review or rebase them on master and then run git review -R? I think that these commands are same, so I can run only git review. Am I right? [15:16:55] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/263631 (owner: 10Chad) [15:17:06] thcipriani: just waiting on gate-and-submits... Since u said near the end of the window, I didn't rush too much, sorry [15:17:10] 06Operations, 10Analytics, 10hardware-requests, 13Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#2203525 (10Eevans) [15:17:59] 06Operations, 10Datasets-General-or-Unknown, 10Traffic, 07HTTPS, 13Patch-For-Review: HTTPS redirects for datasets.wikimedia.org - https://phabricator.wikimedia.org/T132463#2203529 (10Halfak) +1 for HTTPS only being OK. 
[15:18:43] 06Operations, 10Traffic, 07HTTPS: HTTPS redirects for transparency.wikimedia.org - https://phabricator.wikimedia.org/T132464#2203530 (10Chmarkine) Redirect to https should be fine, since we enabled HSTS for transparency.wikimedia.org in May 2015.[1] But was there any reason that the redirect was dropped? [1... [15:18:52] (03PS1) 10Physikerwelt: Make MathML rendering default in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283199 (https://phabricator.wikimedia.org/T104550) [15:18:58] Hmmm looks like the queue is not so fast this morning! [15:19:00] (03CR) 10Ottomata: [V: 031] "COOOOL! Marko, let's apply this together this week, ja?" [puppet] - 10https://gerrit.wikimedia.org/r/279280 (https://phabricator.wikimedia.org/T130371) (owner: 10Mobrovac) [15:19:14] mobrovac: maybe my morning tomorrow? [15:19:30] we can apply in in beta first and run puppet in a few places to make sure it does what we think, and then fully merge it? [15:19:31] Urbanecm: I don't use git-review too much. FWIW, I think it's probably going to be easier to do a manual rebase and then do git review -R. I'm not sure what happens if your rebase fails using git review: it's possible that it's the same thing that would happen if your rebase failed running git rebase, but I'm not 100% on that. [15:19:54] ottomata: sounds good! [15:19:55] i.e. the error message may be more opaque thanks to git-review [15:20:13] PROBLEM - Hadoop NodeManager on analytics1055 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:20:20] Thanks. So no more manual rebases by SWATters, I'll remember it :) [15:20:30] thcipriani: do you mind if I merge on puppet? [15:20:37] yeah seriously andrewbogott thanks for the librenms/ldap work! works like a charm [15:21:05] (03PS4) 10Volans: MariaDB: use Puppet cert for s1 cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/283192 (https://phabricator.wikimedia.org/T111654) [15:21:33] !log thcipriani@tin Synchronized php-1.27.0-wmf.21/extensions/WikidataPageBanner: SWAT: Attempt at fixing table of contents problem [[gerrit:282995]] (duration: 00m 29s) [15:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:21:58] ^ jdlrobson check please: .21 only right now. [15:22:05] (group0 wikis) [15:22:13] RECOVERY - Hadoop NodeManager on analytics1055 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:23:20] thcipriani: WikidataPageBanner is only on .20 right now [15:23:52] so i can't verify there until Wikivoyage is on .21 :/ [15:24:11] jdlrobson: kk, I'll go forward with .20 [15:25:39] godog: cool [15:26:45] well...I would move forward with .20. Jenkins is sure taking its time. [15:27:33] PROBLEM - HHVM rendering on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:28:14] (03CR) 10Volans: [C: 032] MariaDB: use Puppet cert for s1 cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/283192 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [15:28:16] 06Operations, 10Traffic, 07HTTPS: HTTPS redirects for transparency.wikimedia.org - https://phabricator.wikimedia.org/T132464#2203583 (10BBlack) @Chmarkine: not sure - those changes are still in puppet, and I've confirmed the backend server for it today (bromine) still has that config deployed as well. But i... 
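Spelled out, the manual-rebase-then-push workflow suggested to Urbanecm above looks roughly like this (a sketch; remote and branch names are the usual Gerrit defaults and may differ per checkout):
```
git fetch origin                # pick up the current tip from Gerrit
git rebase origin/master        # replay the local commit on top of it
# resolve any conflicts, `git add` the fixes, then `git rebase --continue`
git review -R                   # -R pushes for review without letting git-review rebase again
```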
[15:28:23] PROBLEM - Apache HTTP on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:28:43] PROBLEM - DPKG on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:28:43] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:28:44] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:28:44] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:28:51] (03PS1) 10Giuseppe Lavagetto: scap: use conftool data to populate dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/283201 (https://phabricator.wikimedia.org/T132529) [15:29:02] PROBLEM - dhclient process on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:29:03] PROBLEM - Disk space on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:29:13] PROBLEM - RAID on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:29:33] PROBLEM - SSH on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:29:43] PROBLEM - nutcracker process on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:29:53] PROBLEM - configured eth on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:30:12] PROBLEM - salt-minion processes on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:30:40] (03CR) 10Mobrovac: [C: 031] Make MathML rendering default in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283199 (https://phabricator.wikimedia.org/T104550) (owner: 10Physikerwelt) [15:30:42] PROBLEM - puppet last run on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:30:42] PROBLEM - Check size of conntrack table on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:30:43] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [15:30:43] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [15:30:43] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [15:31:06] (03PS1) 10BBlack: parsoid-tests.wm.o HTTPS redirect T132462 [puppet] - 10https://gerrit.wikimedia.org/r/283202 [15:31:31] (03CR) 10BBlack: [C: 032 V: 032] parsoid-tests.wm.o HTTPS redirect T132462 [puppet] - 10https://gerrit.wikimedia.org/r/283202 (owner: 10BBlack) [15:31:50] hm, looks like mw1139 has had some historical issues. Is anyone messing with it just now? [15:32:03] PROBLEM - HHVM processes on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:32:09] 06Operations, 10Traffic, 07HTTPS: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#2203620 (10BBlack) [15:32:12] 06Operations, 10Parsoid, 10Traffic, 07HTTPS, 13Patch-For-Review: HTTPS redirects for parsoid-tests.wikimedia.org - https://phabricator.wikimedia.org/T132462#2203618 (10BBlack) 05Open>03Resolved a:03BBlack [15:32:23] PROBLEM - nutcracker port on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:33:07] moritzm: do you know anything about 1139 other than what's in the SAL from a couple of weeks ago? [15:33:15] thcipriani: I think zuul is too slow today to get the CN patches ready within this window [15:33:53] AndyRussG: I just stopped a job that was hung at 16% for the past 30 mins :\ Queue should start to clear, hopefully. 
[15:33:57] andrewbogott: I don't eben remember what I put into SAL a few week ago, let me check [15:34:08] Ahh K thx! :) [15:34:10] (03PS1) 10BBlack: transparency.wm.o HTTPS redirect T132464 [puppet] - 10https://gerrit.wikimedia.org/r/283203 [15:34:12] (03PS1) 10BBlack: stats.wm.o HTTPS redirect T132465 [puppet] - 10https://gerrit.wikimedia.org/r/283204 [15:34:37] unfortunately it was a patch for SWAT :\ [15:34:37] (03CR) 10BBlack: [C: 032 V: 032] transparency.wm.o HTTPS redirect T132464 [puppet] - 10https://gerrit.wikimedia.org/r/283203 (owner: 10BBlack) [15:34:49] (03CR) 10BBlack: [C: 032 V: 032] stats.wm.o HTTPS redirect T132465 [puppet] - 10https://gerrit.wikimedia.org/r/283204 (owner: 10BBlack) [15:35:03] RECOVERY - Disk space on mw1139 is OK: DISK OK [15:35:03] andrewbogott: that a few weeks ago was just one of the occasional hhvm lockups, this here is different [15:35:09] ok [15:35:13] RECOVERY - RAID on mw1139 is OK: OK: no RAID installed [15:35:23] RECOVERY - SSH on mw1139 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [15:35:30] 06Operations, 10Traffic, 07HTTPS: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#2203640 (10BBlack) [15:35:42] RECOVERY - nutcracker process on mw1139 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [15:35:42] RECOVERY - configured eth on mw1139 is OK: OK - interfaces up [15:35:43] 06Operations, 10Traffic, 07HTTPS: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#1402411 (10BBlack) [15:35:43] RECOVERY - HHVM processes on mw1139 is OK: PROCS OK: 6 processes with command name hhvm [15:35:45] 06Operations, 10Analytics-Cluster, 10Traffic, 07HTTPS, 13Patch-For-Review: HTTPS redirects for stats.wikimedia.org - https://phabricator.wikimedia.org/T132465#2203641 (10BBlack) 05Open>03Resolved a:03BBlack [15:35:48] oh, it's back! And it didn't reboot... [15:35:54] RECOVERY - salt-minion processes on mw1139 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:36:03] RECOVERY - nutcracker port on mw1139 is OK: TCP OK - 0.000 second response time on port 11212 [15:36:12] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:36:23] RECOVERY - Check size of conntrack table on mw1139 is OK: OK: nf_conntrack is 0 % full [15:36:33] RECOVERY - DPKG on mw1139 is OK: All packages OK [15:36:45] RECOVERY - dhclient process on mw1139 is OK: PROCS OK: 0 processes with command name dhclient [15:37:31] moritzm: it was oom. What do you think, should I reboot it just to make a full recovery from all the services killed by oom-killer? [15:38:39] hm, it still can't fork [15:39:24] thcipriani: this is the patch for 20: https://gerrit.wikimedia.org/r/#/c/283205/ [15:39:58] bblack: re: redirect for transparency.wm seems to me like it already redirects, unless you just enabled it in varnish? 
[15:40:20] < HTTP/1.1 301 TLS Redirect [15:40:22] ACKNOWLEDGEMENT - Apache HTTP on mw1139 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50394 bytes in 0.039 second response time andrew bogott OOM -- I will reboot and investigate when I have a chance [15:40:22] ACKNOWLEDGEMENT - HHVM rendering on mw1139 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50394 bytes in 0.068 second response time andrew bogott OOM -- I will reboot and investigate when I have a chance [15:40:23] ACKNOWLEDGEMENT - puppet last run on mw1139 is CRITICAL: CRITICAL: puppet fail andrew bogott OOM -- I will reboot and investigate when I have a chance [15:40:28] < Location: https://transparency.wikimedia.org/ [15:40:36] * andrewbogott needs to step away but will look at 1139 as soon as able [15:41:59] AndyRussG: anything requiring a full scap in this update? [15:41:59] mutante: I just did, yes [15:42:15] mutante: something's broken about the apache redirect there, I didn't debug it, but it wasn't redirecting before [15:42:15] ah, and i just saw this: < Server: Varnish [15:42:19] andrewbogott: yeah, reboot won't hurt, whatever went wrong there [15:42:28] hmm, ok [15:42:34] thcipriani: not really. I'm pulling in changes from translatewiki, but it's OK if those only get updated when the train goes thru, no? [15:42:49] (03PS1) 10BBlack: datasets.wm.o HTTPS redirect T132463 [puppet] - 10https://gerrit.wikimedia.org/r/283207 [15:43:01] thcipriani: I'm also fine leaving this for another SWAT [15:43:08] (03CR) 10BBlack: [C: 032 V: 032] datasets.wm.o HTTPS redirect T132463 [puppet] - 10https://gerrit.wikimedia.org/r/283207 (owner: 10BBlack) [15:43:11] thcipriani: did we hit some issues on the wikivoyage patch? [15:43:53] RECOVERY - DPKG on labmon1001 is OK: All packages OK [15:43:55] jdlrobson: yeah. for some reason one of the tests hung at 16% for 30 mins or so. I had to kill it, rejected the patch. Flailed around a bit getting it resubmitted. [15:43:57] (03PS2) 10Bartosz Dziewoński: Enable $wgAbuseFilterProfile for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282806 (https://phabricator.wikimedia.org/T132200) [15:44:01] 06Operations, 10Traffic, 07HTTPS: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#2203658 (10BBlack) [15:44:03] 06Operations, 10Datasets-General-or-Unknown, 10Traffic, 07HTTPS, 13Patch-For-Review: HTTPS redirects for datasets.wikimedia.org - https://phabricator.wikimedia.org/T132463#2203656 (10BBlack) 05Open>03Resolved a:03BBlack [15:44:36] owch. fingers crossed it works this time :) [15:45:09] (03PS1) 10Volans: Repool db1057 after TLS upgrade on s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283209 (https://phabricator.wikimedia.org/T111654) [15:46:15] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2203677 (10Ottomata) [15:46:31] AndyRussG: let's push this to another SWAT if you don't mind. Jenkins queue is a bit backed up + a full scap seems like it'd run over by quite a bit. [15:46:41] thcipriani: yeah! [15:46:49] AndyRussG: thanks. [15:47:00] thcipriani: likewise! [15:47:51] thcipriani: quick question: since I +2'd the change for .21, does that mean it'll get synced with the train in a few minutes? 
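The redirect check pasted above can be repeated with a plain HEAD request; if the 301 is served by the cache layer rather than the backend Apache, the Server header will say Varnish, as noted:
```
curl -sI http://transparency.wikimedia.org/ | grep -Ei '^(HTTP|Server|Location)'
# expected per the paste above: HTTP/1.1 301 TLS Redirect, Server: Varnish,
# Location: https://transparency.wikimedia.org/
```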
[15:48:57] if so maybe we could quickly push it out to mw1017 [15:49:21] nothing magically pulls to tin [15:49:37] +2 on branches should really only be done by a deployer in the course of deploying [15:49:52] bd808: yes hmm well it was going to get deployed [15:49:56] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2203682 (10Nuria) [15:49:59] Sorry I should have held off tho [15:50:14] bd808: Could still cancel the gate-and-submit I think [15:50:20] sorry not screaming, just correcting gently :) [15:50:27] (03CR) 10Ori.livneh: [C: 031] "for a few days, sure" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282806 (https://phabricator.wikimedia.org/T132200) (owner: 10Bartosz Dziewoński) [15:50:43] AndyRussG: it's np. Cancelling now would requeue everything in zuul so don't do that. [15:50:51] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2197243 (10Nuria) [15:50:51] K [15:50:52] well, everything behind it. [15:51:22] fwiw it doesn't need a scap (just re-checked, no new i18n keys) [15:51:40] (03PS1) 10BBlack: HTTPS for graphite monitor URLs T132461 [puppet] - 10https://gerrit.wikimedia.org/r/283210 [15:52:43] thcipriani: hope this setback hasn't impacted the Wikivoyage patch going out today? [15:53:04] jdlrobson: going in 1 second. [15:53:11] phew :) [15:53:20] AndyRussG: I can get .21 out for you here in a minute. [15:53:41] thcipriani: K that would be great! thx much :) [15:54:44] AndyRussG: er, wait, did the .20 patch merge, too (just fetched it down) [15:54:49] yep [15:55:20] thcipriani: do you want to do mw1017? It was suggested that we smoke test there first since we're moving about some modules, but IMHO it's quite unlikely to be problematic [15:56:13] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:56:14] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:56:14] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:56:33] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:56:53] !log thcipriani@tin Synchronized php-1.27.0-wmf.20/extensions/WikidataPageBanner: SWAT: Attempt at fixing table of contents problem [[gerrit:282994]] (duration: 00m 28s) [15:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:57:00] ^ jdlrobson check please [15:57:59] (03CR) 10BBlack: [C: 032] "I've manually tested the effect on neon by looking at the check_graphite -related commands it runs, and re-running them manually with s/ht" [puppet] - 10https://gerrit.wikimedia.org/r/283210 (owner: 10BBlack) [15:58:12] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [15:58:12] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [15:58:12] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [15:58:23] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [15:58:33] thcipriani: on it! [16:02:09] thcipriani: looks fine thanks [16:02:17] jdlrobson: cool, thanks for checking. [16:02:31] AndyRussG: kk, I'll sync down to mw1017. [16:02:41] thcipriani: cool beans! thx :) [16:03:19] AndyRussG: kk, says it's done, give it a try. 
[16:03:23] (03PS1) 10BBlack: graphite.wm.o HTTPS redirect T132461 [puppet] - 10https://gerrit.wikimedia.org/r/283214 [16:05:39] 06Operations, 10Analytics-Cluster, 10hardware-requests, 10netops: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2203753 (10Ottomata) [16:06:35] thcipriani: loosk fine! [16:07:13] AndyRussG: kk, I'll run a sync-dir for CentralNotice on .21 then .20 if that plan sounds fine with you. [16:07:26] thcipriani: yeah that'd be amazing :) thanks so much! [16:07:27] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2197243 (10BBlack) If we're doing this in production, the frontend should probably be through cache_misc. I'm not sure what the backend looks like at all role/software-wise... [16:10:11] !log thcipriani@tin Synchronized php-1.27.0-wmf.21/extensions/CentralNotice: SWAT: Update CentralNotice [[gerrit:283206]] (duration: 00m 33s) [16:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:42] !log thcipriani@tin Synchronized php-1.27.0-wmf.20/extensions/CentralNotice: SWAT: Update CentralNotice [[gerrit:283205]] (duration: 00m 30s) [16:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:48] ^ AndyRussG check please [16:12:50] thcipriani: thx! [16:14:48] 06Operations, 10Traffic, 07Graphite, 07HTTPS, 13Patch-For-Review: HTTPS redirects for graphite.wikimedia.org - https://phabricator.wikimedia.org/T132461#2203793 (10BBlack) With the check_graphite stuff switched to HTTPS, so far neon doesn't seem to be suffering from any significant increase in overall CP... [16:15:10] (03CR) 10Filippo Giunchedi: [C: 04-1] "to be merged once the full switchover is in place (cfr. https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Media_storage.2FSwift )" [puppet] - 10https://gerrit.wikimedia.org/r/268080 (https://phabricator.wikimedia.org/T91869) (owner: 10Filippo Giunchedi) [16:17:51] thcipriani: looking good! :) [16:18:08] AndyRussG: cool. Thanks for checking. [16:18:13] !log rebooting mw1139 — OOM [16:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:18:33] thcipriani: likewise thx mcuh \o/ [16:18:35] andrewbogott: thanks a lot for librenms! [16:19:37] AndyRussG: yw. weird jenkins problem actually made things work out ok. [16:20:51] heh ... how so? [16:21:32] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 5.729 second response time [16:21:59] AndyRussG: just that your patches ended up landing before one of the other SWAT patches because of the jenkins issue. [16:22:14] Ahh heh right 8p [16:22:54] BTW I'm gonna update the deployments page just for the record... [16:23:30] ty! Meant to ask you to do that during the course of SWAT. [16:24:03] RECOVERY - HHVM rendering on mw1139 is OK: HTTP OK: HTTP/1.1 200 OK - 67187 bytes in 0.797 second response time [16:25:23] RECOVERY - puppet last run on mw1139 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:28:13] thcipriani: is the SWAT over? (I need to repool a DB) [16:28:50] volans: yes it is. sorry I missed your message earlier. [16:29:04] no problem, thanks [16:29:06] done! 
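The two Synchronized entries above correspond roughly to running sync-dir from the MediaWiki staging area on tin; the paths and messages are taken from the !log lines, while the staging directory (/srv/mediawiki-staging) and exact invocation are the usual ones and may differ:
```
cd /srv/mediawiki-staging
sync-dir php-1.27.0-wmf.21/extensions/CentralNotice 'SWAT: Update CentralNotice [[gerrit:283206]]'
sync-dir php-1.27.0-wmf.20/extensions/CentralNotice 'SWAT: Update CentralNotice [[gerrit:283205]]'
```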
[16:30:38] (03CR) 10Volans: [C: 032] Repool db1057 after TLS upgrade on s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283209 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [16:31:05] (03Merged) 10jenkins-bot: Repool db1057 after TLS upgrade on s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283209 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [16:32:14] !log volans@tin Synchronized wmf-config/db-eqiad.php: Repool db1057 after TLS upgrade on s1 - T111654 (duration: 00m 26s) [16:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:26] 06Operations, 10ops-codfw, 06DC-Ops: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2203876 (10mark) p:05Normal>03Unbreak! a:05jcrespo>03RobH @RobH: please buy 4 appropriate disks today, fastest delivery. Hereby approved. [16:35:28] !log rebuilding raid1 array on aqs1001 after hot swapping sdh [16:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:37:15] 06Operations, 10ops-eqiad, 06DC-Ops: Broken disk on aqs1001.eqiad.wmnet - https://phabricator.wikimedia.org/T130816#2203889 (10Ottomata) @Cmjohnson has swapped the disk. Faidon helped get the device to show by doing ``` megacli -CfgForeign -Scan -a0 There are 1 foreign configuration(s) on controller 0. ..... [16:38:57] 06Operations, 10MediaWiki-Parser, 06Parsing-Team, 10Traffic, and 4 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2203890 (10Jdlrobson) @Atsirlin can you make an update the pagebanner template to force a cache flush for the pages that use the... [16:39:14] Reedy: is read-only access adequate for your librenms needs? [16:39:53] andrewbogott: I'd presume so... I've no idea what benefit write access would actually provide [16:40:02] me neither [16:40:33] Presumably it's just configuration [16:43:03] 06Operations, 10MediaWiki-Parser, 06Parsing-Team, 10Traffic, and 4 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2203893 (10Wrh2) >>! In T121135#2203890, @Jdlrobson wrote: > @Atsirlin can you make an update the pagebanner template to force a... [16:43:44] (03CR) 10BBlack: [C: 032] graphite.wm.o HTTPS redirect T132461 [puppet] - 10https://gerrit.wikimedia.org/r/283214 (owner: 10BBlack) [16:43:50] chasemp: is the ldap 'ops' group actually populated from the ops stanza in admin/data/data.yaml? Or are the two lists maintained separately? [16:44:13] the latter iirc [16:44:41] 06Operations, 10Traffic, 07HTTPS: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#2203911 (10BBlack) [16:44:43] 06Operations, 10Traffic, 07Graphite, 07HTTPS, 13Patch-For-Review: HTTPS redirects for graphite.wikimedia.org - https://phabricator.wikimedia.org/T132461#2203909 (10BBlack) 05Open>03Resolved a:03BBlack [16:44:50] godog: is it somehow poor form for me to create a new ldap group without adding a corresponding section in puppet? (The puppet bits wouldn't do anything) [16:45:32] (03CR) 10Ori.livneh: [C: 032] Make MathML rendering default in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283199 (https://phabricator.wikimedia.org/T104550) (owner: 10Physikerwelt) [16:45:48] andrewbogott: I don't think so but I'm not sure tbh, what would be the group? 
iirc being added to ldap and admin in puppet happens during onboarding [16:46:13] godog: 'librenms-readers' [16:46:32] 06Operations, 10ops-eqiad, 06DC-Ops: Broken disk on aqs1001.eqiad.wmnet - https://phabricator.wikimedia.org/T130816#2203912 (10faidon) Just for the record, after clearing the foreign config, `megacli -PDMakeJBOD -PhysDrv\[32:7\] -a0` was also needed. [16:49:13] (03PS2) 10Ori.livneh: Make MathML rendering default in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283199 (https://phabricator.wikimedia.org/T104550) (owner: 10Physikerwelt) [16:49:56] andrewbogott: seems fine to me, and an update to https://wikitech.wikimedia.org/wiki/LDAP_Groups [16:51:00] godog: done, thanks [16:54:20] !log completed TLS upgrade for s1 - T111654 [16:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:02] (03PS1) 10Andrew Bogott: Give librenms access to members of ldap group librenms-readers [puppet] - 10https://gerrit.wikimedia.org/r/283221 (https://phabricator.wikimedia.org/T131252) [16:55:21] (03CR) 10Ori.livneh: [C: 032] "hellooooooo jenkins" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283199 (https://phabricator.wikimedia.org/T104550) (owner: 10Physikerwelt) [16:55:46] (03Merged) 10jenkins-bot: Make MathML rendering default in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283199 (https://phabricator.wikimedia.org/T104550) (owner: 10Physikerwelt) [16:58:20] (03PS7) 10Gehel: Add caching headers for nginx [puppet] - 10https://gerrit.wikimedia.org/r/274864 (https://phabricator.wikimedia.org/T126730) (owner: 10Smalyshev) [16:58:29] !log ori@tin Synchronized wmf-config/CommonSettings-labs.php: I5a0abcdc: Make MathML rendering default in labs (duration: 00m 39s) [16:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:59:13] 06Operations, 10ops-codfw, 06DC-Ops: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2203957 (10RobH) [17:02:00] (03PS3) 10Ori.livneh: Add apache::mod::security [puppet] - 10https://gerrit.wikimedia.org/r/278318 (https://phabricator.wikimedia.org/T132599) [17:03:00] (03PS1) 10Dzahn: install_server: let bast1001 use jessie [puppet] - 10https://gerrit.wikimedia.org/r/283224 (https://phabricator.wikimedia.org/T123721) [17:04:08] (03PS2) 10Dzahn: install_server: let bast1001 use jessie [puppet] - 10https://gerrit.wikimedia.org/r/283224 (https://phabricator.wikimedia.org/T123721) [17:04:19] (03CR) 10Dzahn: [C: 032 V: 032] install_server: let bast1001 use jessie [puppet] - 10https://gerrit.wikimedia.org/r/283224 (https://phabricator.wikimedia.org/T123721) (owner: 10Dzahn) [17:14:07] (03PS1) 10Ori.livneh: App servers: make Server response header equal to fqdn (mw1017 only) [puppet] - 10https://gerrit.wikimedia.org/r/283226 (https://phabricator.wikimedia.org/T132599) [17:14:19] !log lvs4003 going offline for maint (icinga has been silenced, i think ;) [17:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:14:33] (03PS4) 10Ori.livneh: Add apache::mod::security [puppet] - 10https://gerrit.wikimedia.org/r/278318 (https://phabricator.wikimedia.org/T132599) [17:14:43] (03CR) 10Ori.livneh: [C: 032 V: 032] Add apache::mod::security [puppet] - 10https://gerrit.wikimedia.org/r/278318 (https://phabricator.wikimedia.org/T132599) (owner: 10Ori.livneh) [17:15:08] !log rebooting bast1001 into PXE [17:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:15:18] (03CR) 
10jenkins-bot: [V: 04-1] App servers: make Server response header equal to fqdn (mw1017 only) [puppet] - 10https://gerrit.wikimedia.org/r/283226 (https://phabricator.wikimedia.org/T132599) (owner: 10Ori.livneh) [17:16:12] (03PS2) 10Ori.livneh: App servers: make Server response header equal to fqdn (mw1017 only) [puppet] - 10https://gerrit.wikimedia.org/r/283226 (https://phabricator.wikimedia.org/T132599) [17:16:32] (03CR) 10Ori.livneh: [C: 032 V: 032] App servers: make Server response header equal to fqdn (mw1017 only) [puppet] - 10https://gerrit.wikimedia.org/r/283226 (https://phabricator.wikimedia.org/T132599) (owner: 10Ori.livneh) [17:16:58] (03PS1) 10Alex Monk: deployment-prep: keyholder shinken monitoring [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) [17:20:05] andrewbogott they are separately managed [17:20:26] (In meetings fyi) [17:22:28] (03PS1) 10Ori.livneh: debug_proxy: don't clobber server header from backend [puppet] - 10https://gerrit.wikimedia.org/r/283229 [17:22:40] (03CR) 10Ori.livneh: [C: 032 V: 032] debug_proxy: don't clobber server header from backend [puppet] - 10https://gerrit.wikimedia.org/r/283229 (owner: 10Ori.livneh) [17:24:32] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 72.41% of data above the critical threshold [5000000.0] [17:25:23] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [5000000.0] [17:26:16] (03PS1) 10Ori.livneh: App servers: make Server response header equal to fqdn [puppet] - 10https://gerrit.wikimedia.org/r/283230 (https://phabricator.wikimedia.org/T132599) [17:26:55] (03CR) 10Ori.livneh: [C: 032 V: 032] App servers: make Server response header equal to fqdn [puppet] - 10https://gerrit.wikimedia.org/r/283230 (https://phabricator.wikimedia.org/T132599) (owner: 10Ori.livneh) [17:26:58] !log bast1001 - revoke puppet cert, delete salt key, reinstall with jessie [17:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:30:51] PROBLEM - Host lvs4003 is DOWN: PING CRITICAL - Packet loss = 100% [17:31:13] (03PS1) 10Dzahn: bast1001: rsync home dirs back from tungsten [puppet] - 10https://gerrit.wikimedia.org/r/283231 (https://phabricator.wikimedia.org/T123721) [17:32:28] (03PS2) 10Dzahn: bast1001: rsync home dirs back from tungsten [puppet] - 10https://gerrit.wikimedia.org/r/283231 (https://phabricator.wikimedia.org/T123721) [17:32:48] (03CR) 10Dzahn: [C: 032] bast1001: rsync home dirs back from tungsten [puppet] - 10https://gerrit.wikimedia.org/r/283231 (https://phabricator.wikimedia.org/T123721) (owner: 10Dzahn) [17:33:20] !log mwscript deleteEqualMessages.php --wiki cywiki (T45917) [17:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:34:07] 06Operations, 10Gitblit, 06Release-Engineering-Team, 10Traffic, 07HTTPS: HTTPS redirects for git.wikimedia.org - https://phabricator.wikimedia.org/T132460#2204120 (10BBlack) In lieu of anything more-concrete to go on, I've been monitoring the varnishlog live request flow for git.wikimedia.org this mornin... [17:41:31] I was connected to stat1002 and then got kicked out. Trying to SSH in again but getting the warning that remote host identification for bast1001.wikimedia.org has changed. 
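Pieced together from the T130816 comments above, the sequence for getting the hot-swapped aqs1001 drive visible again looks roughly like this; the enclosure:slot value 32:7 is the one quoted there, and the -CfgForeign -Clear step in the middle is an inference:
```
megacli -CfgForeign -Scan -a0               # the replacement drive arrived with a foreign config
megacli -CfgForeign -Clear -a0              # drop that config so the drive can be reused
megacli -PDMakeJBOD -PhysDrv '[32:7]' -a0   # expose the physical drive as JBOD again
```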
[17:41:38] !log tungsten stop and remove rsync package and config [17:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:41:55] (03PS1) 10BBlack: git.wm.o HTTPS redirect T132460 [puppet] - 10https://gerrit.wikimedia.org/r/283233 [17:42:04] bearloga: it's being reinstalled right now, in scheduled maintenance [17:42:13] bearloga: will be back soon and until then you can use bast2001 or 3001 [17:43:21] mutante: phew! okay, thanks! Was worried something funky could be going on. Will I need to remove its entry from known_hosts and re-auth after it's back up? [17:43:59] (03CR) 10BBlack: [C: 032] git.wm.o HTTPS redirect T132460 [puppet] - 10https://gerrit.wikimedia.org/r/283233 (owner: 10BBlack) [17:44:02] RECOVERY - Host lvs4003 is UP: PING OK - Packet loss = 0%, RTA = 83.42 ms [17:44:06] bearloga: yes, i will give you a link with the host keys and send it on list.. maybe grab a coffee and you can use it again [17:44:15] bearloga: just copying back some data .. [17:45:03] bearloga: or feel free to change your ssh config and just replace bast1001 with bast2001 and you'd be fine [17:45:15] mutante: thanks! [17:45:19] 06Operations, 10media-storage: Unable to delete, restore/undelete, move or upload new versions of files on several wikis ("inconsistent state within the internal storage backends") - https://phabricator.wikimedia.org/T128096#2204146 (10aaron) I'm seeing very few sync errors in the logs lately. [17:46:01] !log lvs4003 rebooted and back online, lvs4004 offlining for maint. [17:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:48:03] 06Operations, 10RESTBase-Cassandra: restbase1007 not assembling raid after reboot - https://phabricator.wikimedia.org/T130930#2204150 (10fgiunchedi) [17:48:05] 06Operations: Boot time race condition when assembling root raid device on cp1052 - https://phabricator.wikimedia.org/T131961#2204152 (10fgiunchedi) [17:48:14] (03PS1) 10BBlack: git.wm.o HTTPS redirect - part 2 - T132460 [puppet] - 10https://gerrit.wikimedia.org/r/283234 [17:49:51] (03CR) 10BBlack: [C: 032] git.wm.o HTTPS redirect - part 2 - T132460 [puppet] - 10https://gerrit.wikimedia.org/r/283234 (owner: 10BBlack) [17:50:21] 06Operations: Boot time race condition when assembling root raid device on cp1052 - https://phabricator.wikimedia.org/T131961#2204165 (10fgiunchedi) came across the same problem in {T130930}, could be a jessie-specific issue as I don't remember seeing the same on trusty/precise [17:52:58] 06Operations, 10Gitblit, 06Release-Engineering-Team, 10Traffic, and 2 others: HTTPS redirects for git.wikimedia.org - https://phabricator.wikimedia.org/T132460#2204187 (10BBlack) 05Open>03Resolved a:03BBlack Resolving for now, although I suspect this is the most likely of the bunch to trigger some ki... 
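For anyone else hitting the host-key warning bearloga describes above after the reinstall, the usual cleanup is to drop the stale entry and compare the new key against the fingerprints being published on the task, roughly:
```
ssh-keygen -R bast1001.wikimedia.org                  # remove the old entry from ~/.ssh/known_hosts
ssh-keyscan bast1001.wikimedia.org > /tmp/bast1001.keys
ssh-keygen -lf /tmp/bast1001.keys                     # check these fingerprints against the published ones
```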
[17:53:01] 06Operations, 10Traffic, 07HTTPS: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#2204190 (10BBlack) [17:54:49] (03PS3) 10Muehlenhoff: Don't use package-> latest for apt-transport-https [puppet] - 10https://gerrit.wikimedia.org/r/282941 [17:55:01] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 56.67% of data above the critical threshold [5000000.0] [17:56:09] (03PS1) 10Aaron Schulz: Set descriptionCacheExpiry for Commons repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283239 [17:56:11] (03CR) 10Bartosz Dziewoński: Enable $wgAbuseFilterProfile for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282806 (https://phabricator.wikimedia.org/T132200) (owner: 10Bartosz Dziewoński) [17:56:17] (03PS3) 10Bartosz Dziewoński: Enable $wgAbuseFilterProfile for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282806 (https://phabricator.wikimedia.org/T132200) [17:57:23] 06Operations, 13Patch-For-Review: Reinstall bast1001 with jessie - https://phabricator.wikimedia.org/T123721#2204201 (10Dzahn) bast1001 is back with jessie. data from home dirs is being copied back as i type the new fingerprints are: ``` +---------+---------+------------------------------------------------... [17:57:41] (03PS1) 10Bartosz Dziewoński: Disable $wgAbuseFilterProfile for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283243 (https://phabricator.wikimedia.org/T132200) [17:57:59] bearloga: should work again https://phabricator.wikimedia.org/T123721#2204201 [17:58:17] mutante: awesome, thanks! [18:00:02] 06Operations: librsvg path patch needs to be applied for jessie - https://phabricator.wikimedia.org/T132584#2204210 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [18:00:46] !log disable puppet, stop pybal on lvs400[12] (maint shutdown imminent, depooled from DNS since yesterday) [18:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:03:40] !log the cp sysetms in ulsfo will be rebooting into maint mode regularly for the next few hours. I'll be scheduling for each host as I get to them, but not echoing every cp host reboot in SAL [18:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:03:52] !log the cp sysetms in ulsfo will be rebooting into maint mode regularly for the next few hours. 
I'll be scheduling downtime for each host as I get to them, but not echoing every cp host reboot in SAL [18:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:06:06] !log bast1001 back with jessie [18:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:06:51] PROBLEM - pybal on lvs4001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [18:06:52] PROBLEM - PyBal backends health check on lvs4001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [18:07:01] PROBLEM - pybal on lvs4002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [18:08:30] PROBLEM - PyBal backends health check on lvs4002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [18:13:09] 06Operations, 10MediaWiki-Parser, 06Parsing-Team, 10Traffic, and 3 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2204281 (10Jdlrobson) 05Open>03stalled p:05High>03Normal [18:13:29] 06Operations, 10MediaWiki-Parser, 06Parsing-Team, 10Traffic, and 3 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#1870420 (10Jdlrobson) Seems fixed, but we'll check in our next sprint after 2 weeks have passed. [18:13:36] (03PS1) 10Dzahn: bast1001: remove temp rsync for migration [puppet] - 10https://gerrit.wikimedia.org/r/283248 (https://phabricator.wikimedia.org/T123721) [18:13:57] (03PS2) 10Dzahn: bast1001: remove temp rsync for migration [puppet] - 10https://gerrit.wikimedia.org/r/283248 (https://phabricator.wikimedia.org/T123721) [18:14:17] (03CR) 10Dzahn: [C: 032] bast1001: remove temp rsync for migration [puppet] - 10https://gerrit.wikimedia.org/r/283248 (https://phabricator.wikimedia.org/T123721) (owner: 10Dzahn) [18:14:56] 06Operations, 10Gitblit, 06Release-Engineering-Team, 10Traffic, and 2 others: HTTPS redirects for git.wikimedia.org - https://phabricator.wikimedia.org/T132460#2204324 (10BBlack) For the record: the oddball `Mozilla/8.0 (Windows 2008 SP32 + 3patch)` seems to be making it through the HTTPS redirect just fin... 
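The "monitoring the varnishlog live request flow" mentioned in the T132460 comments is, in rough terms, a host-filtered tail of the shared-memory log on a cache node; the form depends on which Varnish major version the node runs (both variants sketched here):
```
varnishlog -c -m 'RxHeader:Host: git.wikimedia.org'      # Varnish 3.x: client-side records, tag/regex match
varnishlog -q 'ReqHeader:Host eq "git.wikimedia.org"'    # Varnish 4.x: VSL query syntax
```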
[18:15:41] (03PS1) 10BBlack: HTTPS redirect for all: 1/3 remove VCL conditional [puppet] - 10https://gerrit.wikimedia.org/r/283249 (https://phabricator.wikimedia.org/T103919) [18:16:01] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [18:16:20] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [18:16:27] !log shutdown lvs400[12] [18:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:16:57] (03CR) 10Ottomata: [C: 031] stats/datasets: remove Apache virtual host stat1001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/283086 (owner: 10Dzahn) [18:18:00] PROBLEM - Host lvs4001 is DOWN: PING CRITICAL - Packet loss = 100% [18:18:00] PROBLEM - Host lvs4002 is DOWN: PING CRITICAL - Packet loss = 100% [18:18:43] (03PS1) 10BBlack: HTTPS redirect for all: 2/3 remove vcl_config settings [puppet] - 10https://gerrit.wikimedia.org/r/283250 (https://phabricator.wikimedia.org/T103919) [18:18:45] (03PS1) 10BBlack: HTTPS redirect for all: 3/3 remove misc custom block [puppet] - 10https://gerrit.wikimedia.org/r/283251 (https://phabricator.wikimedia.org/T103919) [18:19:50] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:19:50] PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:19:51] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 120 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:20:00] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:20:02] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:20:10] PROBLEM - Host cp4003 is DOWN: PING CRITICAL - Packet loss = 100% [18:20:11] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 20 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:20:11] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 120 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:20:12] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 20 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:20:22] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 120 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:20:30] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 120 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:20:31] PROBLEM - IPsec on cp1058 is CRITICAL: Strongswan CRITICAL - ok: 20 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:20:31] PROBLEM - IPsec on cp1045 is CRITICAL: Strongswan CRITICAL - ok: 20 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:20:36] (03CR) 10Aaron Schulz: [C: 04-1] "I'd like https://gerrit.wikimedia.org/r/#/c/283247/ to be dealt with first, unless there is something urgent here." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/282806 (https://phabricator.wikimedia.org/T132200) (owner: 10Bartosz Dziewoński) [18:20:40] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 120 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:20:52] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 120 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:21:36] 06Operations, 10Traffic, 07HTTPS: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#2204346 (10BBlack) [18:22:47] (03PS2) 10BBlack: HTTPS redirect for all: 1/3 remove VCL conditional [puppet] - 10https://gerrit.wikimedia.org/r/283249 (https://phabricator.wikimedia.org/T103919) [18:22:56] (03CR) 10BBlack: [C: 032 V: 032] HTTPS redirect for all: 1/3 remove VCL conditional [puppet] - 10https://gerrit.wikimedia.org/r/283249 (https://phabricator.wikimedia.org/T103919) (owner: 10BBlack) [18:24:01] RECOVERY - Host cp4003 is UP: PING OK - Packet loss = 0%, RTA = 75.25 ms [18:24:51] PROBLEM - Host cp4004 is DOWN: PING CRITICAL - Packet loss = 100% [18:25:11] (03PS1) 10Dzahn: bast1001: reorder includes, rm ganglia_aggregator [puppet] - 10https://gerrit.wikimedia.org/r/283252 [18:25:55] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2204382 (10Dzahn) [18:26:05] (03PS2) 10BBlack: HTTPS redirect for all: 2/3 remove vcl_config settings [puppet] - 10https://gerrit.wikimedia.org/r/283250 (https://phabricator.wikimedia.org/T103919) [18:27:25] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1931698 (10Dzahn) [18:28:20] (03CR) 10ArielGlenn: "This is the right resting place for it now. If we have snpahsots in the future that only run misc crons then we may need to revisit it.Th" [puppet] - 10https://gerrit.wikimedia.org/r/282866 (owner: 10ArielGlenn) [18:28:30] mutante: do bast4001 too? [18:28:41] and don't forget to cleanup SLAACs from network.pp :) [18:28:42] (03PS2) 10ArielGlenn: add debdeploy and admin group configs for new snapshots [puppet] - 10https://gerrit.wikimedia.org/r/282866 [18:29:50] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 36 ESP OK [18:29:51] RECOVERY - Host cp4004 is UP: PING OK - Packet loss = 0%, RTA = 74.64 ms [18:29:51] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 24 ESP OK [18:29:52] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 124 ESP OK [18:29:52] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 24 ESP OK [18:30:09] paravoid: 4001 - i need something else in ulsfo to put the install server on i'm afraid and i'm not sure what to use [18:30:10] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 124 ESP OK [18:30:10] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 124 ESP OK [18:30:11] RECOVERY - IPsec on cp1058 is OK: Strongswan OK - 24 ESP OK [18:30:19] just use carbon? [18:30:20] RECOVERY - IPsec on cp1045 is OK: Strongswan OK - 24 ESP OK [18:30:21] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 124 ESP OK [18:30:30] also bast1001 seems to not have been installed properly [18:30:38] there is no raid configured [18:30:39] i tried that with bast2001, installing from carbon wouldnt work [18:30:40] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 124 ESP OK [18:30:40] sdb is unused [18:30:48] why wouldn't it not work? 
[18:31:11] (03CR) 10ArielGlenn: [C: 032] add debdeploy and admin group configs for new snapshots [puppet] - 10https://gerrit.wikimedia.org/r/282866 (owner: 10ArielGlenn) [18:31:31] RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 36 ESP OK [18:31:31] RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 36 ESP OK [18:31:32] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 124 ESP OK [18:31:41] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 36 ESP OK [18:33:42] paravoid: maybe just because of he hardware problems bast2001 had, i'll try it from carbon. re: SLAAC yep, doing that now .. re: RAID awww man.. so it was a manual setup , i changed nothing [18:33:56] (03PS3) 10BBlack: HTTPS redirect for all: 2/3 remove vcl_config settings [puppet] - 10https://gerrit.wikimedia.org/r/283250 (https://phabricator.wikimedia.org/T103919) [18:33:58] might have been misconfigured in the first place [18:34:03] (03CR) 10BBlack: [C: 032 V: 032] HTTPS redirect for all: 2/3 remove vcl_config settings [puppet] - 10https://gerrit.wikimedia.org/r/283250 (https://phabricator.wikimedia.org/T103919) (owner: 10BBlack) [18:34:31] (03PS2) 10BBlack: HTTPS redirect for all: 3/3 remove misc custom block [puppet] - 10https://gerrit.wikimedia.org/r/283251 (https://phabricator.wikimedia.org/T103919) [18:34:40] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [18:34:48] (03CR) 10BBlack: [C: 032 V: 032] HTTPS redirect for all: 3/3 remove misc custom block [puppet] - 10https://gerrit.wikimedia.org/r/283251 (https://phabricator.wikimedia.org/T103919) (owner: 10BBlack) [18:36:10] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [18:36:28] ^ is this from the mw* apache thing? [18:36:56] mutante: in any case... needs a reinstall/fixing :) [18:37:21] does that mean writing a new partman recipe ? sigh [18:37:29] !log activated maintenance page for wdqs1002 (data load in progress) [18:37:29] a new one? [18:37:30] why? [18:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:37:40] we have one for raid1 already [18:38:00] ok [18:39:00] (03PS1) 10Dzahn: network: remove bast1001 SLAAC addresses [puppet] - 10https://gerrit.wikimedia.org/r/283254 [18:39:24] 5xx is still elevated for eqiad-text! [18:39:24] (03PS2) 10Dzahn: network: remove bast1001 SLAAC addresses [puppet] - 10https://gerrit.wikimedia.org/r/283254 [18:39:56] beh, i was using stat1002.eqiad.wmnet earlier today, and now i try to login and it tells me REMOTE HOST IDENTIFICATION HAS CHANGED! [18:40:09] read ops@ [18:40:17] bast1001 was reinstalled [18:41:26] interestingly, the 5xx spike really is only eqiad-text frontends, not e.g. eqiad-esams and such [18:41:38] could be traffic-induced [18:41:43] also mailed wikitech-l, the new fingerprints are here, yurik https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/bast1001.wikimedia.org [18:42:21] RECOVERY - Host lvs4001 is UP: PING OK - Packet loss = 0%, RTA = 76.49 ms [18:42:34] (03PS2) 10Andrew Bogott: Give librenms access to members of ldap group librenms-readers [puppet] - 10https://gerrit.wikimedia.org/r/283221 (https://phabricator.wikimedia.org/T131252) [18:42:35] bblack: isnt that what ori said earlier? 
[18:42:55] except taking longer than he expected [18:43:04] not plausibly related [18:44:33] (03CR) 10Andrew Bogott: [C: 032] Give librenms access to members of ldap group librenms-readers [puppet] - 10https://gerrit.wikimedia.org/r/283221 (https://phabricator.wikimedia.org/T131252) (owner: 10Andrew Bogott) [18:44:41] RECOVERY - Host lvs4002 is UP: PING OK - Packet loss = 0%, RTA = 75.42 ms [18:44:49] yeah ori's thing would've affected all the sites [18:45:45] !log lvs4001 - enable->run puppet post-reboot [18:45:48] (03PS8) 10Gehel: Add caching headers for nginx [puppet] - 10https://gerrit.wikimedia.org/r/274864 (https://phabricator.wikimedia.org/T126730) (owner: 10Smalyshev) [18:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:46:22] !log lvs4002 - enable->run puppet post-reboot [18:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:47:41] RECOVERY - pybal on lvs4001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [18:47:42] RECOVERY - PyBal backends health check on lvs4001 is OK: PYBAL OK - All pools are healthy [18:47:59] (03CR) 10Gehel: [C: 032] Add caching headers for nginx [puppet] - 10https://gerrit.wikimedia.org/r/274864 (https://phabricator.wikimedia.org/T126730) (owner: 10Smalyshev) [18:48:11] Reedy: try now? https://librenms.wikimedia.org/ [18:48:21] !log activating cache headers for WDQS [18:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:49:26] 06Operations, 10Traffic, 07HTTPS: Enable HSTS on Wikimedia sites - https://phabricator.wikimedia.org/T40516#2204462 (10BBlack) [18:49:28] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#2204458 (10BBlack) 05Open>03Resolved a:03BBlack [18:50:17] (03PS2) 10Dzahn: bast1001: reorder includes, rm ganglia_aggregator [puppet] - 10https://gerrit.wikimedia.org/r/283252 (https://phabricator.wikimedia.org/T123721) [18:50:23] (03PS1) 10Dzahn: partman: make bast1001 use raid1-lvm [puppet] - 10https://gerrit.wikimedia.org/r/283255 (https://phabricator.wikimedia.org/T123721) [18:51:00] (03PS3) 10Dzahn: network: remove bast1001 SLAAC addresses [puppet] - 10https://gerrit.wikimedia.org/r/283254 (https://phabricator.wikimedia.org/T123721) [18:51:14] (03PS4) 10Dzahn: network: remove bast1001 SLAAC addresses [puppet] - 10https://gerrit.wikimedia.org/r/283254 (https://phabricator.wikimedia.org/T123721) [18:51:59] this is causing data inconsistencies on Wikidata right now [18:52:03] no idea? [18:53:33] hoo what is? [18:53:39] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Grant reedy access to librenms - https://phabricator.wikimedia.org/T131252#2204471 (10Andrew) 05Open>03Resolved @Reedy, you should be all set -- let me know if this doesn't work. [18:53:44] mutante: woops i think you just broke my time consuming sql query :-). [18:54:05] bblack: See for example https://commons.wikimedia.org/w/api.php?action=query&prop=info&redirects=1&converttitles=1&format=json&titles=File:Iogansen YuI.jpg [18:54:19] but thanks for the updates appreciated :) [18:54:23] We internally do such API requests (within the cluster) and get a security redirect foo bar [18:54:32] breaking page name normalization/ validation [18:54:40] hoo: was there some context before "this is causing..."? 
[18:55:19] bblack: No, just the html of the redirect page which MediaWiki serves me [18:55:25] jdlrobson: ugh, sorry, and i have to do it again. please switch to bast2001 or 3003, actually 3001 will be much closer for you if in UK [18:55:34] it's the new one in esams [18:56:03] so [18:56:03] hoo: I'm sorry, I still don't understand. I tried your URL and it does 200 OK with json output [18:56:06] what's with the &* [18:56:20] e.g. https://commons.wikimedia.org/w/api.php?origin=https%3A%2F%2Fen.wikipedia.org is redirecting to https://commons.wikimedia.org/w/api.php?origin=https%3A%2F%2Fen.wikipedia.org&* [18:56:21] bblack: Now, it 301s or 302s [18:56:24] Ü* no [18:56:26] this is breaking all CORS requests [18:56:33] (03PS3) 10Madhuvishy: analytics_cluster: Add wrapper script for beeline [puppet] - 10https://gerrit.wikimedia.org/r/282713 (https://phabricator.wikimedia.org/T116123) [18:57:05] hoo: it's 200 for me, your 'https://commons.wikimedia.org/w/api.php?action=query&prop=info&redirects=1&converttitles=1&format=json&titles=File:Iogansen YuI.jpg' [18:57:10] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 120 not-conn: cp4008_v4, cp4008_v6, cp4010_v4, cp4010_v6 [18:57:10] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp4008_v4, cp4008_v6, cp4010_v4, cp4010_v6 [18:57:10] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp4008_v4, cp4008_v6, cp4010_v4, cp4010_v6 [18:57:21] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 120 not-conn: cp4008_v4, cp4008_v6, cp4010_v4, cp4010_v6 [18:57:40] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4010_v4, cp4010_v6 [18:57:41] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4010_v4, cp4010_v6 [18:57:52] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4010_v4, cp4010_v6 [18:58:00] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4010_v4, cp4010_v6 [18:58:01] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4010_v4, cp4010_v6 [18:58:06] I do see the Security Redirect on 'https://commons.wikimedia.org/w/api.php?origin=https%3A%2F%2Fen.wikipedia.org' [18:58:08] bblack: That's interesting [18:58:11] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4010_v4, cp4010_v6 [18:58:12] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4010_v4, cp4010_v6 [18:58:19] it also works, if I curl the URL [18:58:20] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 122 not-conn: cp4010_v4, cp4010_v6 [18:58:26] yeah I tesed with curl [18:58:26] "security redirect"? 
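[Editor's note: a minimal reproduction of the redirect being compared here, assuming PHP's cURL extension is available. It issues the same GET request with the origin parameter that hoo quotes above and prints the status line and Location header instead of following the redirect; while the bug was live the Location was the same URL with "&*" appended (later filed as T132612), and after the fix it should be a plain 200.]

    <?php
    // Sketch: reproduce the api.php "&*" redirect described above.
    // Fetch the URL without following redirects so the 302 and its
    // Location header can be inspected directly.
    $url = 'https://commons.wikimedia.org/w/api.php?origin=' . rawurlencode('https://en.wikipedia.org');

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // capture the response instead of printing it
        CURLOPT_FOLLOWLOCATION => false,  // we want to see the redirect itself
        CURLOPT_HEADER         => true,   // include response headers in the output
    ]);
    $response = curl_exec($ch);
    $status   = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    echo "HTTP status: $status\n";
    if (preg_match('/^Location:\s*(.+)$/mi', $response, $m)) {
        echo "Location: " . trim($m[1]) . "\n";   // ended in "&*" while the bug was live
    }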
[18:58:29] but not served via browser or via MediaWiki's http class [18:58:31] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4010_v4, cp4010_v6 [18:58:32] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4010_v4, cp4010_v6 [18:58:40] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4010_v4, cp4010_v6 [18:58:40] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 122 not-conn: cp4010_v4, cp4010_v6 [18:58:41] (03PS4) 10Madhuvishy: analytics_cluster: Add wrapper script for beeline [puppet] - 10https://gerrit.wikimedia.org/r/282713 (https://phabricator.wikimedia.org/T116123) [18:58:41] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4010_v4, cp4010_v6 [18:58:42] (03PS2) 10Dzahn: partman: make bast1001 use raid1-lvm [puppet] - 10https://gerrit.wikimedia.org/r/283255 (https://phabricator.wikimedia.org/T123721) [18:58:50] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4010_v4, cp4010_v6 [18:58:51] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 122 not-conn: cp4010_v4, cp4010_v6 [18:58:51] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4010_v4, cp4010_v6 [18:58:51] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 122 not-conn: cp4010_v4, cp4010_v6 [18:58:53] hoo: when did it start? [18:59:00] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4010_v4, cp4010_v6 [18:59:01] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:59:10] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [18:59:17] bblack: Dunno for sure [18:59:22] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:59:23] (03PS3) 10Dzahn: partman: make bast1001 use raid1-lvm [puppet] - 10https://gerrit.wikimedia.org/r/283255 (https://phabricator.wikimedia.org/T123721) [18:59:28] (03CR) 10Dzahn: [C: 032] partman: make bast1001 use raid1-lvm [puppet] - 10https://gerrit.wikimedia.org/r/283255 (https://phabricator.wikimedia.org/T123721) (owner: 10Dzahn) [18:59:43] hoo: since the word Security is in there, and because it's affecting CORS, I'd check in with csteipp [19:00:04] ostriches: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160413T1900). [19:00:25] it's *breaking* CORS [19:00:33] is this a security patch or something? [19:00:40] MatmaRex: Can't see anything [19:00:42] I have no idea [19:00:46] already checked prior to asking [19:00:48] i see nothing related in SAL :( [19:00:53] same here [19:01:06] ostriches: when deploying, please dont use bast1001 right now in your ssh config [19:01:06] are we sure it's new behavior? does some log point to when it changed in time? [19:01:14] i'm going to tell you when it started in a minute [19:01:20] mutante: What can I use instead? [19:01:30] bblack: Yes, we are sure... 
and no [19:01:35] ostriches: bast2001 [19:01:41] Sitelink handling on Wikidata is completely broken now [19:01:45] and people indeed notice taht [19:01:49] bblack: it started when these uploads stopped: https://commons.wikimedia.org/w/index.php?title=Special:RecentChanges&tagfilter=cross-wiki-upload [19:01:51] ostriches: or 3001 or 4001 even [19:02:03] first user report at 18:13 UTC [19:02:09] around 17:45 UTC [19:02:36] wait, I can even check the logs [19:02:44] Shouldn't be a security patch. For normal mediawiki requests we do the IE-stupidity redirect if we think IE will think this is a filename.. [19:02:57] Actually... might be a security patch. One sec. [19:03:19] No, nevermind. I don't think it's a security patch. [19:03:32] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 44 ESP OK [19:03:39] IEUrlExtension hasn't been touched in a while [19:03:41] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 56 ESP OK [19:03:51] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 44 ESP OK [19:03:51] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 44 ESP OK [19:04:01] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 56 ESP OK [19:04:02] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 44 ESP OK [19:04:11] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 56 ESP OK [19:04:11] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 124 ESP OK [19:04:13] Did tin's ssh fingerprint change? [19:04:17] csteipp: looking at the code, that's not supposed to happen for POST requests ever, and it's happening [19:04:17] Since yesterday? [19:04:30] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 44 ESP OK [19:04:31] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 56 ESP OK [19:04:32] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 44 ESP OK [19:04:32] no, I just sshed in just fine [19:04:32] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 124 ESP OK [19:04:41] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 44 ESP OK [19:04:42] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 56 ESP OK [19:04:42] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 124 ESP OK [19:04:47] (03PS1) 10Chad: group1 to wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283257 [19:04:50] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 124 ESP OK [19:04:50] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 56 ESP OK [19:04:51] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 44 ESP OK [19:04:56] * csteipp is going to back away slowly from tin... [19:05:00] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 124 ESP OK [19:05:01] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 56 ESP OK [19:05:01] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 56 ESP OK [19:05:02] csteipp: bast1001 did, if you are going through that [19:05:10] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 124 ESP OK [19:05:10] (03CR) 10Chad: [C: 032] group1 to wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283257 (owner: 10Chad) [19:05:12] ostriches: can you hold? shit's broken :( [19:05:19] Oh, yeah.. 
bast1001 ;) [19:05:30] ostriches: Yes, seriously [19:05:39] (03Merged) 10jenkins-bot: group1 to wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283257 (owner: 10Chad) [19:05:49] csteipp: i'll have to change that one more time, you might want to switch to just bast2001 [19:06:06] 06Operations, 10Wikimedia-General-or-Unknown: api.php gives a 302 redirect to URL with '&*' appended, breaking CORS requests - https://phabricator.wikimedia.org/T132612#2204512 (10matmarex) [19:06:18] I won't sync wikiversions. [19:06:31] 06Operations, 10Wikimedia-General-or-Unknown: api.php gives a 302 redirect to URL with '&*' appended, breaking CORS requests - https://phabricator.wikimedia.org/T132612#2204514 (10hoo) p:05High>03Unbreak! [19:06:42] csteipp: bast1001's fingerprint changed (but it's being reinstalled right this second to fix a raid setup issue) [19:07:11] 06Operations, 10Wikimedia-General-or-Unknown: api.php gives a 302 redirect to URL with '&*' appended, breaking CORS requests - https://phabricator.wikimedia.org/T132612#2204366 (10hoo) This also breaks changing sitelinks on Wikidata. This causes data inconsistencies between Wikidata and the clients (Wikipedias... [19:07:24] 06Operations, 10Wikimedia-General-or-Unknown: api.php gives a 302 redirect to URL with '&*' appended, breaking CORS requests - https://phabricator.wikimedia.org/T132612#2204366 (10matmarex) This started around 17:45 UTC today, judging by when the cross-wiki uploads stopped happening: https://commons.wikimedia.... [19:07:52] ostriches: Ok, you might also want to revert so that we can easily use tin for deployments once we have a fix [19:08:33] As long as nobody does sync-wikiversions you're fine :) [19:08:44] mh, ok [19:09:44] (03PS1) 10Dzahn: Revert "bast1001: rsync home dirs back from tungsten" [puppet] - 10https://gerrit.wikimedia.org/r/283258 [19:09:50] (03PS2) 10Dzahn: Revert "bast1001: rsync home dirs back from tungsten" [puppet] - 10https://gerrit.wikimedia.org/r/283258 [19:09:57] 06Operations, 10Wikimedia-General-or-Unknown: api.php gives a 302 redirect to URL with '&*' appended, breaking CORS requests - https://phabricator.wikimedia.org/T132612#2204552 (10matmarex) ``` $ curl -i https://commons.wikimedia.org/w/api.php?origin=https%3A%2F%2Fen.wikipedia.org HTTP/1.1 302 Found Date: Wed,... [19:10:06] I really don't have much hint to go on here, like you I don't see any obvious change in the right timeframe that would be remotely related [19:10:18] unless something unlogged happened [19:10:21] MatmaRex: "This started around 17:45 UTC today" I'm assuming nothing changed in the upload process that suddenly started using cors in a different way, right? [19:10:26] (03CR) 10Dzahn: [C: 032] Revert "bast1001: rsync home dirs back from tungsten" [puppet] - 10https://gerrit.wikimedia.org/r/283258 (owner: 10Dzahn) [19:10:29] csteipp: nope [19:10:37] csteipp: and this affects Wikidata stuff too, apparnetly [19:10:44] I can confirm there have been no unlogged security patches deployed in the last few days. [19:10:47] csteipp: Also MediaWikiPageNameNormalizer in core is also affected (as it does API requests) [19:10:55] this definitely looks like the result of WebRequest::doSecurityRedirect() [19:11:07] (03CR) 10Dzahn: "this is just to copy the home dir data one more time in the opposite direction" [puppet] - 10https://gerrit.wikimedia.org/r/283258 (owner: 10Dzahn) [19:11:59] MatmaRex: Yeah, but nothing about that changed AFAICT [19:12:01] MatmaRex: The redirect, yes. 
So something changed so that the redirect is suddenly being called. Someone want to add a stack trace somewhere to figure out the call path? [19:12:35] csteipp: i can tell you the call path :) api.php does $wgRequest->checkUrlExtension() [19:13:01] and this does the security redirect [19:13:22] now someone should figure out why IEUrlExtension::areServerVarsBad() is suddenly returning true [19:13:42] wait what about ori's Server: header thing? [19:13:43] anything in apache configuration changed? i'm not even sure what exactly that methods checks for [19:13:52] I discounted that as unrelated, but ... [19:13:55] but i know some of our apache rewrites are crazy [19:13:56] (03PS1) 10Chad: Try to tune back ldap logging a tad. Rather spammy. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283259 [19:14:02] it also loads a new apache module too [19:14:11] https://gerrit.wikimedia.org/r/#/c/283226/ [19:14:31] and then followed up with https://gerrit.wikimedia.org/r/#/c/283230/1 [19:14:41] sounds scary enough to me [19:14:57] bblack: Do we at some point mangle URLs as to change the order of get parameters? [19:15:16] hoo: we do very fucked up things, there's some other bug this is causing [19:15:19] https://commons.wikimedia.org/w/api.php?action=query&prop=info&redirects=1&converttitles=1&titles=File:Iogansen%20YuI.jpg&format=json [19:15:32] that works (because the .jpg is not the last parameter) [19:15:37] (03PS1) 10Dzahn: Revert "Revert "bast1001: rsync home dirs back from tungsten"" [puppet] - 10https://gerrit.wikimedia.org/r/283260 [19:16:00] PROBLEM - Host cp4011 is DOWN: PING CRITICAL - Packet loss = 100% [19:16:05] (03CR) 10Dzahn: "the changes are fine, revert means copy a->b and revert-revert means copy b->a :p" [puppet] - 10https://gerrit.wikimedia.org/r/283260 (owner: 10Dzahn) [19:16:12] hoo: I think the param order thing is a natural effect with the IE check. it's only trying to prevent filename ending problems [19:16:21] RECOVERY - Host cp4011 is UP: PING OK - Packet loss = 0%, RTA = 75.98 ms [19:16:40] ori: ? [19:16:42] bblack: hm... but why is this suddenly a problem? [19:17:06] bblack: catching up [19:17:44] ori: https://phabricator.wikimedia.org/T132612 can it be caused by https://gerrit.wikimedia.org/r/#/c/283230/ ? [19:18:16] yes, it's ori's thing [19:18:22] (indirectly) [19:18:27] in https://doc.wikimedia.org/mediawiki-core/REL1_25/php/IEUrlExtension_8php_source.html [19:18:37] static $whitelist = array( 261 'Apache', 262 'Zeus', 263 'LiteSpeed' ); 264 if ( preg_match( '/^(.*?)($|\/| )/', $serverSoftware, $m ) ) { 265 return in_array( $m[1], $whitelist ); [19:18:51] basically part of that code in there for the IE bug-check is looking for Apache in the server header [19:18:57] and that's been replaced with the hostname [19:19:12] ok [19:19:17] 06Operations, 10ops-ulsfo, 06DC-Ops: ulsfo temperature-related exceptions - https://phabricator.wikimedia.org/T119631#2204587 (10RobH) In addition to replacing the thermal paste on all lvs4001-4004, I've knocked out the following hosts from T125205. * cp4008 * cp4010 * cp4011 * cp4012 That took care of all... 
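[Editor's note: the snippet quoted just above comes from the doc viewer and has source line numbers (261-265) interleaved in it. Below is a cleaned-up, self-contained sketch of the same whitelist logic (illustrative, not the exact MediaWiki source) showing why a Server value of "Apache" passes while the bare hostname now being sent does not.]

    <?php
    // Sketch of the whitelist check quoted above, with the doc-viewer line numbers
    // stripped. It takes the first token of SERVER_SOFTWARE -- everything up to the
    // first "/", space, or end of string -- and accepts only known server names.
    function serverSoftwareIsWhitelisted($serverSoftware) {
        static $whitelist = ['Apache', 'Zeus', 'LiteSpeed'];
        if (preg_match('/^(.*?)($|\/| )/', $serverSoftware, $m)) {
            return in_array($m[1], $whitelist);
        }
        return false;
    }

    var_dump(serverSoftwareIsWhitelisted('Apache'));                 // true
    var_dump(serverSoftwareIsWhitelisted('Apache/2.4.10 (Debian)')); // true: first token is "Apache"
    var_dump(serverSoftwareIsWhitelisted('mw1171.eqiad.wmnet'));     // false: a hostname is not on the list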
[19:19:39] I suggest live hacking that function to return true [19:19:40] ori: ^ [19:19:42] Server: mw1171.eqiad.wmnet makes that code ( IEUrlExtension::areServerVarsBad ) behave differently than Server: Apache [19:20:02] OK, I don't have the full picture yet, but I'll do what hoo suggests for now [19:20:04] just a moment [19:20:10] PROBLEM - Host cp4012 is DOWN: PING CRITICAL - Packet loss = 100% [19:22:06] hoo: (btw, https://phabricator.wikimedia.org/T123276 is the other bug caused by the URL getting rewritten before reaching apaches) [19:23:20] !log ori@tin Synchronized php-1.27.0-wmf.20/includes/libs/IEUrlExtension.php: Live-hack IEUrlExtension::haveUndecodedRequestUri() to always return true (duration: 00m 33s) [19:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:24:00] ori: Looks good [19:24:12] yeah [19:24:13] !log ori@tin Synchronized php-1.27.0-wmf.21/includes/libs/IEUrlExtension.php: Live-hack IEUrlExtension::haveUndecodedRequestUri() to always return true (duration: 00m 33s) [19:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:24:20] confirmed [19:24:46] OK, now that it's not an ongoing crisis, wtf happened exactly, and what is this evil code? [19:24:59] it looks like this would probably work (without the hack) if we made the server header more like "Apache/mw1171" [19:25:47] I want to first ascertain that this code has value [19:25:51] ori: when the bug is triggered, the output is a 302 that says: [19:25:52] We can't serve non-HTML content from the URL you have requested, because [19:25:56] Internet Explorer would interpret it as an incorrect and potentially dangerous [19:25:59] content type.


Instead, please use this URL, which is the same as the [19:26:04] URL you have requested, except that "&*" is appended. This prevents Internet [19:26:08] Explorer from seeing a bogus file extension. [19:26:09] ori: I don't think that code even "worked" before this [19:26:10] so it's some kind of IE bug workaround [19:26:11] ori: yes-ish. it prevents XSS for IE 6. :P [19:26:19] if it's only IE6 or lower that has said bug, IMHO we can nuke this code at WMF, because IE6 can't connect anyways [19:26:28] IE 6 is blacklisted [19:26:32] it can connect? [19:26:35] (due to TLS restrictions, it can't even make a connection) [19:26:47] Yeah, it's IE6 only, I don't think IE7 is doing that insane stuff anymore [19:26:51] right, and there's that [19:27:03] let me confirm that it's not an issue with IE7 and then i'll submit a patch to nuke this code [19:27:07] bblack: are you sure? isn't it just pre-service-pack IE6? [19:27:15] hold on, jesus [19:27:24] Yeah, you can enable TLS1.1 on IE6 or so [19:27:39] Unpatched IE6. You can flip the config to make it work on SP3 (iirc) [19:27:41] IE 6 can view Wikipedia [19:27:44] even if it couldn't [19:27:48] IE 6 can view other MediaWiki sites [19:27:52] which is what this check is meant for [19:27:55] hmm yeah that may be true, although it's rare [19:27:56] you can't just go and delete if c [19:28:01] …it because you don't like it [19:28:11] it's not rare. it's normal [19:28:14] Then slap a config in front of it [19:28:17] bblack: pretty common for corporate managed laptops... [19:28:28] and add a comment that it's IE6 only and supposed to get killed at some point [19:28:29] if anything, yes, this could be behind a config option, on by default [19:28:29] how does SERVER_SOFTWARE enter into it? [19:28:42] SERVER_SOFTWARE is the Server: header [19:28:44] ori: via PHP'S super global [19:28:45] What MatmaRex said. Config it, on by default. [19:28:46] the real problem is that our apaches have shit rewrites [19:29:01] changing the Server header or whatever should not have caused this [19:29:13] it can't be *that* common, it's still way under a percent of our traffic [19:29:16] if you want a real fix, please find out why QUERY_STRING is decoded [19:29:23] (just as an outside bound) [19:29:39] (03CR) 10Ottomata: analytics_cluster: Add wrapper script for beeline (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/282713 (https://phabricator.wikimedia.org/T116123) (owner: 10Madhuvishy) [19:29:42] bblack: out of all IE 6 traffic, i'd think most is upgraded as far as it can be. :) [19:29:50] out for now, will try to be back for the train and further catastrophes... but first, food [19:29:53] yeah but most traffic is not IE6 to begin with [19:29:53] (it's a small portion of all traffic either way) [19:30:10] ori: so [19:30:28] couldn't it also be user agent switched? [19:30:28] ori: the SERVER_SOFTWARE check is to see if we can rely on REQUEST_URI, or if we should use QUERY_STRING instead. [19:30:43] 0.31% market share for IE6 [19:30:53] I'm still parsing "When passed the value of $_SERVER['SERVER_SOFTWARE'], this function returns true if that server is known to have a REQUEST_URI variable with %2E not decod ed to ".". On such a server, it is possible to detect whether the script filename has been obscured. The function returns false if the server is not know n to have this behavior. Microsoft IIS in particular is known to decode escaped script fil [19:30:53] enames." [19:30:54] ori: apparently, QUERY_STRING is fucked up in WMF deployment. 
it probably shouldn't be automatically decoded. [19:31:28] I don't think it's related to WMF [19:31:33] This check returns true for Apache [19:31:34] not WMF Apache [19:31:41] and various other server softwares as well [19:31:44] ori: and we have other bugs about $_SERVER vars being fucked up, e.g. https://phabricator.wikimedia.org/T123276 [19:31:44] (that WMF does not use) [19:32:05] and at least one other one i can't find now [19:32:07] we do decode the whole URI to some custom degree in varnish, too [19:32:18] We can change the fqnd module to change instead of replace. e.g. Server: Apache (%fqdn) [19:32:19] only for MediaWiki [19:33:00] that would still be fucked up and would bite us again some day [19:33:14] I want a few minutes to understand it properly and think about it, are deployments pending? [19:33:23] train is held up [19:34:25] ostriches: can you wait a few? [19:34:26] https://github.com/wikimedia/operations-puppet/blob/production/templates/varnish/normalize_path.inc.vcl.erb is the varnish URI decoding that happens for MW (and a variant for RestBase too, now, relatedly) [19:34:32] (03CR) 10Dzahn: [C: 032] Revert "Revert "bast1001: rsync home dirs back from tungsten"" [puppet] - 10https://gerrit.wikimedia.org/r/283260 (owner: 10Dzahn) [19:34:37] the code has been slightly moved/updated in recent times, but it's ancient in origin [19:34:50] (including most of the commentary on why) [19:35:03] (03PS1) 10Dzahn: Revert "bast1001: remove temp rsync for migration" [puppet] - 10https://gerrit.wikimedia.org/r/283263 [19:35:18] ori: I was [19:35:29] (03PS2) 10Dzahn: Revert "bast1001: remove temp rsync for migration" [puppet] - 10https://gerrit.wikimedia.org/r/283263 [19:35:37] (03CR) 10Dzahn: [C: 032] Revert "bast1001: remove temp rsync for migration" [puppet] - 10https://gerrit.wikimedia.org/r/283263 (owner: 10Dzahn) [19:36:06] thanks [19:36:25] so, is the idea that QUERY_STRING would be un-decoded in cases where REQUEST_URI is decoded? [19:36:35] couldn't we just choose the one with more % in it? [19:36:39] (03PS1) 10Dzahn: Revert "Revert "bast1001: remove temp rsync for migration"" [puppet] - 10https://gerrit.wikimedia.org/r/283264 [19:36:50] and the other bug i know caused by this is https://phabricator.wikimedia.org/T128380 , found it now [19:37:06] oh, the varnish code I linked does stop at ?query - it only decodes the path part [19:37:14] so it's not that [19:37:28] ori: i think the best simplest long-term fix is to prefix the value with "Apache " in your patch [19:37:38] can we do that, verify, and run the train? [19:37:48] I'd say that's a medium-term fix at best, though [19:37:49] no, because I don't agree [19:37:58] we have talked about replacing apache with nginx, and that would bite us in that case too [19:38:01] we'll face this again when we dump apache for hhvm's own HTTP server, etc [19:38:05] or that [19:38:28] bblack: ori: we hopefully won't, if we also fix whatever is decoding the query string. i guess that is the longer-term fix. [19:38:30] it's bad code [19:38:45] https://phabricator.wikimedia.org/T128380#2072420 [19:38:46] "the data passed to HHVM is a mixed bag of already-decoded and non-decoded nonsense" [19:38:50] which is bad code? [19:39:28] haveUndecodedRequestUri [19:39:41] the part that cares about what "Server: " says in reference to some IE6 client side bug? 
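[Editor's note: rough shape of the live hack ori synced out above and mentions committing to the production branch just below. Per the documentation quoted a few lines up, the real method is passed $_SERVER['SERVER_SOFTWARE']; this is a sketch of the effect of the hack, not the actual diff.]

    <?php
    // The live hack: instead of consulting the SERVER_SOFTWARE whitelist, claim
    // unconditionally that REQUEST_URI is left undecoded, so query strings ending
    // in things like ".jpg" or the origin parameter no longer trigger the "&*"
    // security redirect.
    function haveUndecodedRequestUri($serverSoftware) {
        return true; // skip the whitelist check entirely
    }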
[19:39:50] that server whitelist is suspect [19:40:15] I'll unblock the train by committing the live-hack into the production branch [19:40:32] then figure out how to clean this up, and revert the live-hack [19:40:55] it's from http://mediawiki.org/wiki/Special:Code/MediaWiki/89558 [19:42:39] the logic there is hard to follow heh [19:43:19] https://phabricator.wikimedia.org/T30235#331525 [19:45:43] (03PS1) 10Mobrovac: Math: increase the number of concurrent connections to 150 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283269 (https://phabricator.wikimedia.org/T132096) [19:45:54] (03PS2) 10Alex Monk: deployment-prep: keyholder shinken monitoring [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) [19:47:22] 06Operations, 13Patch-For-Review: Reinstall bast1001 with jessie - https://phabricator.wikimedia.org/T123721#2204676 (10Dzahn) had to reinstall one more time because it did no use RAID1, now it does: root@bast1001:~# mdadm --detail /dev/md0 |grep Level Raid Level : raid1 ``` +---------+---------+-------... [19:48:10] what about the later comments below that, where it says: [19:48:13] So by sending "Content-Disposition: filename=api.php", we can avoid having to deal with the broken behaviour of GetFileExtensionFromUrl(). [19:48:50] seems like that might be a superior approach [19:49:10] and regardless, surely we could lock this whole thing down to "only if UA string matches IE6 in the first place"? [19:49:51] 06Operations, 10Wikimedia-General-or-Unknown: api.php gives a 302 redirect to URL with '&*' appended, breaking CORS requests - https://phabricator.wikimedia.org/T132612#2204693 (10matmarex) Worked around for now: [19:23] !log ori@tin Synchronized php-1.27.0-wmf.20/includes/libs/IEUrlExtension.ph... [19:49:53] ostriches: train unblocked [19:50:22] i merged the live-hack to both prod branches, will back it out and replace with a proper fix in a little while [19:50:33] (03PS5) 10Madhuvishy: analytics_cluster: Add wrapper script for beeline [puppet] - 10https://gerrit.wikimedia.org/r/282713 (https://phabricator.wikimedia.org/T116123) [19:51:50] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [19:52:01] (03PS6) 10Madhuvishy: analytics_cluster: Add wrapper script for beeline [puppet] - 10https://gerrit.wikimedia.org/r/282713 (https://phabricator.wikimedia.org/T116123) [19:52:38] I also don't see how it is correct to assume that Apache is the only server software in the request path [19:54:11] ori: yeah. there's https://phabricator.wikimedia.org/T47501 ;) [19:54:30] PROBLEM - NTP on cp4011 is CRITICAL: NTP CRITICAL: Offset unknown [19:54:31] yeah, there you go [19:54:58] before i go on ranting -- thank you, bblack, MatmaRex, hoo, ostriches [19:55:21] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [19:56:25] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: (no message) [19:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:57:35] mutante: I think that's you with un-puppet-merged stuff? [19:58:08] bblack: yes, sry, done [19:59:11] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [19:59:31] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
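[Editor's note: a hedged sketch of the two alternatives floated just above: sending a Content-Disposition header that names api.php so IE's extension sniffing sees a harmless filename, and only running the check at all when the User-Agent looks like old IE. The MSIE regex is an assumption for illustration, the quoted comment gives only "filename=api.php" (the "inline;" disposition type is added here for a well-formed header), and neither alternative is what was deployed.]

    <?php
    // Sketch of the alternatives discussed above, not of anything that shipped.
    // Assumption: a plain "MSIE 5-8" test is a good-enough stand-in for
    // "old Internet Explorer"; real detection would be more careful.
    $userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    if (preg_match('/MSIE [5-8]\./', $userAgent)) {
        // Alternative 1: tell IE what the "file name" is, so its URL-extension
        // sniffing never sees a bogus ".jpg" or ".php" at the end of the query string.
        header('Content-Disposition: inline; filename=api.php');

        // Alternative 2 would instead run the existing checkUrlExtension() /
        // security-redirect logic only inside this branch, skipping it for
        // every other user agent.
    }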
[20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160413T2000). [20:00:36] no parsoid deploy today [20:01:11] PROBLEM - salt-minion processes on bast1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [20:02:00] (03PS1) 10Alex Monk: shinken: Add myself to Beta Cluster Administrators contact group [puppet] - 10https://gerrit.wikimedia.org/r/283278 [20:02:02] (03PS7) 10Madhuvishy: analytics_cluster: Add wrapper script for beeline [puppet] - 10https://gerrit.wikimedia.org/r/282713 (https://phabricator.wikimedia.org/T116123) [20:04:45] (03CR) 10Dzahn: [C: 032] shinken: Add myself to Beta Cluster Administrators contact group [puppet] - 10https://gerrit.wikimedia.org/r/283278 (owner: 10Alex Monk) [20:05:18] ty mutante [20:05:35] np [20:08:31] RECOVERY - Host cp4012 is UP: PING OK - Packet loss = 0%, RTA = 76.11 ms [20:08:55] (03PS8) 10Ottomata: analytics_cluster: Add wrapper script for beeline [puppet] - 10https://gerrit.wikimedia.org/r/282713 (https://phabricator.wikimedia.org/T116123) (owner: 10Madhuvishy) [20:10:37] (03CR) 10Ottomata: [C: 032] analytics_cluster: Add wrapper script for beeline [puppet] - 10https://gerrit.wikimedia.org/r/282713 (https://phabricator.wikimedia.org/T116123) (owner: 10Madhuvishy) [20:11:12] OH madhuvishy i think we don't need that hive.yaml file at all [20:11:14] since we grab sane defaults [20:11:19] we'd only need that to override [20:12:45] ottomata: hmmm - just set the port number in the client class too? [20:13:02] no you have a hiera default lookup [20:13:05] the defaults are good [20:13:17] ah [20:13:18] (03PS1) 10Ottomata: Remove unneeded hive/client.yaml role hiera file [puppet] - 10https://gerrit.wikimedia.org/r/283279 [20:13:28] right, okay [20:13:42] can add if needed [20:13:44] (03CR) 10Ottomata: [C: 032 V: 032] Remove unneeded hive/client.yaml role hiera file [puppet] - 10https://gerrit.wikimedia.org/r/283279 (owner: 10Ottomata) [20:13:53] and also override in labs if needed i guess [20:13:55] (03PS2) 10Dzahn: Revert "Revert "bast1001: remove temp rsync for migration"" [puppet] - 10https://gerrit.wikimedia.org/r/283264 [20:14:21] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2204736 (10ezachte) I can't login right now to check. The vast majority of that 2TB will be backups, which I thin out every half year or so. All html files in htdocs should be co... 
[20:14:34] in labs you have 2 options for that, hiera files in the repo or edit a special wiki page [20:15:09] (03CR) 10Dzahn: [C: 032] "done copying" [puppet] - 10https://gerrit.wikimedia.org/r/283264 (owner: 10Dzahn) [20:17:15] (03PS1) 10BBlack: Revert "Disable ulsfo T128424" [dns] - 10https://gerrit.wikimedia.org/r/283288 [20:17:20] (03PS2) 10BBlack: Revert "Disable ulsfo T128424" [dns] - 10https://gerrit.wikimedia.org/r/283288 [20:17:31] RECOVERY - NTP on cp4011 is OK: NTP OK: Offset -0.0001041889191 secs [20:19:28] (03CR) 10BBlack: [C: 032] Revert "Disable ulsfo T128424" [dns] - 10https://gerrit.wikimedia.org/r/283288 (owner: 10BBlack) [20:19:54] (03CR) 10Physikerwelt: Math: increase the number of concurrent connections to 150 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283269 (https://phabricator.wikimedia.org/T132096) (owner: 10Mobrovac) [20:20:08] !log re-pooling ulsfo traffic T128424 [20:20:10] RECOVERY - salt-minion processes on bast1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:23:04] 07Puppet, 10Beta-Cluster-Infrastructure, 06Services, 15User-mobrovac: deployment-(mathoid|sca0[12]|cxserver03) puppet failures "Error: Could not set home on user" due to user being in use by process - https://phabricator.wikimedia.org/T132265#2204755 (10mobrovac) `deployment-(mathoid|sca0[12])` have been f... [20:24:44] 07Puppet, 10Beta-Cluster-Infrastructure, 06Services, 15User-mobrovac: deployment-(mathoid|sca0[12]|cxserver03) puppet failures "Error: Could not set home on user" due to user being in use by process - https://phabricator.wikimedia.org/T132265#2204772 (10Krenair) Yeah, I can only log in as root there, not m... [20:28:15] (03PS1) 10Madhuvishy: analytics_cluster: Fix beeline path in wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/283323 [20:30:22] 06Operations, 10Dumps-Generation, 07HHVM, 13Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2204797 (10ArielGlenn) It's already installed and provides hhvm-gdb which I used above. [20:30:36] (03PS2) 10Madhuvishy: analytics_cluster: Fix beeline path in wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/283323 [20:30:43] 07Puppet, 10Beta-Cluster-Infrastructure, 06Services, 15User-mobrovac: deployment-(mathoid|sca0[12]|cxserver03) puppet failures "Error: Could not set home on user" due to user being in use by process - https://phabricator.wikimedia.org/T132265#2192956 (10hashar) For deployment-cxserver03 I have filled {T132... 
[20:32:04] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2204810 (10mobrovac) [20:33:22] (03CR) 10Ottomata: [C: 032 V: 032] analytics_cluster: Fix beeline path in wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/283323 (owner: 10Madhuvishy) [20:33:44] (03CR) 10Mobrovac: Math: increase the number of concurrent connections to 150 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283269 (https://phabricator.wikimedia.org/T132096) (owner: 10Mobrovac) [20:34:20] (03PS5) 10Dzahn: network: remove bast1001 SLAAC addresses [puppet] - 10https://gerrit.wikimedia.org/r/283254 (https://phabricator.wikimedia.org/T123721) [20:34:58] (03PS6) 10Dzahn: network: remove bast1001 SLAAC addresses [puppet] - 10https://gerrit.wikimedia.org/r/283254 (https://phabricator.wikimedia.org/T123721) [20:35:27] (03CR) 10Dzahn: [C: 032] network: remove bast1001 SLAAC addresses [puppet] - 10https://gerrit.wikimedia.org/r/283254 (https://phabricator.wikimedia.org/T123721) (owner: 10Dzahn) [20:41:26] MatmaRex: how does it make sense to default to QUERY_STRING? [20:41:45] elukey: awake enough for a 10-second review? https://gerrit.wikimedia.org/r/#/c/283324/1 [20:43:50] ori: i don't know, but it's tim's code, so i'm assuming it does until proven otherwise. i think it's a WMF misconfiguration problem. [20:44:31] ori_: are you working on a fix? a $wg variable to choose which of REQUEST_URI, QUERY_STRING, PATH_INFO should be checked probably makes sense, on second thought [20:45:30] 06Operations, 10Wikimedia-Apache-configuration, 07Varnish: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629#2204871 (10matmarex) [20:45:54] 06Operations, 10Wikimedia-Apache-configuration, 07Varnish: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629#2204887 (10matmarex) [20:46:05] 06Operations, 10Wikimedia-Apache-configuration, 07Varnish: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629#2204871 (10matmarex) [20:46:10] 06Operations, 10Wikimedia-General-or-Unknown: api.php gives a 302 redirect to URL with '&*' appended, breaking CORS requests - https://phabricator.wikimedia.org/T132612#2204891 (10matmarex) [20:46:18] MatmaRex: I propose checking whether $_SERVER['QUERY_STRING'] contains more %s than the query component of the URI as derived from $_SERVER['REQUEST_URI'] [20:46:18] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2204894 (10Dzahn) If that is mostly HTML and text let me try to compress that, we should achieve a high compression ratio and maybe it's not so bad then. [20:46:49] filed https://phabricator.wikimedia.org/T132629 , btw [20:46:54] 06Operations, 10DBA, 10MediaWiki-Special-pages, 10Wikidata, 07Performance: Batch updates create slave lag on s3 over WAN - https://phabricator.wikimedia.org/T122429#2204902 (10hoo) [20:48:46] ori_: i honestly don't know enough about this to say if that makes sense. it doesn't sound entirely unreasonable. it would be very silly if it resulted in false positives, though. 
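[Editor's note: a minimal sketch of the heuristic ori proposes above: prefer whichever of QUERY_STRING and the query component of REQUEST_URI still contains more percent signs, on the theory that it is the less-decoded one. Unreviewed, and MatmaRex's false-positive concern applies; the example request below is made up.]

    <?php
    // Pick the "rawest" query string available: count literal '%' characters as a
    // proxy for "still percent-encoded" and return whichever source has more.
    function pickRawestQueryString(array $server) {
        $fromRequestUri = '';
        if (isset($server['REQUEST_URI'])) {
            $q = parse_url($server['REQUEST_URI'], PHP_URL_QUERY);
            $fromRequestUri = is_string($q) ? $q : '';
        }
        $fromQueryString = isset($server['QUERY_STRING']) ? $server['QUERY_STRING'] : '';

        return substr_count($fromQueryString, '%') > substr_count($fromRequestUri, '%')
            ? $fromQueryString
            : $fromRequestUri;
    }

    // Example: REQUEST_URI arrives decoded, QUERY_STRING is untouched.
    var_dump(pickRawestQueryString([
        'REQUEST_URI'  => '/w/api.php?origin=https://en.wikipedia.org',
        'QUERY_STRING' => 'origin=https%3A%2F%2Fen.wikipedia.org',
    ])); // the still-encoded QUERY_STRING wins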
[20:53:37] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM: hhvm apache fills /var/log/apache2 with access logs - https://phabricator.wikimedia.org/T75262#2204929 (10Krenair) [20:54:37] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM: hhvm apache fills /var/log/apache2 with access logs - https://phabricator.wikimedia.org/T75262#755023 (10ori) You can continue finding and disabling everything that needs disk space, or you can just increase the amount of disk space available to these mach... [20:58:35] 06Operations, 07Need-volunteer, 13Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2204946 (10Andrew) [20:58:37] 06Operations, 13Patch-For-Review: fix all "top-scope variable being used without an explicit namespace" across the puppet repo - https://phabricator.wikimedia.org/T125042#2204944 (10Andrew) 05Open>03Resolved a:03Andrew [20:58:57] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM: Beta-cluster web server fills up /var/log with Apache logs - https://phabricator.wikimedia.org/T75262#2204949 (10Krinkle) [20:59:20] (03CR) 10Dzahn: [C: 032] "no diff, carbon is the real aggregator http://puppet-compiler.wmflabs.org/2431/" [puppet] - 10https://gerrit.wikimedia.org/r/283252 (https://phabricator.wikimedia.org/T123721) (owner: 10Dzahn) [20:59:41] (03PS4) 10Dzahn: bast1001: reorder includes, rm ganglia_aggregator [puppet] - 10https://gerrit.wikimedia.org/r/283252 (https://phabricator.wikimedia.org/T123721) [21:00:42] I've closed two bugs on the clinic-duty dashboard. That leaves only ~798 to go [21:00:51] there's a clinic-duty dashboard? [21:01:42] andrewbogott: wow, how.. you fixed all of them? [21:01:49] "top-scope vars" that is [21:02:10] looks [21:02:19] greg-g: Around? [21:02:27] mutante: I think so? You already merged the linter change didn't you? Or do I misunderstand that bug? [21:02:44] Krenair: https://phabricator.wikimedia.org/dashboard/view/45/ [21:02:52] Which at the moment is an out-of-control disaster [21:03:22] MatmaRex: is https://phabricator.wikimedia.org/T132612 still "Unbreak now!" after your workaround? [21:04:16] i'm looking.. 5 min [21:04:28] andrewbogott: ori's. probably not anymore [21:05:05] hoo: greg-g is out today [21:05:09] ok [21:05:15] 06Operations, 10Wikimedia-General-or-Unknown: api.php gives a 302 redirect to URL with '&*' appended, breaking CORS requests - https://phabricator.wikimedia.org/T132612#2204953 (10Andrew) p:05Unbreak!>03High [21:05:30] bd808: So... who is the go to person about deploys today? [21:05:41] 06Operations, 10Wikimedia-General-or-Unknown: api.php gives a 302 redirect to URL with '&*' appended, breaking CORS requests - https://phabricator.wikimedia.org/T132612#2204956 (10matmarex) This was indirectly caused by https://gerrit.wikimedia.org/r/#/c/283230/. That patch resulted in a change to the value of... [21:06:33] hoo: probably ostriches I would guess. He was running the train earlier [21:06:52] What's up? [21:07:04] ostriches: I would like to bump Wikibase (Wikidata) to master [21:07:12] jouncebot_: next [21:07:12] In 1 hour(s) and 52 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160413T2300) [21:07:18] Sure :) [21:07:32] Thanks :) [21:07:45] Ostrich der Lokomotivfuehrer und die Wilde wmf-13 [21:07:56] shuts up [21:07:58] :'D [21:09:17] !log https://tools.wmflabs.org/sal missing entries since 2016-04-13T09:21. 
Needs to be backfilled [21:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:10:01] andrewbogott: you did fix it, i didnt realize that's what you did with the submodule, thank you! [21:10:20] ori_: hmm, do me a favour in re https://phabricator.wikimedia.org/T132629? live-hack mw1017 or something with `var_dump( $_SERVER )`, then try visiting a few URLs and paste the results on the task? [21:10:21] https://commons.wikimedia.org/w/api.php?origin=https%3A%2F%2Fwww.mediawiki.org [21:10:24] https://en.wikipedia.org/wiki/Why_Me%3F_(Daniel_Johnston_Album) [21:10:28] https://test.wikipedia.org/wiki/Bug%3F?action=history [21:10:33] modules/kafka has a bunch of errors that were gone and got readded [21:10:42] which was only possible by overriding jenkins [21:10:45] i can't fix it, but i'm curious how broken it is. :) [21:10:54] bd808: why did it happen? [21:10:56] tssk tssk, please [21:10:57] MatmaRex: sure, give me a moment. [21:11:32] ori_: not sure exactly. the bot process was running but not in any channels. Maybe it netsplit? maybe something else. [21:11:33] ori_: thanks. also, strip your cookie and stuff from the pastes :D [21:11:47] ($_SERVER has HTTP_COOKIE) [21:11:58] my ssh is complaining about the ECDSA key of bast1001.wikimedia.org (208.80.154.149) [21:12:07] is that an expected key rotation? [21:12:08] there wasn't anything interesting the err log for the bot :/ [21:12:30] cscott: yeah. server was rebuilt today. should have an email about it [21:12:31] cscott: yes, expected, there's an email to the ops list about it [21:12:37] ok, thanks. [21:12:42] been off in parser land today [21:12:46] cscott: https://phabricator.wikimedia.org/T123721#2204676 [21:12:57] mutante: ? [21:13:40] (03PS1) 10Mattflaschen: Enable Echo survey on French-language wikis (retry) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283330 (https://phabricator.wikimedia.org/T131893) [21:14:05] (03CR) 10Mattflaschen: "Retry at https://gerrit.wikimedia.org/r/283330" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282414 (https://phabricator.wikimedia.org/T131893) (owner: 10Mattflaschen) [21:14:28] ottomata: re: puppet-lint, we fixed most of the warnings/errors globally so that for one type of error/warning there were 0 left across the repo , so then we could let jenkins vote on that specific one [21:14:37] (03PS2) 10Mattflaschen: Enable Echo survey on French-language wikis (retry) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283330 (https://phabricator.wikimedia.org/T131893) [21:14:51] ottomata: then i noticed that some came back [21:15:07] when i looked at that ticket that andrew just closed [21:15:26] it's a different check though, not the "top-scope var" thing, but just the alignment [21:15:33] already started fixing [21:16:04] oh ok [21:16:06] The thing I fixed was a logic error that would've been caught by the linter. So… +1 for linting. [21:16:35] just so that it can actually say OK for the "strict" check too [21:16:41] that we got used to never working. but now it can [21:16:41] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [5000000.0] [21:18:27] MatmaRex: https://gist.github.com/atdt/15a25c5f493d32b9e53e4add12a9b1d0 [21:19:34] ori: thank you. 
i'll put it on the task [21:21:38] Nginx is running on ~15% of all web servers and it is not included in the whitelist [21:23:09] MatmaRex: the problem with the config var approach is this: presumably the default value would be to check REQUEST_URI, since it appears to be the right thing to do for all web servers except ancient versions of Microsoft IIS [21:23:38] that means currently-secure installations using those versions of IIS will become insecure [21:24:56] IIS 5 was released 16 years ago [21:25:20] ori: i was thinking that we'd default to the current behavior (checking SERVER_SOFTWARE and deciding based on that) unless overriden by the config variable [21:26:40] ori: but, this is extremely not my area :D grab TimStarlin.g when he's out of the RFC meeting or csteip.p or someone [21:26:55] it's a good idea [21:27:05] defaulting to the current behavior, I mean [21:30:05] (03PS1) 10Dereckson: Add maps-cluster referer rules for pl.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/283332 (https://phabricator.wikimedia.org/T132510) [21:30:10] yurik: Hi. Like this for Varnish? ^ [21:30:35] (03PS2) 10Dereckson: Add maps-cluster referer rule for pl.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/283332 (https://phabricator.wikimedia.org/T132510) [21:30:38] oh one rule here [21:31:10] Dereckson, wikimedia is already listed above [21:31:16] you might want to merge them [21:31:36] and Dereckson, pl? means p with an optional l [21:32:25] Let's fix that. [21:32:45] Dereckson, (?i)^https?://(maps|phabricator|wikitech|incubator|pl)\.(m\.)?wikimedia\.org(/|$) [21:32:55] hmmmmm [21:33:15] bblack, ^ [21:33:18] why the optional https? [21:33:36] What about a line for the .m. ones? Incubator + pl? [21:34:15] Krenair, its a referrer, are we positive that every device on the planet will send https? [21:34:19] 06Operations, 10Wikimedia-Apache-configuration, 07Varnish: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629#2205049 (10matmarex) Some examples courtesy of @ori, for each of the bugs mentioned above. (Something went... [21:34:24] its a voodoo :) [21:34:41] Dereckson, what do you mean? [21:35:05] my RE above adds optional m. to every wikimedia.org, just in case they exist [21:35:20] i wouldn't mind enabling it for all wikimedia.org, without the restriction [21:35:26] && req.http.referer !~ "(?i)^https?://(incubator|pl)\.(m\.)?wikimedia\.org(/|$)" [21:35:51] so we avoid not existing wikitech.m.wikimedia.org or phabricator.m.wikimedia.org [21:36:26] Dereckson, its a filter for referer headers, so it doesn't matter if they don't actually exist - the simpler the better :) [21:36:54] That would also make the file more easy to read, so simpler: [21:36:56] && req.http.referer !~ "(?i)^https?://(maps|phabricator|wikitech)\.wikimedia\.org(/|$)" [21:36:59] && req.http.referer !~ "(?i)^https?://(incubator|pl)\.(m\.)?wikimedia\.org(/|$)" [21:37:03] We have one line for "special" sites [21:37:09] Another for "regular" wikis [21:39:11] (03CR) 10Bartosz Dziewoński: "Thanks Aaron, I'll have that backported first." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/282806 (https://phabricator.wikimedia.org/T132200) (owner: 10Bartosz Dziewoński) [21:39:17] !log hoo@tin Started scap: Update Wikibase to master (wmf21) [21:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:40:29] Dereckson, sure [21:40:55] Dereckson, the problem is that i would never remember which wikimedia.org has the m. and which dont [21:41:02] okay so one line [21:41:14] To avoid to create "traps" in config is a bad idea. [21:41:18] (a good idea) [21:41:33] exactly [21:42:10] otherwise we might by accident put some other domain in one line but not in the other, and accidentally not enable .m. for some site [21:42:27] I concur. [21:42:46] (03CR) 10Yurik: [C: 04-1] "pl? is wrong" [puppet] - 10https://gerrit.wikimedia.org/r/283332 (https://phabricator.wikimedia.org/T132510) (owner: 10Dereckson) [21:44:11] (03PS3) 10Dereckson: Add maps-cluster referer rules for pl.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/283332 (https://phabricator.wikimedia.org/T132510) [21:44:35] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2205087 (10Ottomata) > 2.0T wikistats Ja, this is why we don't backup! Too big! stat1001 is in the analytics cluster...so technically we could use HDFS as a holding pen. Migh... [21:47:02] Regexp looks good to me, tested with http://www.regexplanet.com/advanced/ruby/. [21:51:29] 06Operations, 10Wikimedia-Apache-configuration, 07Varnish: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629#2205129 (10matmarex) And for comparison, similar requests from my local testing wiki, with Apache and a ve... [21:52:07] (03CR) 10Dereckson: "PS3: regex discussed on #wikimedia-operations, and tested through http://www.regexplanet.com/advanced/ruby/." [puppet] - 10https://gerrit.wikimedia.org/r/283332 (https://phabricator.wikimedia.org/T132510) (owner: 10Dereckson) [22:00:01] PROBLEM - Apache HTTP on mw1211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:00:52] PROBLEM - HHVM rendering on mw1211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:02:30] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [22:03:40] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/282478/ looks like it makes sense [22:03:47] merge ok? [22:03:55] (03CR) 10Yurik: [C: 031] Add maps-cluster referer rules for pl.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/283332 (https://phabricator.wikimedia.org/T132510) (owner: 10Dereckson) [22:05:29] 06Operations: fix all "top-scope variable being used without an explicit namespace" across the puppet repo - https://phabricator.wikimedia.org/T125042#2205246 (10Dzahn) [22:07:09] mutante: sure [22:07:17] (03CR) 1020after4: [C: 031] Fix viewing raw php files in diffusion [puppet] - 10https://gerrit.wikimedia.org/r/282478 (owner: 10Paladox) [22:07:30] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM: Beta-cluster web server fills up /var/log with Apache logs - https://phabricator.wikimedia.org/T75262#2205257 (10Dzahn) Though, disabling the logs means you dont have to deal with data-retention issues. 
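The referer rule the channel settles on keeps a single line with an optional "m." label for every host, so nobody has to remember which wikimedia.org domains have a mobile variant. A minimal sketch that exercises that regex with PHP's preg_match (the live check is a `req.http.referer !~` rule in the Varnish VCL, Gerrit 283332; this only tests the pattern quoted above and shows why the earlier "pl?" was wrong):

<?php
// The combined referer pattern quoted in the discussion, with optional "m." for every host.
$pattern = '%(?i)^https?://(maps|phabricator|wikitech|incubator|pl)\.(m\.)?wikimedia\.org(/|$)%';

$referers = [
	'https://pl.wikimedia.org/wiki/Strona_główna',  // should match
	'https://pl.m.wikimedia.org/',                  // mobile variant, should match
	'https://maps.wikimedia.org/',                  // should match
	'https://p.wikimedia.org/',                     // must NOT match; the earlier "pl?" (p plus optional l) would have let it through
];
foreach ( $referers as $referer ) {
	echo ( preg_match( $pattern, $referer ) ? 'allowed  ' : 'rejected ' ) . $referer . "\n";
}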
[22:07:51] (03PS2) 10Dzahn: Fix viewing raw php files in diffusion [puppet] - 10https://gerrit.wikimedia.org/r/282478 (owner: 10Paladox) [22:07:57] (03CR) 10Dzahn: [C: 032] Fix viewing raw php files in diffusion [puppet] - 10https://gerrit.wikimedia.org/r/282478 (owner: 10Paladox) [22:11:23] !log hoo@tin Finished scap: Update Wikibase to master (wmf21) (duration: 32m 06s) [22:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:15:30] (03CR) 10Dzahn: [C: 031] Enable ChallengeResponseAuthentication [puppet] - 10https://gerrit.wikimedia.org/r/281629 (owner: 10Muehlenhoff) [22:15:36] yurik: what wikis should have wgKartographerWikivoyageMode set at true, wikivoyage + incubator? [22:16:11] Dereckson, wikivoyage + mediawiki.org (since it is used for documentation) [22:16:51] For Incubator, the rationale is because in the future, a new Wikivoyage project in a new language could start on it. [22:22:18] Oh, the extension isn't currently deployed on Incubator. [22:23:21] (03PS1) 10Dereckson: Set wgKartographerWikivoyageMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283339 [22:23:32] !log hoo@tin Synchronized php-1.27.0-wmf.21/./extensions/Wikidata/extensions/Wikibase/view/resources/jquery/wikibase/jquery.wikibase.statementview.RankSelector.js: touch (duration: 00m 26s) [22:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:25:10] (03PS1) 10Dereckson: Enable Kartographer on pl.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283340 (https://phabricator.wikimedia.org/T132510) [22:26:27] (03CR) 10Yurik: [C: 031] Set wgKartographerWikivoyageMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283339 (owner: 10Dereckson) [22:27:58] hm :/ [22:28:13] RL doesn't seem to pick up the new messages [22:28:16] Krinkle: ^ [22:28:21] Anything we can do about that [22:28:34] (03PS1) 10Ori.livneh: Force $_SERVER['SERVER_SOFTWARE'] to be "Apache" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283341 (https://phabricator.wikimedia.org/T132612) [22:30:09] mutante: we have a request to deploy an extension for a wiki, which is blocked by https://gerrit.wikimedia.org/r/#/c/283332/ (Varnish). Is that mergeable or should I include it for next puppet SWAT? [22:30:40] (03PS2) 10Ori.livneh: Force $_SERVER['SERVER_SOFTWARE'] to be "Apache" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283341 (https://phabricator.wikimedia.org/T132612) [22:30:45] hoo: Can you be more specific? Which message where, and what have you done so far? [22:31:47] Krinkle: New message has been added to Wikibase (and a RL module). I scaped it and the localization is present (as can be seen on https://www.wikidata.org/wiki/MediaWiki:Wikibase-statementview-references-counter) [22:32:01] But it's not propagating into RL [22:32:19] After the scap, I touched a file that is in the script array of the module, but not luck [22:34:10] hoo: file touching makes no difference to messages, ever. 
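The "Force $_SERVER['SERVER_SOFTWARE'] to be 'Apache'" patch (Gerrit 283341) ties back to the earlier IIS discussion: rather than changing how MediaWiki decides which path variable to trust, pin the value it inspects. The snippet below is only a guess at the shape of that config change, not a copy of the actual patch:

<?php
// Hypothetical wmf-config sketch (the real change is Gerrit 283341, task T132612).
// MediaWiki's path handling inspects $_SERVER['SERVER_SOFTWARE'] (see the
// SERVER_SOFTWARE/IIS discussion above); pinning it to "Apache" keeps the
// REQUEST_URI code path regardless of what HHVM reports.
$_SERVER['SERVER_SOFTWARE'] = 'Apache';

Per the discussion above, a more general fix would be a config variable that defaults to the current SERVER_SOFTWARE sniffing and can be overridden per installation.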
[22:34:12] just fyi [22:34:51] Ok, I thought it might re-pick up the whole module [22:35:57] https://www.wikidata.org/w/load.php?modules=jquery.wikibase.statementview&debug=false&_1 [22:36:07] "wikibase-statementview-references-counter":"\u003Cwikibase-statementview-references-counter\u003E" [22:36:25] https://www.wikidata.org/w/load.php?modules=jquery.wikibase.statementview&debug=false&_1&lang=nl [22:36:36] "wikibase-statementview-references-counter":"$1{{PLURAL:$2|0=|$3+$2$4}} [22:37:30] https://logstash.wikimedia.org/#/dashboard/elasticsearch/resourceloader [22:37:41] It exists in localisation cache / wfMessage now [22:37:44] but did not at first [22:38:12] "Failed to find wikibase-statementview-references-counter (de)" - /w/load.php?debug=false&lang=de&modules=startup&only=scripts&skin=vector [22:39:14] hoo: Looks like maybe the server started looking for that message too early. In which order did it get deployed? [22:39:45] Krinkle: what scap does, so mw-update-l10n first [22:40:17] Now that it is cached in MessageBlobStore it won't refresh until either 1) localisation cache bumps the touch key for this language, 2) the message is changed on-wiki for the specific language, 3) 1 week cache expires. [22:40:25] (03PS2) 10Dereckson: Enable Kartographer on pl.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283340 (https://phabricator.wikimedia.org/T132510) [22:41:23] hoo: Editing MeidaWiki: wikibase-statementview-references-counter should purge the blob for all languages [22:41:28] and then delete again [22:41:30] Try that? [22:41:42] Sounds reasonable [22:42:15] It seems like somehow scap bumped the cache key before it finished syncing [22:42:22] Which seems like a definitive possibility [22:43:14] scap *builds* the l10n on tin first but actually applies it on the hosts last [22:43:14] hoo: You could also new MessageBlobStore()->updateMessage( String $key ); [22:43:28] (note it is wiki specific) [22:44:14] bd808: Hm.. which script does it use? [22:44:15] bd808: hm... that sounds troublesome [22:44:26] I know that localisationUpdate (nightly) accounts for it [22:44:50] scap doesn't do the blob purge that l10nupdate does [22:45:15] Krinkle: That worked... but I found at least one other message that's missing [22:45:19] hoo: it's life. :) there is a long standing bug about this very case in Phab. [22:45:21] will try purging via MessageBlobStore [22:45:46] bd808: Yeah... I keep running into edge cases all the time :P [22:46:00] (03PS1) 10Dereckson: Add signature edit button for the Comments namespace to ru.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283347 (https://phabricator.wikimedia.org/T132241) [22:46:07] hoo: if your code would just ride the normal train... [22:46:57] an new branch deploy is synced to all hosts before the active versions is updated so it doesn't hit this particular issue [22:47:06] yeah, I would love that [22:47:52] I kind of forgot to branch in time this week... then Jan branched, but he got the wrong HEAD [22:47:58] so I updated our branch just now [22:48:08] doing all that per hand is a pain [22:48:32] hoo: LOoks fixed to me [22:48:48] indeed :) [22:48:57] 06Operations, 10Wikimedia-Apache-configuration, 07Varnish: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629#2205393 (10BBlack) With the examples, could you be more-specific about what's broken in them? 
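Two workarounds for the stale message blob come up here: a dummy edit of the MediaWiki:wikibase-statementview-references-counter page, which purges the blob for every language, or a per-wiki programmatic purge. A minimal sketch of the latter, assuming it is run through eval.php on the affected wiki; the call is the one Krinkle quotes above, so the exact constructor and signature are unverified:

<?php
// Sketch, e.g. via: mwscript eval.php --wiki=wikidatawiki
// Drop the cached message blob for this key so ResourceLoader rebuilds it from
// the freshly scap'd localisation cache instead of the stale copy in memcached.
$blobStore = new MessageBlobStore();
$blobStore->updateMessage( 'wikibase-statementview-references-counter' );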
[22:48:58] Still missing another message [22:49:23] the general fix for this type of problem will be when scap3 knows how to depool servers while they are getting code changes [22:49:43] then we won't get mixed state from a server that is being updated [22:50:49] found another message :/ [22:51:05] bd808: I don't think that would solve it in this case. [22:51:21] The problem isn't lack of back-compat. If it synced messages first and then the code, it'd be fine. [22:51:25] was it not a completely new message? [22:51:28] And while it does do that (I think?) [22:51:33] bd808: It was [22:51:41] depooling would fix it then [22:51:48] bd808: The problem is that something purged the cache key (which is in memcached) before all servers got the code [22:52:32] (03PS1) 10Dereckson: Consider all pages as valid content articles on pl.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283349 (https://phabricator.wikimedia.org/T131771) [22:52:35] So it was recombobulated and repopulated from another server too early [22:52:51] Hm.. yeah, I'm not sure actually [22:53:10] I guess if old server stayed pooled and handling existing requests with old code then it wouldn't look for that key in the first plae. [22:53:11] place* [22:53:34] bd808: Wait, did you say it syncs the new l10n files last? [22:53:43] yes [22:53:46] why? [22:53:50] well it rebuilds the cdb files last [22:53:53] I remember it was historically first [22:53:56] nope [22:54:28] mw-do-l10n, mw-sync-l01n then sync-common-all [22:54:31] scap bash [22:54:38] before we started using json for the shipping encoding (December 2014) it would have been inode order random [22:55:20] right [22:56:02] if it did things in that order it was before I ever saw the scap source (i.e. before December 2013) [22:56:23] bd808: Can you point to where it builds the l10n files and where it syncs them? Perhaps there's a relatively easy way to (re)trigger the cache touchkey aftereward [22:56:38] I only looked at scap source once, around 2012. [22:57:02] (before the rewrite that is, I've looked at the new one since then) [22:57:34] sure. -- https://phabricator.wikimedia.org/diffusion/MSCA/browse/master/scap/main.py;a4fa153570e0c6f3494eadd6de9d9435754b6c18$328 [22:58:41] lines 278-288 give an outline of the step order (ignore 289: that doesn't normally happen) [23:00:04] RoanKattouw ostriches Krenair MaxSem Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160413T2300). [23:00:04] MatmaRex matt_flaschen: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:27] hi [23:00:28] Hi. I've some patches to add in last minute to this windows. [23:00:52] (you can do everyone else before me, i'm in a short meeting) [23:01:33] Present [23:03:44] Okay, I can SWAT. [23:04:12] matt_flaschen: do you have a specific order between the Echo and config? [23:05:13] Dereckson, Echo first. [23:05:18] k [23:07:40] Tests are running. Meanwhile, I'll take one config patch. 
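The ru.wikinews patch queued for this SWAT (Gerrit 283347) enables the toolbar signature button in one extra namespace. The snippet below only illustrates how such a per-wiki override usually looks in InitialiseSettings.php; the setting layout and the namespace id are assumptions, not the content of the patch:

<?php
// Hypothetical InitialiseSettings.php fragment (the deployed change is Gerrit 283347).
// $wgExtraSignatureNamespaces adds namespaces, beyond talk pages, where the
// edit toolbar offers the signature button.
'wgExtraSignatureNamespaces' => [
	'default' => [],
	'ruwikinews' => [ 102 ],  // placeholder id for the local Comments namespace; the real id is wiki-specific
],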
[23:08:09] MatmaRex: okay, ping when you're back [23:09:20] (03CR) 10Dereckson: [C: 032] Add signature edit button for the Comments namespace to ru.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283347 (https://phabricator.wikimedia.org/T132241) (owner: 10Dereckson) [23:09:54] (03Merged) 10jenkins-bot: Add signature edit button for the Comments namespace to ru.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283347 (https://phabricator.wikimedia.org/T132241) (owner: 10Dereckson) [23:10:27] Dereckson: (i'm around now) [23:11:10] k [23:11:45] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Add signature edit button for the Comments namespace to ru.wikinews (Task T132241, [[Gerrit:283347]]) (duration: 00m 38s) [23:11:46] T132241: Enabling signature button in toolbar for the Comments namespace in ruwikinews - https://phabricator.wikimedia.org/T132241 [23:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:12:59] Doesn't work, but could be cache related. Will retest later. [23:13:24] Zuul tests for Echo still running, MatmaRex, you're next. [23:15:30] MaxSem Krenair or greg-g: https://gerrit.wikimedia.org/r/#/c/283247/2 looks SWATtable ? [23:16:28] 283347 works. [23:18:20] (03PS2) 10Dereckson: Consider all pages as valid content articles on pl.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283349 (https://phabricator.wikimedia.org/T131771) [23:19:37] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283349 (https://phabricator.wikimedia.org/T131771) (owner: 10Dereckson) [23:21:03] (03Merged) 10jenkins-bot: Consider all pages as valid content articles on pl.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283349 (https://phabricator.wikimedia.org/T131771) (owner: 10Dereckson) [23:23:05] matt_flaschen: okay, Echo changes merged. We can deploy them. [23:23:08] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Consider all pages as valid content articles on pl.wikisource (Task T131771, [[Gerrit:283349]]) (duration: 00m 35s) [23:23:09] T131771: Set $wgArticleCountMethod to 'any' for plwikisource - https://phabricator.wikimedia.org/T131771 [23:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:25:31] !log Ran mwscript updateArticleCount.php --wiki=plwikisource --update [23:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:26:23] Sounds good. Dereckson, let me know when Echo is done, and I can test quickly before the mediawiki-config. [23:26:32] k [23:27:14] 06Operations, 10Traffic, 06Zero: Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two - https://phabricator.wikimedia.org/T124482#2205468 (10BBlack) The Zero picture is clearer now from some email threads with @DFoy and @dr0ptp4kt . We're clear for this change on the Zero front already,... [23:35:37] Dereckson, hmm [23:35:39] Dereckson, I would as long as Aaron or Bartosz are around [23:35:54] i'm here. [23:35:55] !log dereckson@tin Synchronized php-1.27.0-wmf.20/extensions/Echo/modules/ooui/mw.echo.ui.FooterNoticeWidget.js: Fixes for Echo ([[Gerrit:282714]] + [[Gerrit:282715]]) (duration: 00m 26s) [23:36:02] matt_flaschen: you can test ^ [23:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:36:14] i suppose there might be some performance stats we could watch for catastrophic regressions? 
not sure where to find them though [23:36:18] Krenair: ^ [23:36:33] i'm pretty sure we track page save duration across time [23:37:15] not sure. ori? [23:37:17] matt_flaschen: the .less file is no-op? [23:37:21] yeah, somewhere [23:37:23] was it gdash? [23:37:28] Dereckson, no, it's required. [23:37:32] k [23:37:33] no... [23:37:51] grafana? [23:38:01] not graphite [23:38:23] MatmaRex, https://grafana.wikimedia.org/dashboard/db/save-timing [23:38:27] logmsgbot? [23:38:28] !log dereckson@tin Synchronized php-1.27.0-wmf.20/extensions/Echo/modules/ooui/styles/mw.echo.ui.FooterNoticeWidget.less: Fix for Echo footer ([[Gerrit:282715]]) (duration: 00m 27s) [23:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:38:44] matt_flaschen: here you are ^ [23:39:28] (03PS3) 10Dereckson: Enable Echo survey on French-language wikis (retry) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283330 (https://phabricator.wikimedia.org/T131893) (owner: 10Mattflaschen) [23:39:58] Krenair: oh, neat. i guess it can't be filtered per-wiki? [23:40:09] not there... [23:40:39] (03PS1) 10Dzahn: dhcp: let ulsfo public subnet use carbon as TFTP [puppet] - 10https://gerrit.wikimedia.org/r/283359 (https://phabricator.wikimedia.org/T123674) [23:43:19] Dereckson, yeah, the config patch is good to go. [23:43:39] k [23:43:49] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283330 (https://phabricator.wikimedia.org/T131893) (owner: 10Mattflaschen) [23:44:00] (03PS2) 10Dzahn: dhcp: let ulsfo public subnet use carbon as TFTP [puppet] - 10https://gerrit.wikimedia.org/r/283359 (https://phabricator.wikimedia.org/T123674) [23:44:14] (03Merged) 10jenkins-bot: Enable Echo survey on French-language wikis (retry) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283330 (https://phabricator.wikimedia.org/T131893) (owner: 10Mattflaschen) [23:46:13] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable Echo survey on French-language wikis (Task T131893, [[Gerrit:283330]]) (duration: 00m 26s) [23:46:14] T131893: Invite French users to take the Notification Survey (using the Notifications panel) - https://phabricator.wikimedia.org/T131893 [23:46:16] matt_flaschen: and here you're with the config patch ^ [23:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:46:32] Thanks, Dereckson, testing now. [23:47:03] (03CR) 10Dzahn: [C: 032] dhcp: let ulsfo public subnet use carbon as TFTP [puppet] - 10https://gerrit.wikimedia.org/r/283359 (https://phabricator.wikimedia.org/T123674) (owner: 10Dzahn) [23:47:45] 06Operations, 10Wikimedia-Apache-configuration, 07Varnish: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629#2205523 (10matmarex) For T123276: * P2895 https://commons.wikimedia.org/w/api.php?oxrigin=https%3A%2F%2Fw... [23:49:36] MatmaRex: okay, Zuul is running tests for 283333 [23:50:18] Dereckson, works as expected. Thanks. [23:50:25] (03PS1) 10Dzahn: network: remove bast4001 SLAAC IPs [puppet] - 10https://gerrit.wikimedia.org/r/283361 (https://phabricator.wikimedia.org/T123674) [23:50:31] You're welcome matt_flaschen. Thanks for testing. [23:50:57] Pending running tests and merge, let's resume our config patches. [23:51:43] MatmaRex's blocked by AbuseFilter update, so let's merge the hi.wikt one. 
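The pl.wikisource change and the updateArticleCount.php run belong together: switching $wgArticleCountMethod only changes how pages are counted going forward, so existing pages need one recount. A sketch of the pair, with the InitialiseSettings layout assumed rather than copied from Gerrit 283349:

<?php
// Hypothetical InitialiseSettings.php fragment (the deployed change is Gerrit 283349).
// 'any' counts every non-redirect page in a content namespace as an article;
// the default 'link' method requires the page to contain at least one wikilink.
'wgArticleCountMethod' => [
	'default' => 'link',
	'plwikisource' => 'any',
],

After syncing, existing pages are recounted once with the command logged above: mwscript updateArticleCount.php --wiki=plwikisource --update.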
[23:51:50] (03CR) 10Dereckson: [C: 032] Import sources on hi.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282843 (https://phabricator.wikimedia.org/T132417) (owner: 10Dereckson) [23:53:26] Needs rebase [23:54:21] PROBLEM - RAID on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:54:22] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:54:26] (03PS2) 10Dereckson: Import sources on hi.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282843 (https://phabricator.wikimedia.org/T132417) [23:54:31] PROBLEM - configured eth on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:54:50] PROBLEM - dhclient process on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:54:52] (03CR) 10Dereckson: [C: 032] Import sources on hi.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282843 (https://phabricator.wikimedia.org/T132417) (owner: 10Dereckson) [23:55:10] PROBLEM - salt-minion processes on alsafi is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [23:55:10] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:55:10] PROBLEM - Check size of conntrack table on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:55:12] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:55:22] PROBLEM - DPKG on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:55:31] PROBLEM - Disk space on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:55:40] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:55:40] PROBLEM - graphoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:55:51] PROBLEM - puppet last run on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:00] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:00] PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:01] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:01] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:01] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:01] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:01] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:01] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:56:11] RECOVERY - configured eth on alsafi is OK: OK - interfaces up [23:56:12] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [23:56:12] !log ssh alsafi [23:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:56:27] Dereckson, I have another patch to deploy, can do it myself [23:56:31] RECOVERY - dhclient process on alsafi is OK: PROCS OK: 0 processes with command name dhclient [23:56:51] RECOVERY - salt-minion processes on alsafi is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:56:51] RECOVERY - Check size of conntrack table on alsafi is OK: OK: nf_conntrack is 0 % full [23:56:51] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [23:57:01] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [23:57:10] RECOVERY - DPKG on alsafi is OK: All packages OK [23:57:11] RECOVERY - Disk space on alsafi is OK: DISK OK [23:57:20] RECOVERY - graphoid endpoints health on scb2001 is OK: All endpoints are healthy [23:57:20] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [23:57:40] RECOVERY - puppet last run on alsafi is OK: OK: Puppet is currently enabled, last run 29 minutes ago with 0 failures [23:57:41] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [23:57:41] RECOVERY - graphoid endpoints health on scb2002 is OK: All endpoints are healthy [23:57:41] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [23:57:41] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [23:57:42] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [23:57:42] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [23:57:42] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [23:57:42] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [23:57:46] MaxSem: k, we've still two patches for MatmaRex (pending zuul) before [23:58:00] RECOVERY - RAID on alsafi is OK: OK: no RAID installed [23:59:13] whoo it merged at last. [23:59:16] (03PS3) 10Dereckson: Import sources on hi.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282843 (https://phabricator.wikimedia.org/T132417)
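The last patch still pending here, "Import sources on hi.wiktionary" (Gerrit 282843), defines which interwiki prefixes Special:Import accepts as transwiki sources on that wiki. The prefixes below are placeholders; only the shape of the override is the point:

<?php
// Hypothetical InitialiseSettings.php fragment (the real list is in Gerrit 282843, task T132417).
// $wgImportSources lists the interwiki prefixes that Special:Import offers
// as transwiki import sources on this wiki.
'wgImportSources' => [
	'default' => [],
	'hiwiktionary' => [ 'en', 'hi', 'sa' ],  // placeholder prefixes, not the deployed list
],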