[00:00:43] RECOVERY - logstash process on logstash1003 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash [00:00:48] mutante, I'm pretty clueless how the devs approach things. [00:01:08] !log Restarted logstash on logstash1003 again. The first try apparently didn't take [00:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:02:29] Cyberpower678: we can make the config change anytime and upload it to code review. it's more about the timing, getting somebody to actually deploy on the day you want it. there are fixed deployment schedules though, so you can pick the right one [00:02:51] Krenair: yeah, git is probably better [00:03:00] we should make a non-hdpi variant too [00:03:08] How long do you suppose it takes to make another 2000 articles? [00:03:18] Cyberpower678: probably easiest if you dump the info (link to new logo, which exact date it should be up, etc) to a phabricator ticket. if you can. people will take it from there [00:03:40] I can't find the appropriate project on phab [00:03:48] I just keep getting the apps [00:04:05] Cyberpower678: Wikimedia-Site-Requests [00:04:13] doing... [00:05:39] Krenair: actually, why not via common.css?
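If the Common.css route floated above were taken, the milestone logo would be overridden in MediaWiki:Common.css rather than via $wgLogo. A minimal sketch, where the selector, the sizing, and the choice of uploaded file are assumptions for illustration only:

```css
/* MediaWiki:Common.css — temporary milestone logo override (sketch only;
   the selector, file, and sizing below are assumptions, not the change
   that was actually deployed). */
#p-logo a {
    background-image: url("//upload.wikimedia.org/wikipedia/commons/9/9d/Wikipedia-logo-v2-en_5m_articles.png") !important;
    background-size: contain;
}
```

Being site CSS, a change like this rolls out with the normal cache refresh instead of a code deploy, which is the appeal discussed here.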
[00:05:57] fuzheado has the high-res versions [00:06:22] https://upload.wikimedia.org/wikipedia/commons/9/9d/Wikipedia-logo-v2-en_5m_articles.png is already quite hi-res [00:06:53] optipng with -brute could not compress it further, too, which means whoever made it did a thorough job [00:07:17] ori, I have no opinion on this, if you think it can just be common.css that's fine with me [00:07:53] I think it can just be Common.css, but a phab task would be good anyway [00:08:02] great, https://phabricator.wikimedia.org/T117139#1767472 [00:08:04] (03PS1) 10Mattflaschen: Remove Flow cache version override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249935 (https://phabricator.wikimedia.org/T117138) [00:08:24] PROBLEM - Restbase endpoints health on xenon is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)) [00:08:44] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)) [00:08:54] !log radium - back in service [00:08:58] mutante, https://phabricator.wikimedia.org/T117139?workflow=create [00:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:09:03] !log catrope@tin Synchronized php-1.27.0-wmf.4/extensions/Flow/: Fix regressions from SWAT (duration: 00m 20s) [00:09:06] ori: I found undeployed Scribunto changes on tin (one already there, one that rode along with git pull) [00:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:09:34] yeah, I
sync-file'd the fix and submitted on git second [00:09:41] Cyberpower678: ah, yes. looks good. so we have to calculate the growth rate [00:09:44] Oh, OK [00:10:08] is it possible to automatically switch the logo when we hit 5,000,000 ? :) [00:10:24] it's going to happen tomorrow or Saturday [00:10:47] Saturday has no swat .. hrmm [00:10:51] I'm preparing variants in the appropriate sizes [00:11:30] if it's common.css it won't need us to deploy [00:11:58] if it does need deployment, I'm comfortable running a logo change at the weekend [00:12:22] as long as it's planned [00:13:59] common.css takes additional time to load, won't it? So won't the system end up loading two logos? [00:17:19] it'll take ~5 minutes to take effect [00:17:24] and yeah, I think they will [00:17:45] but if ori says it's ok, then it's fine [00:18:57] yes, it's fine [00:19:07] I'd prefer just changing the logo. common.css isn't applied everywhere. [00:19:07] and kinda cool too [00:20:05] it won't be displayed on the login page and preferences [00:20:07] that's about it [00:20:30] 6operations, 5Patch-For-Review: upgrade radium to jessie - https://phabricator.wikimedia.org/T116963#1767584 (10Dzahn) p:5Triage>3Normal [00:22:10] Meh. [00:22:19] I'll leave it to you guys. [00:22:30] * Cyberpower678 needs to get back to college. [00:23:47] it's not square [00:24:02] it's taller than it is wide [00:24:22] https://www.mediawiki.org/wiki/Manual:$wgLogo says it should be 135x135 (i.e., square). I'm not sure whether it matters in practice. [00:24:38] greg-g: Is it OK for me to add a deploy window on Monday for a Flow maintenance script?
I did https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=198362&oldid=198347 but if I'm supposed to do this differently, please advise [00:24:50] maybe we can make it the logo on en betawiki [00:25:12] ori, WP's is actually 135x160 [00:25:14] the logo that is styled has a height of 16px on desktop [00:25:23] *160px [00:25:39] 6operations, 5Patch-For-Review: upgrade radium to jessie - https://phabricator.wikimedia.org/T116963#1767608 (10Dzahn) 5Open>3Resolved made a backup of the entire /var/lib/tor scheduled downtime in icinga reinstalled server with jessie removed puppet-cert and salt-key and resigned after initial puppet run... [00:26:03] RoanKattouw: looks good, thank you [00:26:16] 6operations: upgrade radium to jessie - https://phabricator.wikimedia.org/T116963#1767611 (10Dzahn) [00:26:44] 135 x 155 [00:27:31] 6operations: build newer tor packages - https://phabricator.wikimedia.org/T116964#1767615 (10Dzahn) we have 0.2.6.9 in our own repo but only used distro package 0.2.5.10-1~trusty+1. now on jessie 0.2.5.12-1 so this wasn't actually blocking the jessie upgrade, but we still want newer packages [00:29:23] (03CR) 10Dzahn: "is the "b" network really a /21 while a, c and d are all /24 ?"
[dns] - 10https://gerrit.wikimedia.org/r/249919 (owner: 10Rush) [00:33:50] going to deploy https://gerrit.wikimedia.org/r/#/c/249937/ [00:34:16] !log wdqs1001 - restarted NTP (unknown offset) [00:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:34:23] RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.191 second response time [00:34:25] !log mw1193 - restarted HHVM (socket timeout) [00:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:34:37] !log labvirt1010 - started salt-minion (no such process) [00:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:35:33] RECOVERY - HHVM rendering on mw1193 is OK: HTTP OK: HTTP/1.1 200 OK - 70548 bytes in 1.541 second response time [00:37:14] (03CR) 10BryanDavis: Fixed getMWScriptWithArgs() user error message (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249803 (owner: 10Aaron Schulz) [00:39:15] (03PS2) 10BryanDavis: Fixed getMWScriptWithArgs() user error message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249803 (owner: 10Aaron Schulz) [00:39:54] (03CR) 10BryanDavis: Fixed getMWScriptWithArgs() user error message (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249803 (owner: 10Aaron Schulz) [00:40:02] (03CR) 10BryanDavis: [C: 031] Fixed getMWScriptWithArgs() user error message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249803 (owner: 10Aaron Schulz) [00:41:53] RECOVERY - NTP on wdqs1001 is OK: NTP OK: Offset -0.00390625 secs [00:42:44] thanks mutante [00:44:29] welcome. btw. the NTP "unknown offset" ones. they are fixed by stop/start and waiting a couple minutes..
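The milestone arithmetic from earlier in the log (how long until another 2,000 articles, i.e. when enwiki crosses 5,000,000) comes down to dividing the remaining article count by a daily growth rate. A back-of-the-envelope sketch; the counts and rate below are purely illustrative assumptions, not enwiki's actual figures:

```python
# Estimate days until an article-count milestone from a daily growth rate.
# All numbers in the example call are illustrative assumptions.
def days_until(target, current, per_day):
    remaining = max(0, target - current)
    # Round up: a partial day of growth still means waiting out that day.
    return -(-remaining // per_day)

print(days_until(5000000, 4998000, 1000))  # 2000 articles at ~1000/day -> 2
```

With ~1000 new articles a day, "tomorrow or Saturday" for the last 2,000 is exactly what this kind of estimate produces.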
be back later [00:48:46] !log tgr@tin Synchronized php-1.27.0-wmf.4/includes/specials/SpecialUserlogin.php: T117027 (duration: 00m 18s) [00:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:03:42] ori: fuzheado changed the font size a bit, is https://commons.wikimedia.org/wiki/File:Wikipedia-logo-v2-en_5m_articles_270_white.png ok or can it be optimized? (before I ask a commons sysop to protect it) [01:05:39] legoktm[NE]: can be optimized, sec [01:07:32] legoktm[NE]: i uploaded a new version that is optimized, can be locked now [01:07:56] ori: thanks, could you do https://commons.wikimedia.org/wiki/File:Wikipedia-logo-v2-en_5m_articles_135_white.png too? :) [01:09:02] legoktm[NE]: {{done}} [01:09:11] 6operations: wikimedia.org email alias with +something sub-addressing bounces - https://phabricator.wikimedia.org/T117144#1767684 (10bd808) 3NEW [01:10:01] thanks :D [01:15:19] 6operations, 10ops-codfw, 6Labs, 10Labs-Infrastructure: on-site tasks for labs deployment cluster - https://phabricator.wikimedia.org/T117107#1767700 (10Papaul) [01:18:07] in grafana, can I transclude a graph from one dashboard into another and thus get all changes to it automatically updated (iow: right now I'm just manually copying them to our "KPIs" dashboard, which seems wrong) [01:19:15] greg-g: as a one time thing or permanently, such that updating the graph in one place will always update it in the other? 
[01:31:40] ori: right [01:31:50] permanently [01:32:05] although, I guess I could deal with one-time [01:32:19] it'd save me 90% of my time doing this by itself [01:38:02] greg-g: you can 'view JSON' [01:38:09] copy the JSON [01:38:22] then go to dashboards -> import [01:38:33] and there's a possibility to import a dashboard from a JSON blob [01:38:54] it's clunky but it allows you to use $EDITOR to perform some text surgery [01:39:48] you can't update the JSON of an existing dashboard in-place AFAIK [01:40:04] so you have to actually create a new one, delete the old one, and rename the new one to the old name [01:41:31] * greg-g nods [01:41:34] that helps, thanks ori [01:43:08] (03PS1) 10Ori.livneh: Enable Scribunto function stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249944 [01:43:30] (03CR) 10Ori.livneh: [C: 032] Enable Scribunto function stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249944 (owner: 10Ori.livneh) [01:44:45] (03CR) 10Ori.livneh: [V: 032] Enable Scribunto function stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249944 (owner: 10Ori.livneh) [02:07:43] 6operations, 7Mail: wikimedia.org email alias with +something sub-addressing bounces - https://phabricator.wikimedia.org/T117144#1767794 (10greg) [02:27:28] !log l10nupdate@tin Synchronized php-1.27.0-wmf.4/cache/l10n: l10nupdate for 1.27.0-wmf.4 (duration: 06m 13s) [02:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:30:22] RECOVERY - check_mysql on lutetium is OK: Uptime: 123733 Threads: 1 Questions: 25799097 Slow queries: 2185 Opens: 5656 Flush tables: 2 Open tables: 64 Queries per second avg: 208.506 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [02:30:50] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.4) at 2015-10-30 02:30:50+00:00 [02:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:05:58] !log ori@tin Synchronized 
php-1.27.0-wmf.4/extensions/Scribunto/common/Hooks.php: I60b9eb617: When logging perf stats, include wfWikiId() in metric key (duration: 00m 18s) [04:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:06:36] I'm going to clear the data that is already in scribunto.traces.* so we don't end up with a mix of prefixed and unprefixed metrics [04:16:57] (03PS1) 10Ori.livneh: ScribuntoSlowFunctionThreshold: 0.90 -> 0.99 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249963 [04:17:33] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 23.81% of data above the critical threshold [100000000.0] [04:18:17] (03CR) 10Ori.livneh: [C: 032] ScribuntoSlowFunctionThreshold: 0.90 -> 0.99 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249963 (owner: 10Ori.livneh) [04:18:23] (03Merged) 10jenkins-bot: ScribuntoSlowFunctionThreshold: 0.90 -> 0.99 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249963 (owner: 10Ori.livneh) [04:19:25] !log ori@tin Synchronized wmf-config/CommonSettings.php: I1d0337b58: ScribuntoSlowFunctionThreshold: 0.90 -> 0.99 (duration: 00m 17s) [04:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:52:05] (03PS1) 10Ori.livneh: Update error page locations for Ibd8f69a54ac [puppet] - 10https://gerrit.wikimedia.org/r/249965 [04:52:07] (03PS1) 10Ori.livneh: Add perf-admins to Graphite role [puppet] - 10https://gerrit.wikimedia.org/r/249966 [04:58:55] (03CR) 10Krinkle: "See also https://github.com/wikimedia/operations-puppet/blob/812f280d16acfe3083259e8dfa7ce12ebf71da87/modules/mediawiki/templates/php/wmer" [puppet] - 10https://gerrit.wikimedia.org/r/249965 (owner: 10Ori.livneh) [05:00:51] (03PS2) 10Ori.livneh: Update error page locations for Ibd8f69a54ac [puppet] - 10https://gerrit.wikimedia.org/r/249965 [05:11:08] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Oct 30 05:11:08 UTC 2015 (duration 11m 7s) [05:11:14] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:36:56] 6operations, 7Mail, 15User-greg: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1767989 (10greg) [05:51:27] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1768017 (10Tgr) >>! In T102566#1761834, @saper wrote: > Question: wouldn't that be possible to ship the certificate as a para... [06:24:43] !log krinkle@tin Synchronized php-1.27.0-wmf.4/includes/jobqueue/JobQueueRedis.php: (no message) (duration: 00m 18s) [06:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:26:14] PROBLEM - Disk space on elastic1008 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=95%) [06:30:43] PROBLEM - puppet last run on mw2024 is CRITICAL: CRITICAL: puppet fail [06:30:53] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:53] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:04] PROBLEM - puppet last run on aqs1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:32] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:33] PROBLEM - puppet last run on db1045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:03] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:04] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:13] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:23] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:24] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:33] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 1 
failures [06:32:44] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:52] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:52] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:52] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:13] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:44] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:52] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:02] PROBLEM - puppet last run on mw2136 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:14] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:51:02] PROBLEM - Hadoop NodeManager on analytics1038 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [06:54:44] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [06:55:12] RECOVERY - check_mysql on db1008 is OK: Uptime: 8431640 Threads: 2 Questions: 98038883 Slow queries: 56968 Opens: 95626 Flush tables: 2 Open tables: 64 Queries per second avg: 11.627 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [06:55:53] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [06:56:04] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:56:12] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:56:23] RECOVERY - puppet last run on aqs1002 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:56:44] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently 
enabled, last run 16 seconds ago with 0 failures [06:56:44] RECOVERY - puppet last run on db1045 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:57:03] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:57:12] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:57:14] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:23] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:23] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:33] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:42] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:42] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:53] RECOVERY - puppet last run on mw2024 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:58:02] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:02] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:03] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:03] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:13] RECOVERY - Hadoop NodeManager on analytics1038 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [06:58:23] 
RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:13] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:23] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Puppet has 1 failures [07:24:42] RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:36:24] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [100000000.0] [07:54:24] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [08:26:03] RoanKattouw_away: u pinged ? [08:40:26] hmm, seems like there was something wrong potentially with TMH ? but it seems like it was transient perhaps... [08:40:32] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [08:40:43] since I haven't been lynched, i'm assuming everything is ok now :) [08:41:34] <_joe_> thedj: tmh issues? when? [08:42:51] <_joe_> thedj: as far as the systems are concerned, I just see some load, but nothing remotely worrying [08:43:15] hu?
elastic1008 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=95%) [08:43:33] <_joe_> dcausse: uhm lemme take a look [08:44:01] <_joe_> dcausse: it's / [08:44:04] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [08:44:08] <_joe_> not the ES directory [08:44:24] no, it's /, logs I think [08:45:25] <_joe_> production-search-eqiad_index_indexing_slowlog.log [08:46:18] <_joe_> !log removed production-search-eqiad_index_indexing_slowlog.log.{7,8} on es1008 [08:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:47:03] <_joe_> dcausse: that log is horribly verbose indeed [08:47:08] <_joe_> can we tune that down? [08:47:13] yes it contains a copy of the doc :/ [08:47:23] RECOVERY - Disk space on elastic1008 is OK: DISK OK [08:47:28] <_joe_> how stupid is that? [08:47:37] <_joe_> let's complain with Elastic engineers [08:47:43] <_joe_> manybubbles: I complain!!! [08:47:48] :) [08:48:04] <_joe_> manybubbles: we want our money back, or at least you.
[08:49:01] <_joe_> dcausse: jokes aside, I'd tune that logging down [08:49:10] _joe_: I'll create a task for that [08:49:29] <_joe_> dcausse: I guess we should do it today, or we're gonna run out of disk pretty soon [08:56:44] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Test open graph via native scraper returned the unexpected status 520 (expecting: 200) [08:56:44] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Test open graph via native scraper returned the unexpected status 520 (expecting: 200) [08:58:57] _joe_: roan pinged last night on something RL'y regarding tmh [09:00:03] but i think it was just deploy causing some transient issues [09:01:37] <_joe_> I guess so, yes [09:01:58] <_joe_> these citoid errors sound both scary and unactionable [09:04:22] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [09:04:22] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [09:07:26] <_joe_> dcausse: I think there is something wrong with elastic1008 anyways [09:07:38] <_joe_> no other server has that many slow index logs [09:08:30] _joe_: checking, maybe it's related to wikidata reindex [09:11:28] 6218 lines in this file => 2.8G , only TRACE ... [09:11:50] <_joe_> dcausse: oh man [09:12:11] <_joe_> Trace? [09:12:12] <_joe_> why?
[09:12:20] <_joe_> the log is set to INFO AFAICT [09:12:27] <_joe_> index.indexing.slowlog: INFO, index_indexing_slow_log_file [09:12:46] yes strange maybe something has been changed at runtime, looking [09:14:01] <_joe_> ok thanks :) [09:15:05] _joe_: looks like a bug: https://github.com/elastic/elasticsearch/issues/7461 [09:15:23] I will change the threshold for TRACE to disable it [09:16:01] <_joe_> dcausse: I was about to suggest that, yes [09:16:11] * _joe_ waves fist at manybubbles [09:16:16] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#1768225 (10Metanish) FWIW, this would help offline server distributions like the [[ http://schoolserver.org | XSCE ]] and many others. [09:18:11] !log elastic in eqiad setting index.indexing.slowlog.threshold.index.trace to -1 [09:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:19:55] 6operations: migrate pollux/plutonium into VMs - https://phabricator.wikimedia.org/T117182#1768226 (10akosiaris) 3NEW a:3akosiaris [09:29:27] 6operations, 10vm-requests: Site: 2 VM request for OIT LDAP mirror - https://phabricator.wikimedia.org/T117183#1768240 (10akosiaris) 3NEW [09:29:57] 6operations: migrate pollux/plutonium into VMs - https://phabricator.wikimedia.org/T117182#1768249 (10akosiaris) [09:29:58] 6operations, 10vm-requests: Site: 2 VM request for OIT LDAP mirror - https://phabricator.wikimedia.org/T117183#1768250 (10akosiaris) [09:32:18] <_joe_> dcausse: we should persist the setting in the config maybe? 
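Persisting the runtime tweak in config, as suggested above, would mean pinning the slow-log thresholds in elasticsearch.yml next to the `index.indexing.slowlog: INFO, index_indexing_slow_log_file` line quoted from it. A sketch of what that fragment would roughly look like; the exact template line is an assumption (the real change went through the puppet-managed elasticsearch.yml.erb):

```yaml
# elasticsearch.yml (sketch): keep the INFO-level slow-log destination, but
# disable the TRACE threshold that was flooding / with per-document traces.
index.indexing.slowlog: INFO, index_indexing_slow_log_file
index.indexing.slowlog.threshold.index.trace: -1
```

A threshold of -1 disables that level entirely, matching the runtime `!log` entries above.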
[09:38:11] (03PS1) 10Giuseppe Lavagetto: elasticsearch: disable indexing slow log at TRACE level [puppet] - 10https://gerrit.wikimedia.org/r/249973 [09:38:18] <_joe_> dcausse: ^^ [09:38:23] PROBLEM - HTTP on dataset1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:40:03] RECOVERY - HTTP on dataset1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5730 bytes in 0.001 second response time [09:42:10] (03CR) 10DCausse: "I think this setting is set in elasticsearch.yml.erb" [puppet] - 10https://gerrit.wikimedia.org/r/249973 (owner: 10Giuseppe Lavagetto) [09:42:50] <_joe_> dcausse: oh, I might have overlooked it [09:43:48] <_joe_> and yes, you're right [09:44:13] (03PS2) 10Giuseppe Lavagetto: elasticsearch: disable indexing slow log at TRACE level [puppet] - 10https://gerrit.wikimedia.org/r/249973 [09:44:53] (03PS3) 10DCausse: elasticsearch: disable indexing slow log at TRACE level [puppet] - 10https://gerrit.wikimedia.org/r/249973 (https://phabricator.wikimedia.org/T117181) (owner: 10Giuseppe Lavagetto) [09:45:09] (03CR) 10DCausse: [C: 031] elasticsearch: disable indexing slow log at TRACE level [puppet] - 10https://gerrit.wikimedia.org/r/249973 (https://phabricator.wikimedia.org/T117181) (owner: 10Giuseppe Lavagetto) [09:46:10] (03CR) 10Giuseppe Lavagetto: [C: 032] elasticsearch: disable indexing slow log at TRACE level [puppet] - 10https://gerrit.wikimedia.org/r/249973 (https://phabricator.wikimedia.org/T117181) (owner: 10Giuseppe Lavagetto) [09:47:27] _joe_: thanks [10:05:15] (03PS1) 10Alexandros Kosiaris: openldap: Minor comment fixes [puppet] - 10https://gerrit.wikimedia.org/r/249977 [10:06:30] (03PS2) 10Alexandros Kosiaris: openldap: Minor comment fixes [puppet] - 10https://gerrit.wikimedia.org/r/249977 [10:06:36] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] openldap: Minor comment fixes [puppet] - 10https://gerrit.wikimedia.org/r/249977 (owner: 10Alexandros Kosiaris) [10:17:16] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: puppet should safely manage 
cassandra start/stop - https://phabricator.wikimedia.org/T103134#1768360 (10fgiunchedi) [10:20:39] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1768363 (10saper) Right, might be difficult due to the way OpenSSL usually wants to have certificates. Will try to figure out! [10:30:42] !log elastic in eqiad wide settings for indexing slow threshold are ineffective, trying to set index settings (T117181) [10:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:42:32] !log elastic in eqiad setting indexing trace threshold to -1 for commons, ruwiki, frwiki and itwiki [10:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:53:17] _joe_: fyi, I'm supposed to be out today, I've sent a mail to erik and chase to inform them about this issue. Everything I've done is in T117181 [10:53:53] <_joe_> dcausse: oh I'm sorry [10:53:55] traces are no longer logged for indices mentioned in the ticket [10:54:05] <_joe_> ok, thanks a lot! [10:54:48] _joe_: np, thanks for your help [11:14:53] 6operations, 10ops-eqiad, 10netops: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#1768446 (10faidon) Unfortunately the SNMP failures are back as of 2015-10-30 07:18Z :( [11:16:12] PROBLEM - SSH on dataset1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:16:13] PROBLEM - configured eth on dataset1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:16:23] PROBLEM - HTTP on dataset1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:16:43] PROBLEM - Check size of conntrack table on dataset1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:16:56] apergos: ^ [11:17:02] PROBLEM - salt-minion processes on dataset1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:17:03] PROBLEM - dhclient process on dataset1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:17:14] PROBLEM - ElasticSearch health check for shards on nobelium is CRITICAL: CRITICAL - elasticsearch http://10.64.37.14:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.37.14, port=9200): Read timed out. (read timeout=4) [11:17:17] 6operations, 10ops-eqiad: Rack/Setup ms-be1019-ms-1021 - https://phabricator.wikimedia.org/T116542#1768450 (10fgiunchedi) I was trying to give the OS install a try, looks like the cabled interface is detected as eth1 not eth0 (also 3c:a8:2a:0d:5f:e8 is what's in puppet). Disabling 10G interface from bios did the tr... [11:18:02] PROBLEM - Disk space on dataset1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:03] PROBLEM - DPKG on dataset1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:03] PROBLEM - puppet last run on dataset1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:12] PROBLEM - RAID on dataset1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
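The flapping nobelium alert above comes from a checker that fetches `/_cluster/health` with a 4-second read timeout and maps the cluster status to a Nagios state. A minimal sketch of that mapping; the real Icinga check's implementation and its exact rules are assumptions here, not the deployed code:

```python
import json

# Map an Elasticsearch /_cluster/health response body to a Nagios-style
# state. Sketch only: the actual check script may differ.
STATES = {"green": "OK", "yellow": "WARNING", "red": "CRITICAL"}

def nagios_state(health_body):
    try:
        status = json.loads(health_body).get("status")
    except ValueError:  # unparseable body, e.g. a timeout/error page
        return "UNKNOWN"
    return STATES.get(status, "UNKNOWN")

sample = '{"cluster_name": "labsearch", "status": "green", "number_of_nodes": 1}'
print(nagios_state(sample))  # -> OK
```

A read timeout produces no body at all, which is why the check reports CRITICAL even though the cluster comes back green moments later.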
[11:19:41] <_joe_> hoo: I'll take a look [11:20:12] thanks [11:22:46] RECOVERY - ElasticSearch health check for shards on nobelium is OK: OK - elasticsearch status labsearch: status: green, number_of_nodes: 1, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1787, cluster_name: labsearch, relocating_shards: 0, active_shards: 1787, initializing_shards: 0, number_of_data_nodes: 1, delayed_unassigned_shards: 0 [11:25:46] <_joe_> !log powercycling dataset1001, stuck in some kernel task, unable to login from console [11:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:28:02] PROBLEM - HTTPS on dataset1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: No route to host [11:28:59] (03PS1) 10Hoo man: Also published bzip2 compressed Wikidata TTL dumps [puppet] - 10https://gerrit.wikimedia.org/r/249981 [11:29:02] RECOVERY - Disk space on dataset1001 is OK: DISK OK [11:29:04] RECOVERY - SSH on dataset1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2wmfprecise2 (protocol 2.0) [11:29:04] RECOVERY - configured eth on dataset1001 is OK: OK - interfaces up [11:29:13] RECOVERY - HTTP on dataset1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5730 bytes in 0.028 second response time [11:29:42] RECOVERY - Check size of conntrack table on dataset1001 is OK: OK: nf_conntrack is 0 % full [11:29:42] what's up? 
[11:29:44] RECOVERY - HTTPS on dataset1001 is OK: SSL OK - Certificate dumps.wikimedia.org valid until 2016-03-26 10:47:38 +0000 (expires in 147 days) [11:29:44] oh I see [11:29:52] RECOVERY - salt-minion processes on dataset1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:29:53] RECOVERY - dhclient process on dataset1001 is OK: PROCS OK: 0 processes with command name dhclient [11:29:55] <_joe_> xfs niceties [11:30:04] RECOVERY - DPKG on dataset1001 is OK: All packages OK [11:30:13] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 42 minutes ago with 0 failures [11:30:14] RECOVERY - RAID on dataset1001 is OK: OK: optimal, 3 logical, 36 physical [11:30:15] <_joe_> apergos: I rebooted the box, you might want to check if some dump has failed/is corrupted [11:30:26] <_joe_> I'm off for a bit, then interview [11:31:26] !log ignoring asw-d-eqiad on librenms [11:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:35:03] (03CR) 10Hashar: [C: 031] cache: vary statsd_server with hiera [puppet] - 10https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898) (owner: 10Hashar) [11:35:04] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures [11:39:42] PROBLEM - ElasticSearch health check for shards on nobelium is CRITICAL: CRITICAL - elasticsearch http://10.64.37.14:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.37.14, port=9200): Read timed out. 
(read timeout=4) [11:41:24] RECOVERY - ElasticSearch health check for shards on nobelium is OK: OK - elasticsearch status labsearch: status: green, number_of_nodes: 1, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1787, cluster_name: labsearch, relocating_shards: 0, active_shards: 1787, initializing_shards: 0, number_of_data_nodes: 1, delayed_unassigned_shards: 0 [11:43:15] (03CR) 10Hashar: "Cherry picked on beta cluster puppet master. Puppet did refresh the services." [puppet] - 10https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898) (owner: 10Hashar) [11:45:43] (03PS1) 10Giuseppe Lavagetto: Add tunable number of keepalive retries to IdleConnectionMonitor [debs/pybal] - 10https://gerrit.wikimedia.org/r/249983 [11:45:45] (03PS1) 10Giuseppe Lavagetto: New release, add myself to the uploaders list [debs/pybal] - 10https://gerrit.wikimedia.org/r/249984 [11:51:34] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:51:38] (03PS1) 10Faidon Liambotis: reprepro: move tor to jessie [puppet] - 10https://gerrit.wikimedia.org/r/249985 [11:51:58] (03CR) 10Faidon Liambotis: [C: 032 V: 032] reprepro: move tor to jessie [puppet] - 10https://gerrit.wikimedia.org/r/249985 (owner: 10Faidon Liambotis) [11:57:37] !log upgrading tor to latest stable on radium [11:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:06:53] 6operations, 10ops-eqiad: Rack/Setup ms-be1019-ms-1021 - https://phabricator.wikimedia.org/T116542#1768515 (10fgiunchedi) also the raid configuration isn't correct, the two SSDs are in raid1 but should be raid0, fixed via hpssacli ``` set target controller slot=3 array all delete create type=arrayr0 drivetype... 
[12:15:04] PROBLEM - ElasticSearch health check for shards on nobelium is CRITICAL: CRITICAL - elasticsearch http://10.64.37.14:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.37.14, port=9200): Read timed out. (read timeout=4) [12:24:13] RECOVERY - ElasticSearch health check for shards on nobelium is OK: OK - elasticsearch status labsearch: status: green, number_of_nodes: 1, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1787, cluster_name: labsearch, relocating_shards: 0, active_shards: 1787, initializing_shards: 0, number_of_data_nodes: 1, delayed_unassigned_shards: 0 [12:28:58] (03PS1) 10Faidon Liambotis: tor: update config file to remove default settings [puppet] - 10https://gerrit.wikimedia.org/r/249990 [12:29:13] (03CR) 10Faidon Liambotis: [C: 032 V: 032] tor: update config file to remove default settings [puppet] - 10https://gerrit.wikimedia.org/r/249990 (owner: 10Faidon Liambotis) [12:41:03] PROBLEM - ElasticSearch health check for shards on nobelium is CRITICAL: CRITICAL - elasticsearch http://10.64.37.14:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.37.14, port=9200): Read timed out. 
(read timeout=4) [12:42:43] RECOVERY - ElasticSearch health check for shards on nobelium is OK: OK - elasticsearch status labsearch: status: green, number_of_nodes: 1, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1787, cluster_name: labsearch, relocating_shards: 0, active_shards: 1787, initializing_shards: 0, number_of_data_nodes: 1, delayed_unassigned_shards: 0 [12:56:31] (03PS1) 10Muehlenhoff: LDAP schemas to be used for labs/openldap [puppet] - 10https://gerrit.wikimedia.org/r/249995 [12:59:23] PROBLEM - ElasticSearch health check for shards on nobelium is CRITICAL: CRITICAL - elasticsearch http://10.64.37.14:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.37.14, port=9200): Read timed out. (read timeout=4) [13:03:12] RECOVERY - ElasticSearch health check for shards on nobelium is OK: OK - elasticsearch status labsearch: status: green, number_of_nodes: 1, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1787, cluster_name: labsearch, relocating_shards: 0, active_shards: 1787, initializing_shards: 0, number_of_data_nodes: 1, delayed_unassigned_shards: 0 [13:07:49] apergos: ping? [13:08:01] paravoid: pong [13:08:05] snapshot patches? [13:08:09] yes [13:08:15] k [13:20:24] (03PS1) 10Ottomata: Enable varnishreqstats for all caches [puppet] - 10https://gerrit.wikimedia.org/r/249997 [13:25:39] (03CR) 10Ottomata: [C: 032] Enable varnishreqstats for all caches [puppet] - 10https://gerrit.wikimedia.org/r/249997 (owner: 10Ottomata) [13:33:25] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1768664 (10BBlack) It really shouldn't be hard for an OS/distribution/platform/language/whatever to have working TLS with a s... 
[13:34:59] (03PS1) 10Aude: Exclude LiquidThread namespaces from Special:UnconnectedPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250000 (https://phabricator.wikimedia.org/T117174) [13:38:14] PROBLEM - ElasticSearch health check for shards on nobelium is CRITICAL: CRITICAL - elasticsearch http://10.64.37.14:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.37.14, port=9200): Read timed out. (read timeout=4) [13:41:52] RECOVERY - ElasticSearch health check for shards on nobelium is OK: OK - elasticsearch status labsearch: status: green, number_of_nodes: 1, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1787, cluster_name: labsearch, relocating_shards: 0, active_shards: 1787, initializing_shards: 0, number_of_data_nodes: 1, delayed_unassigned_shards: 0 [13:47:24] !log krenair@tin Synchronized php-1.27.0-wmf.4/extensions/VisualEditor/VisualEditor.hooks.php: https://gerrit.wikimedia.org/r/#/c/249999/ (duration: 00m 17s) [13:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:47:42] !log revert commit for https://gerrit.wikimedia.org/r/#/c/249971/ on tin [13:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:53:48] (03PS1) 10Hashar: devpi a cache/proxy for python PyPI [puppet] - 10https://gerrit.wikimedia.org/r/250002 (https://phabricator.wikimedia.org/T117207) [13:56:59] (03CR) 10Hashar: [C: 04-1] "WIP" [puppet] - 10https://gerrit.wikimedia.org/r/250002 (https://phabricator.wikimedia.org/T117207) (owner: 10Hashar) [14:01:54] (03PS2) 10Hashar: devpi a cache/proxy for python PyPI [puppet] - 10https://gerrit.wikimedia.org/r/250002 (https://phabricator.wikimedia.org/T117207) [14:04:20] (03PS3) 10Hashar: devpi a cache/proxy for python PyPI [puppet] - 10https://gerrit.wikimedia.org/r/250002 (https://phabricator.wikimedia.org/T117207) [14:06:02] PROBLEM - ElasticSearch health check 
for shards on nobelium is CRITICAL: CRITICAL - elasticsearch http://10.64.37.14:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.37.14, port=9200): Read timed out. (read timeout=4) [14:07:42] RECOVERY - ElasticSearch health check for shards on nobelium is OK: OK - elasticsearch status labsearch: status: green, number_of_nodes: 1, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1787, cluster_name: labsearch, relocating_shards: 0, active_shards: 1787, initializing_shards: 0, number_of_data_nodes: 1, delayed_unassigned_shards: 0 [14:12:12] (03PS4) 10Hashar: devpi a cache/proxy for python PyPI [puppet] - 10https://gerrit.wikimedia.org/r/250002 (https://phabricator.wikimedia.org/T117207) [14:16:53] PROBLEM - ElasticSearch health check for shards on nobelium is CRITICAL: CRITICAL - elasticsearch http://10.64.37.14:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.37.14, port=9200): Read timed out. (read timeout=4) [14:18:42] RECOVERY - ElasticSearch health check for shards on nobelium is OK: OK - elasticsearch status labsearch: status: green, number_of_nodes: 1, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1787, cluster_name: labsearch, relocating_shards: 0, active_shards: 1787, initializing_shards: 0, number_of_data_nodes: 1, delayed_unassigned_shards: 0 [14:19:54] (03PS1) 10Hashar: contint: drop reference to pip.conf [puppet] - 10https://gerrit.wikimedia.org/r/250004 [14:27:53] PROBLEM - ElasticSearch health check for shards on nobelium is CRITICAL: CRITICAL - elasticsearch http://10.64.37.14:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.37.14, port=9200): Read timed out. 
(read timeout=4) [14:29:43] RECOVERY - ElasticSearch health check for shards on nobelium is OK: OK - elasticsearch status labsearch: status: green, number_of_nodes: 1, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1787, cluster_name: labsearch, relocating_shards: 0, active_shards: 1787, initializing_shards: 0, number_of_data_nodes: 1, delayed_unassigned_shards: 0 [14:38:50] !log nobelium in downtime -- this is a temporary test host for discovery and is not ops actionable [14:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:41:55] (03CR) 10Rush: "yes daniel that's true" [dns] - 10https://gerrit.wikimedia.org/r/249919 (owner: 10Rush) [14:43:00] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#1768862 (10chasemp) >>! In T117095#1768225, @Metanish wrote: > FWIW, this would help offline server distributions like the [[ http://schoolserver.org | XSCE ]] a... 
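The flapping nobelium alerts above come from a Nagios-style probe of Elasticsearch's `/_cluster/health` endpoint, which either times out (CRITICAL) or returns a status document (green/yellow/red plus shard counts). A minimal sketch of how such a response maps to a check state; the `es_health_state` helper and its thresholds are illustrative assumptions, not the actual Icinga plugin used here:

```python
# Sketch: classify an Elasticsearch /_cluster/health response the way
# a Nagios-style check would. es_health_state is a hypothetical helper,
# not the real check_elasticsearch plugin.
def es_health_state(health: dict) -> str:
    # red status or any unassigned shards is treated as critical
    if health.get("status") == "red" or health.get("unassigned_shards", 0) > 0:
        return "CRITICAL"
    if health.get("status") == "yellow":
        return "WARNING"
    return "OK"

# Values taken from the RECOVERY message above (labsearch cluster)
green = {"status": "green", "number_of_nodes": 1, "unassigned_shards": 0,
         "active_primary_shards": 1787, "active_shards": 1787}
assert es_health_state(green) == "OK"
assert es_health_state({"status": "red"}) == "CRITICAL"
assert es_health_state({"status": "yellow", "unassigned_shards": 0}) == "WARNING"
```

A read timeout (the `read timeout=4` in the PROBLEM lines) never reaches this classification step; the check reports CRITICAL directly when the HTTP fetch fails.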
[14:43:18] (03PS1) 10Muehlenhoff: openldap: Allow specifying an additional set of LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/250010 (https://phabricator.wikimedia.org/T101299) [14:43:20] (03PS1) 10Muehlenhoff: Add schema definitions for openldap/labs to be passed in $extra_schemas [puppet] - 10https://gerrit.wikimedia.org/r/250011 (https://phabricator.wikimedia.org/T101299) [14:43:54] (03Abandoned) 10Muehlenhoff: LDAP schemas to be used for labs/openldap [puppet] - 10https://gerrit.wikimedia.org/r/249995 (owner: 10Muehlenhoff) [14:44:34] (03CR) 10jenkins-bot: [V: 04-1] openldap: Allow specifying an additional set of LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/250010 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff) [14:49:08] 6operations, 7Mail: wikimedia.org email alias with +something sub-addressing bounces - https://phabricator.wikimedia.org/T117144#1768867 (10chasemp) p:5Triage>3Normal [14:49:15] 6operations, 7Mail: wikimedia.org email alias with +something sub-addressing bounces - https://phabricator.wikimedia.org/T117144#1767684 (10chasemp) @faidon could you take a look at this? 
[14:52:53] (03PS1) 10Ottomata: Properly export env vars from hive service default files [puppet/cdh] - 10https://gerrit.wikimedia.org/r/250013 (https://phabricator.wikimedia.org/T76343) [14:53:17] (03PS2) 10Ottomata: Properly export env vars from hive service default files [puppet/cdh] - 10https://gerrit.wikimedia.org/r/250013 (https://phabricator.wikimedia.org/T76343) [14:53:37] (03CR) 10Ottomata: [C: 032] Properly export env vars from hive service default files [puppet/cdh] - 10https://gerrit.wikimedia.org/r/250013 (https://phabricator.wikimedia.org/T76343) (owner: 10Ottomata) [14:54:56] 6operations, 6Discovery: Investigate adding memory to elastic10{01...16} to bring more parity between the two types of servers running elasticsearch in eqiad - https://phabricator.wikimedia.org/T117110#1768884 (10faidon) Both groups of machines are fairly idle. Disks are almost completely idle (less than 3MB/s... [14:57:05] (03PS1) 10Ottomata: Increase hive-server2 heap size and update cdh submodule to allow this [puppet] - 10https://gerrit.wikimedia.org/r/250014 (https://phabricator.wikimedia.org/T76343) [14:59:07] (03PS1) 10Ottomata: Allow setting heapsize for hive metastore and server separately Bug: T76343 [puppet/cdh] - 10https://gerrit.wikimedia.org/r/250015 (https://phabricator.wikimedia.org/T76343) [14:59:49] (03PS2) 10Ottomata: Allow setting heapsize for hive metastore and server separately Bug: T76343 [puppet/cdh] - 10https://gerrit.wikimedia.org/r/250015 (https://phabricator.wikimedia.org/T76343) [15:00:23] (03CR) 10Ottomata: [C: 032] Allow setting heapsize for hive metastore and server separately Bug: T76343 [puppet/cdh] - 10https://gerrit.wikimedia.org/r/250015 (https://phabricator.wikimedia.org/T76343) (owner: 10Ottomata) [15:01:33] (03PS1) 10Giuseppe Lavagetto: mediawiki::maintenance: create symlink to /srv/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/250016 [15:03:06] (03CR) 10Mark Bergsma: [C: 031] "I assume you've verified that the test hosts are also 
in row B, or will be moved there?" [dns] - 10https://gerrit.wikimedia.org/r/249914 (owner: 10Rush) [15:03:31] (03PS2) 10Ottomata: Increase hive-server2 heap size and update cdh submodule to allow this [puppet] - 10https://gerrit.wikimedia.org/r/250014 (https://phabricator.wikimedia.org/T76343) [15:03:57] (03CR) 10Rush: "for posterity: all but 2 and there is an active ticket to move them on papaul's plate now (confirmed there is space)" [dns] - 10https://gerrit.wikimedia.org/r/249914 (owner: 10Rush) [15:04:50] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::maintenance: create symlink to /srv/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/250016 (owner: 10Giuseppe Lavagetto) [15:07:06] (03PS2) 10Muehlenhoff: openldap: Allow specifying an additional set of LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/250010 (https://phabricator.wikimedia.org/T101299) [15:08:11] (03PS3) 10Ottomata: Increase hive-server2 heap size and update cdh submodule to allow this [puppet] - 10https://gerrit.wikimedia.org/r/250014 (https://phabricator.wikimedia.org/T76343) [15:08:22] (03CR) 10Ottomata: [C: 032 V: 032] Increase hive-server2 heap size and update cdh submodule to allow this [puppet] - 10https://gerrit.wikimedia.org/r/250014 (https://phabricator.wikimedia.org/T76343) (owner: 10Ottomata) [15:09:25] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1768954 (10Nemo_bis) > This almost feels like trying to ship a bundled TCP/IP stack and set of ethernet hardware drivers wit... [15:11:52] 6operations, 6Discovery: Investigate adding memory to elastic10{01...16} to bring more parity between the two types of servers running elasticsearch in eqiad - https://phabricator.wikimedia.org/T117110#1768964 (10chasemp) p:5Triage>3Normal [15:21:25] not sure if I should do this on my own repo or on wikimedia's? 
https://github.com/jynus/mysql-sys [15:31:06] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Proposed a simpler (IMHO) way of doing this inline. I do like the approach though. I think we could use this for the ACL policy part." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/250010 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff) [15:34:12] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy [15:34:33] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy [15:35:36] !log updated db205[6-8] to the 3.19 kernel [15:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:58] thank you very much, moritzm ! [15:36:23] I do not know what I would do without your help! [15:41:50] (03PS3) 10Muehlenhoff: openldap: Allow specifying an additional set of LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/250010 (https://phabricator.wikimedia.org/T101299) [15:42:36] (03CR) 10jenkins-bot: [V: 04-1] openldap: Allow specifying an additional set of LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/250010 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff) [15:48:31] (03CR) 10Mark Bergsma: "Difference between allocation and assignment. We've allocated larger (/20 I think), and assigned /24 initially, reassigned to larger upon " [dns] - 10https://gerrit.wikimedia.org/r/249919 (owner: 10Rush) [15:50:21] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#1769111 (10Metanish) Yes. I am one of the volunteers with the project. I believe we grab them off of the kiwix.org download page through a web interface. More... 
[15:54:07] 6operations, 6Discovery: Investigate adding memory to elastic10{01...16} to bring more parity between the two types of servers running elasticsearch in eqiad - https://phabricator.wikimedia.org/T117110#1769121 (10faidon) So @chasemp mentioned that these metrics come directly from ElasticSearch's own instrument... [16:03:23] (03PS4) 10Muehlenhoff: openldap: Allow specifying an additional set of LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/250010 (https://phabricator.wikimedia.org/T101299) [16:04:27] (03CR) 10jenkins-bot: [V: 04-1] openldap: Allow specifying an additional set of LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/250010 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff) [16:14:48] 6operations, 10ops-eqiad, 10netops: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#1769231 (10Cmjohnson) Removed sfp's from ports 26/27/28. [16:16:16] 6operations, 7Mail: wikimedia.org email alias with +something sub-addressing bounces - https://phabricator.wikimedia.org/T117144#1769235 (10faidon) (not sure why me, but anyway :) There is no such thing as "subaddressing" in email. There are a few email providers that provide -suffix or +suffix and strip it o... [16:24:38] 6operations, 7Mail: wikimedia.org email alias with +something sub-addressing bounces - https://phabricator.wikimedia.org/T117144#1769266 (10bd808) >>! In T117144#1769235, @faidon wrote: > All in all, this is the first time -to my knowledge- we've had this request and it's for a really minor use case (just use... 
[16:29:10] (03CR) 10JanZerebecki: [C: 031] admin: add jzerebecki to deployers [puppet] - 10https://gerrit.wikimedia.org/r/249818 (https://phabricator.wikimedia.org/T116487) (owner: 10Dzahn) [16:39:24] PROBLEM - puppet last run on mw1152 is CRITICAL: CRITICAL: Puppet has 1 failures [16:40:03] 6operations, 10MediaWiki-extensions-BounceHandler, 5Patch-For-Review, 5WMF-deploy-2015-10-27_(1.27.0-wmf.4): Need an administrative front end for BounceHandler - https://phabricator.wikimedia.org/T114020#1769306 (10Legoktm) https://grafana.wikimedia.org/dashboard/db/mediawiki-bouncehandler The bounces alw... [16:43:30] (03PS1) 10Muehlenhoff: Additional fine tuning of misc server groups [puppet] - 10https://gerrit.wikimedia.org/r/250036 [16:44:39] 6operations, 7Mail: wikimedia.org email alias with +something sub-addressing bounces - https://phabricator.wikimedia.org/T117144#1769326 (10faidon) Having exim accept is as simple as adding this to the router: ``` local_part_suffix = +* : -* local_part_suffix_optional ``` That would make "packagist-admin"... [16:45:00] (03PS1) 10Muehlenhoff: Update TODO [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/250037 [16:45:58] (03CR) 10Muehlenhoff: [C: 032 V: 032] Additional fine tuning of misc server groups [puppet] - 10https://gerrit.wikimedia.org/r/250036 (owner: 10Muehlenhoff) [16:46:09] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1769346 (10greg) Thanks all! And sorry for the delay. Approved. Thanks especially, to Jan, for always being helpful and a pleasure to work with. 
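The sub-addressing discussed in T117144 amounts to stripping everything from the first separator onward in the local part before alias lookup, which is what the quoted exim router options (`local_part_suffix = +* : -*` with `local_part_suffix_optional`) would do. A minimal sketch, with a hypothetical `strip_subaddress` helper; it also illustrates the hazard of allowing `-` as a separator, since an existing alias like `packagist-admin` would then resolve to `packagist`:

```python
# Sketch of +suffix / -suffix sub-addressing: strip the suffix from the
# local part before alias lookup. strip_subaddress is a hypothetical
# helper, not exim's implementation.
def strip_subaddress(local_part: str, separators: str = "+-") -> str:
    for sep in separators:
        if sep in local_part:
            # keep only the base address before the first separator
            return local_part.split(sep, 1)[0]
    return local_part

assert strip_subaddress("someone+tag") == "someone"
assert strip_subaddress("plain") == "plain"
# The "-" separator silently rewrites hyphenated aliases:
assert strip_subaddress("packagist-admin") == "packagist"
# Restricting to "+" avoids that:
assert strip_subaddress("packagist-admin", separators="+") == "packagist-admin"
```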
[16:46:18] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update TODO [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/250037 (owner: 10Muehlenhoff) [16:46:24] (03CR) 10Greg Grossmeier: [C: 031] admin: add jzerebecki to deployers [puppet] - 10https://gerrit.wikimedia.org/r/249818 (https://phabricator.wikimedia.org/T116487) (owner: 10Dzahn) [16:53:35] (03PS1) 10Muehlenhoff: impala: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/250038 [16:54:43] RECOVERY - Host mw1083 is UP: PING OK - Packet loss = 0%, RTA = 4.16 ms [16:56:26] !log reinstalling mw1083 https://phabricator.wikimedia.org/T116184 [16:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:04:13] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [17:04:13] PROBLEM - puppet last run on hooft is CRITICAL: CRITICAL: puppet fail [17:04:54] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [17:05:34] RECOVERY - puppet last run on mw1152 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:08:55] 6operations, 10MediaWiki-Cache, 6Performance-Team, 10hardware-requests, 7Availability: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1769427 (10RobH) a:5RobH>3mark @ottomata: We can swap the dual 160GB SSD with dual 250G... 
[17:15:45] (03PS3) 10Ori.livneh: move error pages to errorpages/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249796 [17:15:54] (03CR) 10Ori.livneh: [C: 032] move error pages to errorpages/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249796 (owner: 10Ori.livneh) [17:16:00] (03Merged) 10jenkins-bot: move error pages to errorpages/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249796 (owner: 10Ori.livneh) [17:17:02] 6operations, 10MediaWiki-Cache, 6Performance-Team, 10hardware-requests, 7Availability: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1769480 (10Ottomata) I'd like it if these systems were as close to homogenous as possible,... [17:17:18] (03CR) 10Mark Bergsma: [C: 031] "I'd personally start with a /24 assignment for all instance VLANs - this is a small test cluster anyway, AND the goal is to move away from" [dns] - 10https://gerrit.wikimedia.org/r/249919 (owner: 10Rush) [17:19:31] 6operations, 7Mail: wikimedia.org email alias with +something sub-addressing bounces - https://phabricator.wikimedia.org/T117144#1769486 (10Krenair) >>! In T117144#1769235, @faidon wrote: > - The above could apply with "+" as well, although I doubt we have addresses that currently include "+" right now; we don... [17:20:32] PROBLEM - check_load on bismuth is CRITICAL: CRITICAL - load average: 40.17, 32.34, 19.26 [17:22:22] !log ori@tin Synchronized docroot and w: (no message) (duration: 00m 18s) [17:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:23:05] 6operations, 6Parsing-Team, 10hardware-requests: Dedicated server for running Parsoid's roundtrip tests to get reliable parse latencies and use as perf. benchmarking tests - https://phabricator.wikimedia.org/T116090#1769532 (10RobH) I have a slight concern about moving things off ruthenium.eqiad.wmnet and on... 
[17:24:01] (03PS3) 10Ori.livneh: Update error page locations for Ibd8f69a54ac [puppet] - 10https://gerrit.wikimedia.org/r/249965 [17:24:41] (03CR) 10Ori.livneh: [C: 032 V: 032] Update error page locations for Ibd8f69a54ac [puppet] - 10https://gerrit.wikimedia.org/r/249965 (owner: 10Ori.livneh) [17:25:13] PROBLEM - Restbase endpoints health on xenon is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)) [17:25:45] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)) [17:26:23] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [17:27:04] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [17:27:18] 6operations, 10ops-eqiad: check for spare 4GB DDR3 Synchronous 1333 MHz RAM - https://phabricator.wikimedia.org/T117248#1769549 (10RobH) 3NEW a:3Cmjohnson [17:29:12] PROBLEM - Host bismuth is DOWN: PING CRITICAL - Packet loss = 100% [17:29:40] 6operations, 10MediaWiki-Cache, 6Performance-Team, 10hardware-requests, 7Availability: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1769568 (10mobrovac) What would the ETAs be for either options? I.e. if we go with spares,... [17:30:12] <_joe_> mobrovac: why do we have problems with rb again? [17:30:23] is bismuth down planned? 
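The recurring RESTBase `/page/mobile-html/{title}` alerts above report zlib "Error -3 while decompressing: incorrect header check", which is what a client sees when a response body advertised as `Content-Encoding: gzip` does not actually start with a gzip (or zlib) header. A small reproduction of both the failing and the healthy case, using a hypothetical `try_gunzip` helper:

```python
import gzip
import zlib

# try_gunzip mimics what an HTTP client does with a gzip-encoded body:
# wbits=47 (32 + 15) tells zlib to auto-detect gzip or zlib framing.
def try_gunzip(body: bytes):
    try:
        return zlib.decompress(body, 47), None
    except zlib.error as exc:
        return None, str(exc)

# Body labelled gzip but actually plain text -> the error from the alerts
plain = b"<html>Main_Page</html>"
data, err = try_gunzip(plain)
assert data is None and "incorrect header check" in err

# Properly gzip-compressed body decodes cleanly
data, err = try_gunzip(gzip.compress(plain))
assert data == plain and err is None
```

So the alert points at the origin emitting an uncompressed (or double-encoded) body under a gzip header, rather than at the monitoring host itself.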
[17:30:30] (its paged) [17:30:59] Jeff_Green: its eqiad frack [17:31:16] robh, thanks [17:31:20] hm why indeed _joe_ [17:31:30] could not find it on puppet [17:31:52] yea its console is locked up [17:31:54] serial. [17:32:09] so i think its crashed, i hesitate to reboot frack stuff, lemme see if i can find what it does [17:33:11] meh, its locked up though so reboot should be fine, rebooting [17:33:24] !log bismuth has no console output, appears locked up in OS, rebooting via mgmt [17:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:33:50] 6operations, 10ops-eqiad: aqs1001 getting multiple and repeated heat MCEs - https://phabricator.wikimedia.org/T116584#1769589 (10Milimetric) Hm, that sucks, now how are we going to cook our eggs? Thx Chris :) [17:34:11] bismuth is known. it's ok. [17:34:13] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:34:18] Jeff_Green: shit... [17:34:21] just sent it a reboot! [17:34:26] ha ok [17:34:32] sorry [17:34:35] you were doing that already i take it? [17:34:52] sorry about that, i just didnt wanna leave it sitting =] [17:35:42] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [100000000.0] [17:36:01] robh: Cool, I was about to have Jeff reboot it anyway. So thanks :) [17:36:09] (03PS1) 10Ori.livneh: Stop building wikiversions.cdb; no longer in use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250044 [17:36:47] I should have warned you here.. didn't realize it would page. [17:37:31] (03PS2) 10Ori.livneh: Add perf-admins to Graphite role [puppet] - 10https://gerrit.wikimedia.org/r/249966 [17:38:20] Bleh.
[17:38:24] * Coren looks at labstore1003 [17:40:13] RECOVERY - check_load on bismuth is OK: OK - load average: 1.14, 0.94, 0.45 [17:40:23] RECOVERY - Host bismuth is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [17:40:51] 6operations, 7Database, 5Patch-For-Review: Replicate flowdb from X1 to analytics-store - https://phabricator.wikimedia.org/T75047#1769627 (10Milimetric) @matthiasmullie might be interested in this, not sure. [17:41:32] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0] [17:43:50] That trigger might be a little too sensitive. [17:46:12] (03PS2) 10Ori.livneh: Delete unused refreshWikiversionsCDB file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250044 [17:46:22] ostriches: ^ [17:47:12] beta cluster is down, the db is complaining of too many connections, kren.air is looking but help appreciated [17:47:53] 6operations, 7Mail: wikimedia.org email alias with +something sub-addressing bounces - https://phabricator.wikimedia.org/T117144#1769682 (10faidon) 5Open>3declined a:3faidon >>! In T117144#1769486, @Krenair wrote: >>>! In T117144#1769235, @faidon wrote: >> - The above could apply with "+" as well, althou... [17:47:57] 10Ops-Access-Requests, 6operations: Add perf-admin to graphite role - https://phabricator.wikimedia.org/T117256#1769685 (10ori) 3NEW [17:48:10] (03PS3) 10Ori.livneh: Add perf-admins to Graphite role [puppet] - 10https://gerrit.wikimedia.org/r/249966 [17:48:19] (03CR) 10Chad: [C: 032] Delete unused refreshWikiversionsCDB file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250044 (owner: 10Ori.livneh) [17:48:25] (03Merged) 10jenkins-bot: Delete unused refreshWikiversionsCDB file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250044 (owner: 10Ori.livneh) [17:48:32] greg-g: ack, looking [17:49:02] greg-g: where is this being debugged? -releng?
[17:49:04] (03PS1) 10Dzahn: RT: move role to krypton [puppet] - 10https://gerrit.wikimedia.org/r/250047 (https://phabricator.wikimedia.org/T105555) [17:49:25] (ostriches: thanks!) [17:49:33] yw [17:49:53] ori: I'm not going to sync it explicitly. It can just go out whenever someone scaps or w/e. [17:50:01] kk [17:52:18] (03PS2) 10Dzahn: RT: move role to krypton [puppet] - 10https://gerrit.wikimedia.org/r/250047 (https://phabricator.wikimedia.org/T105555) [17:53:48] 6operations, 10MediaWiki-Cache, 6Performance-Team, 10hardware-requests, 7Availability: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1769762 (10Ottomata) > Budget considerations aside, it seems to me the optimal way forward... [17:56:30] (03PS1) 10Dzahn: racktables: move role to krypton [puppet] - 10https://gerrit.wikimedia.org/r/250050 (https://phabricator.wikimedia.org/T105555) [18:00:06] (03CR) 10RobH: "Right now RT has to still process the maint-queue emails. So this cannot move unless the migrated version will still handle mail routing." [puppet] - 10https://gerrit.wikimedia.org/r/250047 (https://phabricator.wikimedia.org/T105555) (owner: 10Dzahn) [18:01:53] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [18:02:20] (03CR) 10RobH: [C: 031] "If I recall correctly, racktables frontend doesn't store any data locally, just shoves it to the db." 
[puppet] - 10https://gerrit.wikimedia.org/r/250050 (https://phabricator.wikimedia.org/T105555) (owner: 10Dzahn) [18:02:50] (03CR) 10RobH: "discussion update - the db permissions will have to change" [puppet] - 10https://gerrit.wikimedia.org/r/250050 (https://phabricator.wikimedia.org/T105555) (owner: 10Dzahn) [18:05:50] (03CR) 10Dzahn: [C: 031] Labs instance subnet allocation for Codfw [dns] - 10https://gerrit.wikimedia.org/r/249919 (owner: 10Rush) [18:13:03] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:20:02] 10Ops-Access-Requests, 6operations: Add perf-admin to graphite role - https://phabricator.wikimedia.org/T117256#1769855 (10chasemp) p:5Triage>3Normal [18:22:21] (03CR) 10Dzahn: [C: 04-1] "Error: Failed to compile catalog for node analytics1026.eqiad.wmnet: Must pass namenode_hosts to Class[Cdh::Hadoop] at /mnt/jenkins-worksp" [puppet] - 10https://gerrit.wikimedia.org/r/250038 (owner: 10Muehlenhoff) [18:23:05] 10Ops-Access-Requests, 6operations: Requesting access to add perf-admin group to graphite role - https://phabricator.wikimedia.org/T117256#1769873 (10chasemp) [18:23:08] (03CR) 10Dzahn: "cdh::hadoop class needs fixes for puppet-compiler" [puppet] - 10https://gerrit.wikimedia.org/r/250038 (owner: 10Muehlenhoff) [18:24:52] 10Ops-Access-Requests, 6operations: Requesting access to add perf-admin group to graphite role - https://phabricator.wikimedia.org/T117256#1769685 (10chasemp) So this group is badly named in that it grants root access and the convention has been -users, -admins (some sudo), -roots (sudo ALL) https://phabricat... 
[18:25:07] 10Ops-Access-Requests, 6operations: Requesting access to add perf-admin group to graphite role (this group has full sudo fyi) - https://phabricator.wikimedia.org/T117256#1769922 (10chasemp) [18:25:11] (03CR) 10Dzahn: [C: 032] "no diff: http://puppet-compiler.wmflabs.org/1128/" [puppet] - 10https://gerrit.wikimedia.org/r/247207 (owner: 10Muehlenhoff) [18:25:17] (03PS2) 10Dzahn: labnodepool1001: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247207 (owner: 10Muehlenhoff) [18:25:25] (03CR) 10Dzahn: [C: 032] labnodepool1001: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247207 (owner: 10Muehlenhoff) [18:28:17] (03PS2) 10Dzahn: labnet1001: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247205 (owner: 10Muehlenhoff) [18:29:06] (03CR) 10Dzahn: [C: 032] "no diff: http://puppet-compiler.wmflabs.org/1129/" [puppet] - 10https://gerrit.wikimedia.org/r/247205 (owner: 10Muehlenhoff) [18:33:42] (03PS2) 10Dzahn: Include base::firewall in the phabricator role [puppet] - 10https://gerrit.wikimedia.org/r/247260 (owner: 10Muehlenhoff) [18:35:33] PROBLEM - Restbase endpoints health on cerium is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)) [18:35:34] (03CR) 10Dzahn: [C: 032] "no diff on iridum: http://puppet-compiler.wmflabs.org/1130/" [puppet] - 10https://gerrit.wikimedia.org/r/247260 (owner: 10Muehlenhoff) [18:36:35] (03PS5) 10Alexandros Kosiaris: openldap: Allow specifying an additional set of LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/250010 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff) [18:39:43] PROBLEM - Restbase endpoints health on restbase-test2003 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url 
http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)) [18:47:54] (03PS1) 10Dzahn: statsite,burrow,clamav: minimal quoting fixes [puppet] - 10https://gerrit.wikimedia.org/r/250051 [18:48:14] 6operations, 10ops-eqiad: check for spare 4GB DDR3 Synchronous 1333 MHz RAM - https://phabricator.wikimedia.org/T117248#1770027 (10Cmjohnson) I have several of these below we can use.. they're coming from R610 old varnish servers Brand: Hynix HMT31GR7BFR4C-H9 Type: DDR 3 ECC Registered Capacity: 8 GB Rank: 2... [18:48:51] (03PS3) 10Dzahn: puppetcompiler/puppet-tests: mini lint fixes (meta) [puppet] - 10https://gerrit.wikimedia.org/r/249655 [18:49:24] (03CR) 10jenkins-bot: [V: 04-1] puppetcompiler/puppet-tests: mini lint fixes (meta) [puppet] - 10https://gerrit.wikimedia.org/r/249655 (owner: 10Dzahn) [18:51:57] (03PS1) 10Dzahn: dnsrecursor: minimal quoting fixes [puppet] - 10https://gerrit.wikimedia.org/r/250052 [18:52:42] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: puppet fail [18:56:22] !log disabling puppet and restarting mysql on es2008, es2009 and es2010- downtime planned for 2 hours - no production impact [18:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:56:52] (03PS1) 10Dzahn: postgresl,sslcert: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250053 [19:13:43] (03PS1) 10Dzahn: snapshot: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250058 [19:20:12] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:37:38] 6operations: puppet compiler: NoneType' object is not iterable with node auto-select feature - https://phabricator.wikimedia.org/T117278#1770219 (10Dzahn) [19:38:34] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/1135/" 
[puppet] - 10https://gerrit.wikimedia.org/r/250058 (owner: 10Dzahn) [19:44:01] (03PS2) 10Dzahn: statsite,burrow,clamav: minimal quoting fixes [puppet] - 10https://gerrit.wikimedia.org/r/250051 [19:44:08] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/1136/" [puppet] - 10https://gerrit.wikimedia.org/r/250051 (owner: 10Dzahn) [19:44:54] 6operations: puppet compiler: NoneType' object is not iterable with node auto-select feature - https://phabricator.wikimedia.org/T117278#1770250 (10chasemp) p:5Triage>3Normal [19:46:54] (03PS2) 10Dzahn: dnsrecursor: minimal quoting fixes [puppet] - 10https://gerrit.wikimedia.org/r/250052 [19:48:40] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/1137/" [puppet] - 10https://gerrit.wikimedia.org/r/250052 (owner: 10Dzahn) [19:54:15] 6operations, 6Discovery: Investigate adding memory to elastic10{01...16} to bring more parity between the two types of servers running elasticsearch in eqiad - https://phabricator.wikimedia.org/T117110#1770269 (10EBernhardson) All good questions, thanks for looking at this. > Both groups of machines are fairl... [19:56:47] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1770275 (10Tgr) >>! In T102566#1768664, @BBlack wrote: > I haven't followed the low-level details too hard, but I think most... [20:02:42] Did https://www.mediawiki.org/wiki/InstantCommons#Scalability_considerations ever get implemented? [20:04:38] 6operations, 6Discovery: Investigate adding memory to elastic10{01...16} to bring more parity between the two types of servers running elasticsearch in eqiad - https://phabricator.wikimedia.org/T117110#1770286 (10faidon) Thanks. So part of what you're saying is that you confirm that the load is not uniformally... 
[20:04:48] ebernhardson: thanks -- I re-responded [20:07:20] (03PS1) 10Dzahn: swift: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250063 [20:07:22] (03PS1) 10Dzahn: openstack: some more lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250064 [20:09:57] (03PS1) 10Ori.livneh: remove old refreshWikiversionsCDB alias [puppet] - 10https://gerrit.wikimedia.org/r/250065 [20:11:09] (03PS2) 10Ori.livneh: remove old refreshWikiversionsCDB alias [puppet] - 10https://gerrit.wikimedia.org/r/250065 [20:13:04] (03PS1) 10Dzahn: udp2log,gerritl,labsdb,camus: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250066 [20:14:27] 6operations, 10Sentry, 10hardware-requests: Procure hardware for Sentry - https://phabricator.wikimedia.org/T93138#1770295 (10RobH) @Tgr: Just following up on this request; I see the basic puppetization has been merged live. In the initial request, this states it cannot be run in labs due to private data.... [20:15:18] (03PS1) 10Dzahn: eventlogging,labsdns,redisdb: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250067 [20:16:25] (03CR) 10Ori.livneh: [C: 032] remove old refreshWikiversionsCDB alias [puppet] - 10https://gerrit.wikimedia.org/r/250065 (owner: 10Ori.livneh) [20:20:25] (03PS1) 10Dzahn: kafka: use role keyword [puppet] - 10https://gerrit.wikimedia.org/r/250068 [20:21:55] (03PS1) 10Dzahn: labsdb1004 - use role keyword [puppet] - 10https://gerrit.wikimedia.org/r/250069 [20:24:46] (03PS1) 10Dzahn: analytics: no "if $hostname" within node blocks, use role [puppet] - 10https://gerrit.wikimedia.org/r/250070 [20:25:28] (03PS2) 10Dzahn: logstash: no "if $hostname" in node blocks, use role [puppet] - 10https://gerrit.wikimedia.org/r/250070 [20:25:56] 6operations, 7Database, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1770311 (10jcrespo) I wanted to setup an operational test with 1 time certs and keys, not for production usage, but to observe operational 
issues and performance. This resulted in a grea... [20:31:00] (03PS1) 10Dzahn: swift: no "if $hostname" in node blocks, use role [puppet] - 10https://gerrit.wikimedia.org/r/250072 [20:32:01] 6operations, 10MediaWiki-Cache, 6Performance-Team, 10hardware-requests, 7Availability: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1770318 (10GWicke) Given the request volumes we expect the main consideration for homogenei... [20:32:53] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#1770320 (10RobH) @yuvipanda or @chasemp: Is labs in a state to support bare metal deployments in eqiad at this time? I wasn't under the impression it was yet th... [20:33:40] (03PS2) 10Dzahn: swift: no "if $hostname" in node blocks, use role [puppet] - 10https://gerrit.wikimedia.org/r/250072 [20:33:46] 6operations, 7Database, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1770322 (10jcrespo) The puppetization tasks are trivial, but they cannot be done until the previous problem is solved. * Modify /etc/my.cnf on the [client] section: ``` #ssl ssl-ca=/etc... 
[20:33:53] 6operations: deploy eventlog2001 services - https://phabricator.wikimedia.org/T93220#1770325 (10RobH) [20:33:54] 6operations, 10hardware-requests: deploy eventlog2001 - https://phabricator.wikimedia.org/T90907#1770323 (10RobH) 5Open>3Resolved better to followup on T93220 [20:33:55] 6operations, 10hardware-requests: codfw/eqiad: (1) eventlogging node (per site) - https://phabricator.wikimedia.org/T90747#1770326 (10RobH) [20:34:05] 6operations: deploy eventlog2001 services - https://phabricator.wikimedia.org/T93220#1770327 (10RobH) a:5ori>3None [20:35:16] (03PS3) 10Dzahn: swift: no "if $hostname" in node blocks, use role [puppet] - 10https://gerrit.wikimedia.org/r/250072 [20:36:36] (03PS1) 10Dzahn: nobelium: use role keyword [puppet] - 10https://gerrit.wikimedia.org/r/250076 [20:38:46] (03PS1) 10Dzahn: elasticsearch: move base::firewall to role [puppet] - 10https://gerrit.wikimedia.org/r/250078 [20:38:50] (03PS1) 10Rush: fix wdqs-(roots|admin) name drift [puppet] - 10https://gerrit.wikimedia.org/r/250079 [20:40:22] (03PS1) 10Dzahn: palladium: add conftool::master to role keyword [puppet] - 10https://gerrit.wikimedia.org/r/250080 [20:40:39] (03CR) 10jenkins-bot: [V: 04-1] palladium: add conftool::master to role keyword [puppet] - 10https://gerrit.wikimedia.org/r/250080 (owner: 10Dzahn) [20:40:41] (03CR) 10Rush: [C: 032 V: 032] fix wdqs-(roots|admin) name drift [puppet] - 10https://gerrit.wikimedia.org/r/250079 (owner: 10Rush) [20:41:14] (03CR) 10Dzahn: "+1, thx" [puppet] - 10https://gerrit.wikimedia.org/r/250079 (owner: 10Rush) [20:43:28] (03PS1) 10Dzahn: snapshot: no $hostname checks in node groups, use role [puppet] - 10https://gerrit.wikimedia.org/r/250082 [20:44:38] (03PS1) 10Dzahn: tin,mira: labsdb::manager to role keyword [puppet] - 10https://gerrit.wikimedia.org/r/250083 [20:44:53] (03PS1) 10Rush: admin: perf-admins grants ALL change to perf-roots [puppet] - 10https://gerrit.wikimedia.org/r/250084 [20:45:55] (03CR) 10Rush: [C: 032] admin: perf-admins 
grants ALL change to perf-roots [puppet] - 10https://gerrit.wikimedia.org/r/250084 (owner: 10Rush) [20:46:52] (03PS1) 10Dzahn: tin,mira: move base::firewall to deployment role [puppet] - 10https://gerrit.wikimedia.org/r/250091 [20:48:02] (03PS1) 10Dzahn: tin,mira: move standard to role class [puppet] - 10https://gerrit.wikimedia.org/r/250103 [20:48:54] (03PS1) 10Rush: admin: perf-roots different gid [puppet] - 10https://gerrit.wikimedia.org/r/250110 [20:49:37] (03CR) 10Rush: [C: 032] admin: perf-roots different gid [puppet] - 10https://gerrit.wikimedia.org/r/250110 (owner: 10Rush) [20:50:53] (03PS2) 10Jcrespo: [WIP] Script to generate openssh TLS keys for mysql replication [software] - 10https://gerrit.wikimedia.org/r/247542 (https://phabricator.wikimedia.org/T111654) [20:51:05] PROBLEM - puppet last run on cp1071 is CRITICAL: CRITICAL: Puppet has 1 failures [20:51:22] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 1 failures [20:51:34] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Puppet has 1 failures [20:51:43] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: Puppet has 1 failures [20:51:43] PROBLEM - puppet last run on mc2014 is CRITICAL: CRITICAL: Puppet has 1 failures [20:51:44] PROBLEM - puppet last run on mw1206 is CRITICAL: CRITICAL: Puppet has 1 failures [20:51:44] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: Puppet has 1 failures [20:51:53] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 1 failures [20:52:03] PROBLEM - puppet last run on mc1012 is CRITICAL: CRITICAL: Puppet has 1 failures [20:52:03] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Puppet has 1 failures [20:52:23] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: Puppet has 1 failures [20:52:24] PROBLEM - puppet last run on mw1020 is CRITICAL: CRITICAL: Puppet has 1 failures [20:52:34] PROBLEM - puppet last run on mw1075 is CRITICAL: CRITICAL: Puppet has 1 failures [20:52:42] PROBLEM - 
puppet last run on mw2111 is CRITICAL: CRITICAL: Puppet has 1 failures [20:52:42] PROBLEM - puppet last run on mw2168 is CRITICAL: CRITICAL: Puppet has 1 failures [20:52:42] PROBLEM - puppet last run on mw1137 is CRITICAL: CRITICAL: Puppet has 1 failures [20:52:54] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 1 failures [20:53:03] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Puppet has 1 failures [20:53:04] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Puppet has 1 failures [20:53:05] Rush, revert? [20:53:14] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures [20:53:14] PROBLEM - puppet last run on mw2062 is CRITICAL: CRITICAL: Puppet has 1 failures [20:53:22] PROBLEM - puppet last run on mw1230 is CRITICAL: CRITICAL: Puppet has 1 failures [20:53:32] PROBLEM - puppet last run on cp1073 is CRITICAL: CRITICAL: Puppet has 1 failures [20:53:32] PROBLEM - puppet last run on cp1062 is CRITICAL: CRITICAL: Puppet has 1 failures [20:53:32] PROBLEM - puppet last run on cp1099 is CRITICAL: CRITICAL: Puppet has 1 failures [20:53:32] PROBLEM - puppet last run on mw1078 is CRITICAL: CRITICAL: Puppet has 1 failures [20:53:34] PROBLEM - puppet last run on mw2098 is CRITICAL: CRITICAL: Puppet has 1 failures [20:53:42] PROBLEM - puppet last run on mc1008 is CRITICAL: CRITICAL: Puppet has 1 failures [20:53:43] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: Puppet has 1 failures [20:53:44] PROBLEM - puppet last run on mw2101 is CRITICAL: CRITICAL: Puppet has 1 failures [20:53:46] I see [20:53:53] PROBLEM - puppet last run on mw1094 is CRITICAL: CRITICAL: Puppet has 1 failures [20:53:53] PROBLEM - puppet last run on mw1136 is CRITICAL: CRITICAL: Puppet has 1 failures [20:54:03] PROBLEM - puppet last run on mw2130 is CRITICAL: CRITICAL: Puppet has 1 failures [20:54:03] PROBLEM - puppet last run on cp2015 is CRITICAL: CRITICAL: Puppet has 1 failures [20:54:03] PROBLEM - puppet last run on mw2203 
is CRITICAL: CRITICAL: Puppet has 1 failures [20:54:03] PROBLEM - puppet last run on cp2019 is CRITICAL: CRITICAL: Puppet has 1 failures [20:54:04] PROBLEM - puppet last run on mw2085 is CRITICAL: CRITICAL: Puppet has 1 failures [20:54:13] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Puppet has 1 failures [20:54:15] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: Puppet has 1 failures [20:54:23] PROBLEM - puppet last run on mw2174 is CRITICAL: CRITICAL: Puppet has 1 failures [20:54:23] PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: Puppet has 1 failures [20:54:24] PROBLEM - puppet last run on mw1084 is CRITICAL: CRITICAL: Puppet has 1 failures [20:54:43] PROBLEM - puppet last run on mw2152 is CRITICAL: CRITICAL: Puppet has 1 failures [20:54:43] PROBLEM - puppet last run on mw2133 is CRITICAL: CRITICAL: Puppet has 1 failures [20:54:44] PROBLEM - puppet last run on mw2200 is CRITICAL: CRITICAL: Puppet has 1 failures [20:54:44] PROBLEM - puppet last run on mw2150 is CRITICAL: CRITICAL: Puppet has 1 failures [20:54:44] PROBLEM - puppet last run on mw2172 is CRITICAL: CRITICAL: Puppet has 1 failures [20:54:52] PROBLEM - puppet last run on mw2094 is CRITICAL: CRITICAL: Puppet has 1 failures [20:55:03] PROBLEM - puppet last run on mw1132 is CRITICAL: CRITICAL: Puppet has 1 failures [20:55:03] PROBLEM - puppet last run on mw1049 is CRITICAL: CRITICAL: Puppet has 1 failures [20:55:12] PROBLEM - puppet last run on mw2107 is CRITICAL: CRITICAL: Puppet has 1 failures [20:55:12] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Puppet has 1 failures [20:55:13] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Puppet has 1 failures [20:55:42] PROBLEM - puppet last run on mw1180 is CRITICAL: CRITICAL: Puppet has 1 failures [20:55:53] PROBLEM - puppet last run on mw1030 is CRITICAL: CRITICAL: Puppet has 1 failures [20:55:54] PROBLEM - puppet last run on mw1165 is CRITICAL: CRITICAL: Puppet has 1 failures [20:55:57] so.. 
is it the admin group change? [20:56:29] killed the bot [20:56:47] (duplicate gid) [20:56:59] I am checking that the latest commit fixes it [20:57:11] but it is sooooo slow [20:57:12] also checks puppet on a random one of these [20:57:15] same here, yea [20:57:36] chasemp: ^ did you see? [20:57:49] that should be fixed [20:57:58] looking [20:58:25] yes it does [20:58:33] but it took puppet ages to execute [20:58:36] separately i see interesting CRITs in icinga about salt-minions [20:58:53] a whole bunch of servers have salt-minion running 10 times [20:59:12] or 11, WTF? [20:59:17] yea [20:59:27] good times [20:59:45] and getting more ..looks global [20:59:51] so palladium is totally blasted on cpu [21:00:00] and it has puppet which is slowing down the fix I pushed for the dupe gid [21:00:01] did somebody work on salt master? [21:00:04] and it has salt [21:00:26] we should be careful, if something is forking processes that could be very bad [21:00:31] yea, maybe we should use salt .. to kill salt [21:00:38] :-) [21:01:16] well as silly as it sounds last time this happened iirc we had to wait out puppet settling down [21:01:22] but I don't remember the minion thing [21:01:41] let me check if the salt thing is real or a monitoring thing [21:01:44] restarting the master just caused more issues and trying to manually run the agents just caused more load [21:02:09] a) puppet failures are gone, i could finish a run on mw1132 [21:02:30] it is a monitor thing, or it doesn't happen anymore [21:02:32] the salt-minion run does not seem to be true here at least [21:02:36] ack [21:02:43] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:02:55] ehm, yea, that's the one i ran manually [21:03:05] mw1165 finished fine for me [21:03:12] so it's just going to take a minute for puppet to settle I think [21:03:23] we are also back to just 70 CRITs already ..
from hundreds [21:03:43] RECOVERY - puppet last run on mw1165 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:03:45] i still don't see how icinga got to the "11 processes" [21:03:47] yeah sorry guys, it's a house of cards and I shook the table a bit [21:03:56] if you break puppet on enough hosts puppet goes insane for a moment? :) [21:04:20] "too much is broken, let's bail out completely" :p [21:05:52] 10Ops-Access-Requests, 6operations: Requesting access to add perf-roots group to graphite role - https://phabricator.wikimedia.org/T117256#1770364 (10chasemp) [21:06:11] (03CR) 10Rush: [C: 04-1] "can you update to perf-roots?" [puppet] - 10https://gerrit.wikimedia.org/r/249966 (owner: 10Ori.livneh) [21:08:17] my downtime finished, and everything seems ok [21:08:29] have a nice evening! [21:08:45] later jynus :) [21:09:26] mutante: I'm sorry I was distracted looking at the puppet thing, yes I used salt to clean up the perf-admin group. I thought you knew. [21:09:33] but it finished fine and fast and the warning I have no idea [21:17:24] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [21:17:43] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [21:17:53] RECOVERY - puppet last run on cp2010 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [21:18:02] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [21:18:12] RECOVERY - puppet last run on mc1012 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [21:18:43] RECOVERY - puppet last run on mw1137 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:18:43] RECOVERY - puppet last run on mw2168 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [21:19:03] RECOVERY - puppet last
run on cp4015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:19:13] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:19:22] RECOVERY - puppet last run on mw2062 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:19:22] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [21:19:23] RECOVERY - puppet last run on mw1230 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:19:33] RECOVERY - puppet last run on cp1073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:19:33] RECOVERY - puppet last run on cp1062 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [21:19:33] RECOVERY - puppet last run on cp1099 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:19:33] RECOVERY - puppet last run on mw1078 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [21:19:42] RECOVERY - puppet last run on mw1206 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:19:42] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:19:42] RECOVERY - puppet last run on mw2098 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [21:19:43] RECOVERY - puppet last run on mc2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:19:43] RECOVERY - puppet last run on mc1008 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [21:19:52] RECOVERY - puppet last run on mw2101 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:19:52] RECOVERY - puppet last run on mw1094 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures 
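The fleet-wide puppet failures in this stretch of the log trace back to a duplicate gid introduced by the perf-admins to perf-roots rename (and fixed shortly after by the "admin: perf-roots different gid" patch). A collision like that is mechanically detectable before merge: a gid is just the third colon-separated field of a group entry, so repeated values in that column are the bug. A minimal sketch, using made-up entries in /etc/group format rather than the real puppet admin data:

```shell
#!/bin/sh
# Sketch: flag duplicate gids in /etc/group-style data (name:x:gid:members).
# The entries below are invented for illustration; on a live host you would
# feed in the output of `getent group` instead.
groupdata='perf-admins:x:730:ori
perf-roots:x:730:ori
wdqs-admins:x:731:smalyshev'

# gid is field 3; `uniq -d` keeps only values that occur more than once.
dupes=$(printf '%s\n' "$groupdata" | cut -d: -f3 | sort | uniq -d)

if [ -n "$dupes" ]; then
    echo "duplicate gid(s): $dupes"
else
    echo "no duplicate gids"
fi
```

Run against `getent group` on any one affected host, the same pipeline would have surfaced the clash immediately; the gid values and group members above are assumptions, not taken from the incident.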
[21:19:53] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [21:19:54] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:20:03] RECOVERY - puppet last run on mw2130 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:20:04] RECOVERY - puppet last run on mw2203 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:20:04] RECOVERY - puppet last run on cp2015 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [21:20:04] RECOVERY - puppet last run on cp2019 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [21:20:04] RECOVERY - puppet last run on mw2085 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:20:04] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [21:20:14] RECOVERY - puppet last run on mw1020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:20:14] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:20:14] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [21:20:23] RECOVERY - puppet last run on mw2174 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:20:23] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:20:23] RECOVERY - puppet last run on mw1084 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:20:23] RECOVERY - puppet last run on mw1075 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:20:32] RECOVERY - puppet last run on mw2111 is OK: OK: Puppet is currently enabled, last run 1 
minute ago with 0 failures [21:20:43] RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [21:20:43] RECOVERY - puppet last run on mw2133 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [21:20:52] RECOVERY - puppet last run on mw2150 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [21:20:52] RECOVERY - puppet last run on mw2172 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:20:52] RECOVERY - puppet last run on mw2200 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:20:52] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:20:53] RECOVERY - puppet last run on mw2094 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [21:21:03] RECOVERY - puppet last run on mw1049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:21:13] RECOVERY - puppet last run on mw2107 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [21:21:22] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:21:33] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:21:43] RECOVERY - puppet last run on mw1180 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:21:54] RECOVERY - puppet last run on mw1030 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [21:22:33] RECOVERY - puppet last run on mw2048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:22:33] RECOVERY - puppet last run on mw2046 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [21:22:33] RECOVERY - puppet last run on mw2053 is OK: OK: Puppet is 
currently enabled, last run 1 minute ago with 0 failures [21:22:33] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [21:23:03] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:24:03] (03CR) 10Ori.livneh: "but the existing team is called perf-admin -- do you mean i should rename it?" [puppet] - 10https://gerrit.wikimedia.org/r/249966 (owner: 10Ori.livneh) [21:25:03] (03CR) 10Rush: "I already renamed it :)" [puppet] - 10https://gerrit.wikimedia.org/r/249966 (owner: 10Ori.livneh) [21:25:39] (03PS4) 10Ori.livneh: Add perf-roots to Graphite role [puppet] - 10https://gerrit.wikimedia.org/r/249966 [21:25:44] ^ chasemp [21:26:01] sweet [21:28:25] 6operations, 6Discovery: Investigate adding memory to elastic10{01...16} to bring more parity between the two types of servers running elasticsearch in eqiad - https://phabricator.wikimedia.org/T117110#1770408 (10EBernhardson) [21:33:30] (03PS1) 10Ori.livneh: Disable xhprof profiling to relieve storage pressure on graphite [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250143 [21:33:32] (03PS1) 10Ori.livneh: Remove build.xml; unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250144 [21:34:37] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: restbase endpoints health checks timing out - https://phabricator.wikimedia.org/T116739#1770438 (10GWicke) [21:34:38] 6operations, 10RESTBase, 6Services: restbase endpoint reporting incorrect content-encoding: gzip - https://phabricator.wikimedia.org/T116911#1770436 (10GWicke) 5Resolved>3Open Somehow I'm still getting gzip encoding in staging, using the curl command in the task description. I have verified that both the... 
[21:35:04] 6operations, 10ops-codfw, 6Labs, 10Labs-Infrastructure: on-site tasks for labs deployment cluster - https://phabricator.wikimedia.org/T117107#1770439 (10Papaul) labtestmetal2001 10.193.1.18 port ge-5//0/8 labtestvirt2001 10.193.1.19 port ge-5/0/17 labtestcontrol2001 10.193... [21:35:34] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#1770446 (10chasemp) This box is to work out the alternative hw case outside of openstack if ironic doesn't pan out [21:40:04] (03CR) 10Ori.livneh: [C: 032] Disable xhprof profiling to relieve storage pressure on graphite [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250143 (owner: 10Ori.livneh) [21:40:13] (03Merged) 10jenkins-bot: Disable xhprof profiling to relieve storage pressure on graphite [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250143 (owner: 10Ori.livneh) [21:41:02] !log ori@tin Synchronized wmf-config/StartProfiler.php: I0bfa21b5: Disable xhprof profiling to relieve storage pressure on graphite (duration: 00m 18s) [21:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:41:44] !log Removing MediaWiki.xhprof.* from graphite{1,2}001 [21:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:49:53] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [21:50:00] chasemp: thanks. i was out for lunch [21:51:02] 6operations, 10MediaWiki-Cache, 6Performance-Team, 10hardware-requests, 7Availability: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1770516 (10Ottomata) All in all, the weaker hardware would be fine as a start, and if we ha... 
[21:51:56] ACKNOWLEDGEMENT - Cassandra CQL query interface on restbase-test2001 is CRITICAL: Connection refused daniel_zahn per test in the hostname
[21:51:56] ACKNOWLEDGEMENT - Cassandra CQL query interface on restbase-test2002 is CRITICAL: Connection refused daniel_zahn per test in the hostname
[21:51:56] ACKNOWLEDGEMENT - Cassandra CQL query interface on restbase-test2003 is CRITICAL: Connection refused daniel_zahn per test in the hostname
[21:52:04] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 28.00% of data above the critical threshold [100000000.0]
[21:52:46] ACKNOWLEDGEMENT - tileratorui on maps-test2003 is CRITICAL: Connection refused daniel_zahn more test hostnames
[21:52:46] ACKNOWLEDGEMENT - tileratorui on maps-test2004 is CRITICAL: Connection refused daniel_zahn more test hostnames
[21:54:24] !log labvirt1010 - start salt
[21:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:54:46] ugh, 503s in icinga
[21:54:54] since 40 seconds or so
[21:57:33] and gone again
[21:58:22] ACKNOWLEDGEMENT - Restbase endpoints health on restbase-test2002 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Test Get MobileApps Main Page returned the unexpected status 500 (expecting: 200): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Err
[21:58:22] ACKNOWLEDGEMENT - Restbase endpoints health on restbase-test2003 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)) daniel_zahn test in host name
[21:59:43] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [500.0]
[22:01:44] !log labvirt1010: salt-minion always in "stop/waiting", doesn't restart
[22:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:02:00] 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1770551 (10RobH) @ori asked me about this task just now in IRC. I wasn't aware of it, since it lacked #hardware-requests.
[22:02:53] (03PS2) 10Ori.livneh: Remove cruft [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250144
[22:03:15] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/1138/" [puppet] - 10https://gerrit.wikimedia.org/r/247204 (owner: 10Muehlenhoff)
[22:03:20] (03PS2) 10Dzahn: labcontrol2001: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247204 (owner: 10Muehlenhoff)
[22:03:44] (03CR) 10Dzahn: [C: 032] "no diff: http://puppet-compiler.wmflabs.org/1138/" [puppet] - 10https://gerrit.wikimedia.org/r/247204 (owner: 10Muehlenhoff)
[22:05:21] (03PS2) 10Dzahn: labnet1002: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247206 (owner: 10Muehlenhoff)
[22:05:23] (03CR) 10Ori.livneh: [C: 032] Remove cruft [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250144 (owner: 10Ori.livneh)
[22:05:29] (03Merged) 10jenkins-bot: Remove cruft [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250144 (owner: 10Ori.livneh)
[22:05:55] !log ori@tin Synchronized docroot and w: (no message) (duration: 00m 17s)
[22:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:07:00] (03CR) 10Dzahn: [C: 032] "no diff: http://puppet-compiler.wmflabs.org/1139/" [puppet] - 10https://gerrit.wikimedia.org/r/247206 (owner: 10Muehlenhoff)
[22:08:16] (03CR) 10Dzahn: [C: 04-1] "http://puppet-compiler.wmflabs.org/1140/labservices1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/247209 (owner: 10Muehlenhoff)
[22:10:24] logmsgbot, docroot and w?
[22:10:32] ori, ^
[22:10:33] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[22:10:43] how did you trigger that?
[22:10:48] sync-docroot
[22:11:39] 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1770574 (10RobH) a:3mark The current rdb1004 is an R420 using an H310 controller with SSDs and 64GB of memory. We have a single spare that appears to match the existing rdb cluster...
[22:11:59] robh: <3
[22:12:22] it needs a minimum of two though :(
[22:12:23] they go in pairs
[22:12:47] (03PS2) 10Dzahn: tin,mira: labsdb::manager to role keyword [puppet] - 10https://gerrit.wikimedia.org/r/250083
[22:13:32] (03CR) 10Dzahn: [C: 032] "no diff: http://puppet-compiler.wmflabs.org/1141/" [puppet] - 10https://gerrit.wikimedia.org/r/250083 (owner: 10Dzahn)
[22:14:30] 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1770587 (10RobH) a:5mark>3RobH
[22:14:59] (03PS2) 10Dzahn: snapshot: no $hostname checks in node groups, use role [puppet] - 10https://gerrit.wikimedia.org/r/250082
[22:15:43] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[22:17:16] (03PS2) 10Dzahn: tin,mira: move base::firewall to deployment role [puppet] - 10https://gerrit.wikimedia.org/r/250091
[22:17:20] (03CR) 10jenkins-bot: [V: 04-1] tin,mira: move base::firewall to deployment role [puppet] - 10https://gerrit.wikimedia.org/r/250091 (owner: 10Dzahn)
[22:18:25] (03PS3) 10Dzahn: tin,mira: move base::firewall to deployment role [puppet] - 10https://gerrit.wikimedia.org/r/250091
[22:20:21] (03CR) 10Dzahn: [C: 04-1] "Syntax error at ':'; expected '}' at /mnt/jenkins-workspace/workspace/pplint-HEAD/manifests/site.pp:2169" [puppet] - 10https://gerrit.wikimedia.org/r/250080 (owner: 10Dzahn)
[22:21:45] 6operations, 10Analytics, 6Discovery, 10EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1770604 (10Ottomata) Today's [[ https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-10-30-18.17.html | EventBus RFC ]] [[ http://bots.wmflabs.org/~wm-bot/logs/...
[22:25:41] (03CR) 10Dzahn: [C: 04-1] "http://puppet-compiler.wmflabs.org/1144/" [puppet] - 10https://gerrit.wikimedia.org/r/250078 (owner: 10Dzahn)
[22:27:09] (03PS4) 10Dzahn: tin,mira: move base::firewall to deployment role [puppet] - 10https://gerrit.wikimedia.org/r/250091
[22:27:32] (03CR) 10Dzahn: [C: 032] "no diff: http://puppet-compiler.wmflabs.org/1145/" [puppet] - 10https://gerrit.wikimedia.org/r/250091 (owner: 10Dzahn)
[22:32:04] 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1770630 (10RobH) The existing rdb systems are high performance misc systems using H310 controllers. A lot of the older H310 systems were used wherever we could find a place for them,...
[22:33:56] (03CR) 10Krinkle: Fixed getMWScriptWithArgs() user error message (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249803 (owner: 10Aaron Schulz)
[22:34:04] 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1770633 (10RobH) a:5RobH>3None I'll be keeping an eye on this since it is now in #hardware-requests, however I'm not sure if @ori, @aaron, or @joe should field the questions I pos...
[22:35:35] (03PS2) 10Dzahn: swift: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250063
[22:36:00] (03CR) 10Dzahn: [C: 032] "no diff: http://puppet-compiler.wmflabs.org/1146/" [puppet] - 10https://gerrit.wikimedia.org/r/250063 (owner: 10Dzahn)
[22:36:53] (03PS2) 10Dzahn: tin,mira: move standard to role class [puppet] - 10https://gerrit.wikimedia.org/r/250103
[22:38:08] (03PS3) 10Dzahn: tin,mira: move standard to role class [puppet] - 10https://gerrit.wikimedia.org/r/250103
[22:39:13] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/250103 (owner: 10Dzahn)
[22:40:12] mutante: If you have a sec, https://gerrit.wikimedia.org/r/243838
[22:40:32] (03CR) 10Dzahn: [C: 032] "no diff: http://puppet-compiler.wmflabs.org/1147/" [puppet] - 10https://gerrit.wikimedia.org/r/250103 (owner: 10Dzahn)
[22:44:54] !log ori@tin Synchronized php-1.27.0-wmf.4/includes/objectcache/ObjectCache.php: If12aedae7f: objectcache: Use singleton cache in newAccelerator() (duration: 00m 18s)
[22:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:48:02] (03PS3) 10Aaron Schulz: Fix getMWScriptWithArgs() user error message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249803
[22:48:35] (03PS2) 10Dzahn: labsdb1004 - use role keyword [puppet] - 10https://gerrit.wikimedia.org/r/250069
[22:49:38] (03CR) 10Dzahn: [C: 032] "no diff http://puppet-compiler.wmflabs.org/1148/" [puppet] - 10https://gerrit.wikimedia.org/r/250069 (owner: 10Dzahn)
[22:58:47] (03CR) 10Dzahn: [C: 04-1] ""Role class role::scap::scripts not found"" [puppet] - 10https://gerrit.wikimedia.org/r/250076 (owner: 10Dzahn)
[23:03:09] !log Added MaxSem and gwicke to Gerrit Project Creators group
[23:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:03:38] !log fixed restbase labs install by deploying it; the code had gone missing
[23:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:03:47] that group either needs to be renamed or gerrit ACLs fixed en-masse
[23:04:22] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0]
[23:04:30] Krenair, what do you mean?
[23:04:55] it controls a large amount of projects/groups
[23:05:00] (03PS2) 10Dzahn: eventlogging,labsdns,redisdb: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250067
[23:05:23] (03CR) 10Dzahn: [C: 032] "no diff: http://puppet-compiler.wmflabs.org/1150/" [puppet] - 10https://gerrit.wikimedia.org/r/250067 (owner: 10Dzahn)
[23:07:33] (03CR) 10Dzahn: [C: 04-1] ""Could not run: Conflicting value for cluster found in role analytics"" [puppet] - 10https://gerrit.wikimedia.org/r/250068 (owner: 10Dzahn)
[23:08:28] (03CR) 10Dzahn: "so the roles analytics::kafka::server and analytics conflict with each other because of $cluster" [puppet] - 10https://gerrit.wikimedia.org/r/250068 (owner: 10Dzahn)
[23:09:39] (03CR) 10Dzahn: [C: 04-1] "invalid secret ssl/gerrit.wikimedia.org.key" [puppet] - 10https://gerrit.wikimedia.org/r/250066 (owner: 10Dzahn)
[23:20:45] (03PS2) 10Ori.livneh: Re-enabled sidebar cache per 47eb083a0fe4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249674 (owner: 10Aaron Schulz)
[23:24:04] (03CR) 10Dzahn: "fixed in https://gerrit.wikimedia.org/r/#/c/250155/ along with all other missing keys" [puppet] - 10https://gerrit.wikimedia.org/r/250066 (owner: 10Dzahn)
[23:25:50] (03PS2) 10Dzahn: udp2log,gerritl,labsdb,camus: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250066
[23:25:59] (03CR) 10Dzahn: [C: 032] "no diff: http://puppet-compiler.wmflabs.org/1153/" [puppet] - 10https://gerrit.wikimedia.org/r/250066 (owner: 10Dzahn)
[23:29:26] ori: You're deploying in a minute I presume?
[23:29:31] yeah
[23:29:42] https://gerrit.wikimedia.org/r/250157
[23:30:16] got it
[23:37:25] !log ori@tin Synchronized php-1.27.0-wmf.4/extensions/DismissableSiteNotice: a20aa8a544: Updated mediawiki/core Project: mediawiki/extensions/DismissableSiteNotice 5b8f1cfac48704fd740eb61f873aabb600c4c5fb (duration: 00m 17s)
[23:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:37:43] !log ori@tin Synchronized php-1.27.0-wmf.4/extensions/Collection: b89cff0a1d: Updated mediawiki/core Project: mediawiki/extensions/Collection 092204bf333e65b2c749e4bc4a32fd2b0254089b (duration: 00m 18s)
[23:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:38:02] !log ori@tin Synchronized php-1.27.0-wmf.4/includes/skins/Skin.php: Id9a27ba2bbd3 (duration: 00m 17s)
[23:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:38:20] !log ori@tin Synchronized php-1.27.0-wmf.4/includes/cache/MessageCache.php: Id9a27ba2bbd3 (duration: 00m 17s)
[23:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:38:53] (03PS3) 10Ori.livneh: Re-enabled sidebar cache per 47eb083a0fe4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249674 (owner: 10Aaron Schulz)
[23:38:57] (03CR) 10Ori.livneh: [C: 032] Re-enabled sidebar cache per 47eb083a0fe4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249674 (owner: 10Aaron Schulz)
[23:39:05] (03Merged) 10jenkins-bot: Re-enabled sidebar cache per 47eb083a0fe4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249674 (owner: 10Aaron Schulz)
[23:41:17] 6operations, 10Sentry, 10hardware-requests: Procure hardware for Sentry - https://phabricator.wikimedia.org/T93138#1770785 (10Tgr) Thanks for keeping this in mind, Rob :) There are two blockers, integrating with some Wikimedia authentication method (T97133) and fixing the SMTP configuration (T116709). They a...
[23:43:30] (03PS3) 10Dzahn: logstash: no "if $hostname" in node blocks, use role [puppet] - 10https://gerrit.wikimedia.org/r/250070
[23:45:41] (03PS2) 10Dzahn: openstack: some more lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250064
[23:46:27] (03CR) 10Dzahn: [C: 031] logstash: no "if $hostname" in node blocks, use role [puppet] - 10https://gerrit.wikimedia.org/r/250070 (owner: 10Dzahn)
[23:46:32] (03PS4) 10Dzahn: logstash: no "if $hostname" in node blocks, use role [puppet] - 10https://gerrit.wikimedia.org/r/250070
[23:51:20] (03PS4) 10Dzahn: swift: no "if $hostname" in node blocks, use role [puppet] - 10https://gerrit.wikimedia.org/r/250072
[23:53:01] !log ori@tin Synchronized wmf-config/CommonSettings.php: Ifd3543d99: Re-enabled sidebar cache per 47eb083a0fe4 (duration: 00m 17s)
[23:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:56:40] 6operations, 10Deployment-Systems, 5Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1770800 (10Dzahn) - moved "labsdbmanager" to role keyword: https://gerrit.wikimedia.org/r/#/c/250083/ - moved base::firewall to the deployment-server role https://g...
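[Editor's note] The recurring 'use the role keyword' / 'no "if $hostname" in node blocks' patches above all apply the same site.pp cleanup: replacing per-host conditionals inside shared node blocks with a direct role declaration. A minimal before/after sketch of the pattern, using a hypothetical node and role name (not taken from the actual patches):

```puppet
# Before: a regex node block shared by several hosts, with
# per-host behavior selected by a hostname conditional
node /^example100[13]\.eqiad\.wmnet$/ {
    if $::hostname == 'example1003' {
        include role::example::special
    }
    include standard
}

# After: each host gets its own node block that declares
# its role directly via the role keyword
node 'example1003.eqiad.wmnet' {
    role example::special
    include standard
}
```

The "no diff" puppet-compiler links cited in the reviews are how these refactors are verified: the compiled catalog for each affected host must be identical before and after the change.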