[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160212T0000). Please do the needful. [00:00:04] bd808 matt_flaschen Dereckson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:24] o/ [00:00:29] !log add niedzielski to nda LDAP group (T106064) [00:00:31] previous window is overrunning [00:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:01:11] PROBLEM - HHVM rendering on mw2173 is CRITICAL: Connection refused [00:01:12] I tried running the script again and it got further, but eventually died with the same error about not being able to connect to elastic1001:9300 [00:01:34] (PS1) BBlack: misc-web: reduce TTLs from 1H to 600 [dns] - https://gerrit.wikimedia.org/r/270144 [00:01:42] PROBLEM - puppet last run on mw2173 is CRITICAL: Connection refused by host [00:01:42] PROBLEM - DPKG on mw2173 is CRITICAL: Connection refused by host [00:01:42] PROBLEM - Disk space on mw2173 is CRITICAL: Connection refused by host [00:01:51] Krenair: elastic1001 is listening on 9300, it's just firewalled off from everything except other elasticsearch nodes [00:02:05] Uh. 
[00:02:12] PROBLEM - HHVM processes on mw2173 is CRITICAL: Connection refused by host [00:02:12] PROBLEM - configured eth on mw2173 is CRITICAL: Connection refused by host [00:02:12] PROBLEM - salt-minion processes on mw2173 is CRITICAL: Connection refused by host [00:02:13] PROBLEM - Apache HTTP on mw2173 is CRITICAL: Connection refused [00:02:14] PROBLEM - RAID on mw2173 is CRITICAL: Connection refused by host [00:02:15] (PS1) Dereckson: Use extension registration for Math [mediawiki-config] - https://gerrit.wikimedia.org/r/270145 (https://phabricator.wikimedia.org/T119117) [00:02:17] Present [00:02:27] It needs to be accessible from terbium, ebernhardson [00:02:35] Krenair: no it doesn't, under any circumstances [00:02:44] Krenair: it is an internal inter-node communication port [00:02:48] probably :9200 [00:02:57] Then we'll have to disable CirrusSearch by default, ebernhardson [00:03:07] Krenair: i think you just don't understand the error message :P [00:03:32] RemoteTransportException[[elastic1001][inet[/10.64.0.108:9300]][indices:admin/mapping/put]]; nested:ProcessClusterEventTimeoutException[failed to process cluster event (put-mapping [page]) [00:03:33] and have you set up indexes for that separately [00:03:41] It cannot remain part of addWiki.php while behaving like this [00:03:42] PROBLEM - Check size of conntrack table on mw2173 is CRITICAL: Connection refused by host [00:03:43] that says that node X tried talking to node Y, and it timed out [00:03:55] it does not mean that terbium is talking on port 9300 [00:04:31] PROBLEM - dhclient process on mw2173 is CRITICAL: Connection refused by host [00:04:37] so something is wrong between the elastic nodes? 
[00:04:51] PROBLEM - nutcracker port on mw2173 is CRITICAL: Connection refused by host [00:04:54] Krenair: the elasticsearch cluster doesn't like having 3k indices, it would prefer to have 300 [00:05:00] Krenair: just increase the timeout to 10m or something [00:05:12] PROBLEM - nutcracker process on mw2173 is CRITICAL: Connection refused by host [00:05:13] Krenair: or run it at a less busy time of the day [00:05:49] I have no idea how to make the script use a non-default timeout [00:06:45] we really shouldn't have to schedule wiki creations just to fit in with the search cluster's less busy hours [00:06:55] eh, mw2173.. that server is installing [00:06:56] Krenair: https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/CirrusSearch-production.php#L11 [00:07:05] ori: [00:07:05] Krenair: buy me 40 more servers :P [00:07:19] Not my decision [00:07:41] Krenair: right, we don't live in an optimal world. We all have to work around unoptimal things [00:07:42] > we really shouldn't have to schedule wiki creations just to fit in with the search cluster's less busy hours [00:07:44] why not? [00:07:49] how often are wikis created? 
[00:08:00] few times a year [00:08:25] let's take a look at addWiki.php and see if we can figure out how to increase the timeout [00:08:48] seems silly just to make it fit in with search, of all things [00:08:51] # Create new search index [00:08:51] $searchIndex = $this->runChild( 'CirrusSearch\Maintenance\UpdateSearchIndexConfig' ); [00:08:51] $searchIndex->mOptions[ 'baseName' ] = $dbName; [00:08:51] $searchIndex->execute(); [00:08:54] is the relevant code [00:09:06] essentially "mwscript extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php aawiki --baseName adywiki" [00:09:07] i would have to make a small adjustment to a cirrus file to take the value over cli args, currently it takes it from the global param i linked above [00:09:34] https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/CirrusSearch-production.php#L11 [00:09:59] let's just set that to a higher value in addWiki.php? [00:10:24] (PS2) Dereckson: Use extension registration for Math [mediawiki-config] - https://gerrit.wikimedia.org/r/270145 (https://phabricator.wikimedia.org/T119117) [00:10:28] or does it have to already be set to the desired value when Cirrus's entrypoint is loaded? [00:10:54] ori: i can adjust the maint script to take it from the cli args, sec [00:11:19] ebernhardson: but presumably updating search indices should always run with a higher timeout, no? 
[00:11:55] doesn't matter too much I guess [00:12:10] ori: i mean, there probably isn't much harm to just making the default timeout 10m from mediawiki-config [00:12:15] I think all of the remaining adywiki issues can be resolved without deployments, let's do swat [00:12:47] (PS1) Dereckson: Test file backend for Math on labtestwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/270146 (https://phabricator.wikimedia.org/T126628) [00:12:57] ebernhardson: maybe vary on PHP_SAPI or something [00:13:24] ori: if anything is doing master operations from a web process something else is really wrong :) [00:13:39] Extension:Puppet [00:13:41] Krenair: wgMathPath is at //upload.wikimedia.org/math by default. If we use local-multiwrite, it should be relative to the current wiki shouldn't it? [00:13:50] ori: there is a puppet extension O.o [00:13:51] (that's not a real thing :P) [00:13:53] :P [00:14:52] bd808, sending your commit through jenkins [00:15:05] o/ [00:16:17] (PS2) Dereckson: Test file backend for Math on labtestwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/270146 (https://phabricator.wikimedia.org/T126628) [00:16:35] (CR) Dereckson: "PS2: local wgMathPath" [mediawiki-config] - https://gerrit.wikimedia.org/r/270146 (https://phabricator.wikimedia.org/T126628) (owner: Dereckson) [00:16:39] (CR) Alex Monk: [C: 2] Use Wikipedia logo for all Wikipedias for Echo notifications [mediawiki-config] - https://gerrit.wikimedia.org/r/270121 (https://phabricator.wikimedia.org/T49662) (owner: Mattflaschen) [00:16:55] matt_flaschen, also doing yours ^ while waiting for jenkins in mediawiki repos [00:17:12] !log `nodetool stop -- COMPACTION' on restbase1002.eqiad to free disk space (https://phabricator.wikimedia.org/P2598) [00:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:17:50] (Merged) jenkins-bot: Use Wikipedia logo for all Wikipedias for Echo notifications [mediawiki-config] - 
https://gerrit.wikimedia.org/r/270121 (https://phabricator.wikimedia.org/T49662) (owner: Mattflaschen) [00:18:11] !log Depooled mw2173 [00:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:18:44] operations: mw2173 mystery install - https://phabricator.wikimedia.org/T126694#2021183 (Dzahn) NEW [00:19:02] Thanks, Krenair [00:19:36] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/270121/ (duration: 01m 14s) [00:19:39] matt_flaschen, ^ [00:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:19:51] Testing [00:20:52] Works on Spanish Wikipedia [00:21:03] ok, thanks [00:21:07] Krenair: i created those indices [00:21:10] ebernhardson, thanks [00:21:36] ebernhardson, hmm... do we have to do something special to get MW to index the main page now? [00:22:06] (CR) Alex Monk: [C: 2] "thanks" [mediawiki-config] - https://gerrit.wikimedia.org/r/270140 (owner: Dereckson) [00:22:33] (Merged) jenkins-bot: Fix wgUploadPath for labtestwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/270140 (owner: Dereckson) [00:22:39] Krenair: typically https://wikitech.wikimedia.org/wiki/Search#Full_reindex i can run it [00:23:25] ebernhardson, it's not that important, I suppose the first edit will index this right? 
[00:23:46] Krenair: any edit will index it as well, yes [00:24:23] ok, that'll be fine [00:26:07] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/270140/ (duration: 01m 17s) [00:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:28:51] (PS4) Cmjohnson: admin: add shell users for frack pentest [puppet] - https://gerrit.wikimedia.org/r/268722 (https://phabricator.wikimedia.org/T126012) (owner: Dzahn) [00:30:13] (CR) jenkins-bot: [V: -1] admin: add shell users for frack pentest [puppet] - https://gerrit.wikimedia.org/r/268722 (https://phabricator.wikimedia.org/T126012) (owner: Dzahn) [00:30:30] !log krenair@mira Synchronized php-1.27.0-wmf.13/includes/session/MetadataMergeException.php: https://gerrit.wikimedia.org/r/#/c/270126/ (duration: 01m 14s) [00:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:32:31] !log krenair@mira Synchronized php-1.27.0-wmf.13/autoload.php: https://gerrit.wikimedia.org/r/#/c/270126/ (duration: 01m 16s) [00:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:34:36] !log restbase deploy start of 6f6311f [00:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:35:16] !log krenair@mira Synchronized php-1.27.0-wmf.13: https://gerrit.wikimedia.org/r/#/c/270126/ (duration: 02m 26s) [00:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:36:27] bd808, ^ [00:37:05] * bd808 looks at logstash [00:37:53] Krenair: works! 
[00:39:49] (CR) Alex Monk: [C: 2] Use extension registration for Math [mediawiki-config] - https://gerrit.wikimedia.org/r/270145 (https://phabricator.wikimedia.org/T119117) (owner: Dereckson) [00:40:35] Dereckson, pin [00:40:37] ping [00:40:59] (Merged) jenkins-bot: Use extension registration for Math [mediawiki-config] - https://gerrit.wikimedia.org/r/270145 (https://phabricator.wikimedia.org/T119117) (owner: Dereckson) [00:40:59] pong [00:45:01] !log krenair@mira Synchronized wmf-config: https://gerrit.wikimedia.org/r/#/c/270145/ (duration: 01m 16s) [00:45:04] Dereckson, ^ [00:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:45:09] Testing. [00:45:36] lgtm [00:45:37] Works. [00:45:52] ok, swat done [00:46:06] Thanks for the deploy. [00:46:13] A quick question about Math: [00:46:29] it's hard [00:47:33] 00:13 < Dereckson> Krenair: wgMathPath is at //upload.wikimedia.org/math by default. If we use local-multiwrite, it should be relative to the current wiki shouldn't it? [00:48:09] Dereckson, I'm not familiar with file backends, but yes I think so [00:49:07] bd808: har har [00:49:21] Okay. [00:49:27] I'll be here all week. Try the veal! [00:54:27] okay, I think adywiki should be editable now [01:07:34] operations, OTRS, Patch-For-Review: Upload AgentLogo file to OTRS skins directory - https://phabricator.wikimedia.org/T125912#2021298 (Az1568) Resolved>Open Re-opening this after consulting with RD and coming up with a new and improved version of the background. Please upload this and replace the... 
[01:07:42] (PS1) Tim Landscheidt: Remove unused type misc::limn::instance [puppet] - https://gerrit.wikimedia.org/r/270151 [01:12:51] RECOVERY - Restbase root url on restbase1007 is OK: HTTP OK: HTTP/1.1 200 - 15234 bytes in 0.009 second response time [01:13:04] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [01:18:48] still haven't figured out why the newprojects email is missing 'Wikipedia' in "for a in адыгабзэ" [01:18:52] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.153, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [01:19:36] operations, OTRS, Patch-For-Review: Upload AgentLoginLogo file to OTRS skins directory - https://phabricator.wikimedia.org/T125911#2021319 (Az1568) Resolved>Open Re-opening this after consulting with RD, I'm uploading a version of this file that has been optimized for the login screen. Please rep... 
[01:19:52] PROBLEM - Restbase root url on restbase2002 is CRITICAL: Connection refused [01:22:57] !log restbase2002 re-enable puppet [01:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:23:32] PROBLEM - puppet last run on restbase2002 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [01:24:46] you're a bit late icinga-wm [01:25:03] RECOVERY - Restbase root url on restbase2002 is OK: HTTP OK: HTTP/1.1 200 - 15234 bytes in 0.112 second response time [01:25:21] RECOVERY - puppet last run on restbase2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:26:01] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [01:26:15] !log restbase deploy end of 6f6311f [01:26:20] finally [01:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:00:03] operations, ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, Services, and 2 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2021398 (mobrovac) @KartikMistry ping? It'd be nice to have this ready soon. 
[02:03:03] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0] [02:04:31] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 63.64% of data above the critical threshold [5000000.0] [02:10:11] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [02:11:33] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [02:19:08] operations, ops-codfw: es2011-es2020 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2021411 (Papaul) [02:30:12] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: puppet fail [02:46:12] (PS1) Tim Landscheidt: access_new_install: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270167 [02:52:04] (PS3) Andrew Bogott: Switch keystone to mysql assignment from ldap. [puppet] - https://gerrit.wikimedia.org/r/268325 (https://phabricator.wikimedia.org/T115029) [02:54:26] operations, Beta-Cluster-Infrastructure, MediaWiki-extensions-CentralAuth, Wikimedia-Apache-configuration: Special:CentralAutoLogin/checkLoggedIn redirects to wikimediafoundation.org on Beta Cluster - https://phabricator.wikimedia.org/T126697#2021446 (MaxSem) [02:57:53] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:57:58] (PS1) Tim Landscheidt: aqs: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270168 [03:01:10] (PS1) Andrew Bogott: Update admin_project_id in the keystone config [puppet] - https://gerrit.wikimedia.org/r/270169 (https://phabricator.wikimedia.org/T115029) [03:19:55] (PS1) Tim Landscheidt: smokeping: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270170 [03:26:11] PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: Puppet has 1 
failures [03:29:18] (PS1) Tim Landscheidt: annualreport: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270171 [03:29:41] (PS1) Tim Landscheidt: archiva: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270172 [03:30:00] (PS1) Tim Landscheidt: cassandra: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270173 [03:30:22] (PS1) Tim Landscheidt: citoid: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270174 [03:30:37] (PS1) Tim Landscheidt: cxserver: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270175 [03:30:55] (PS1) Tim Landscheidt: diamond: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270176 [03:31:29] (PS1) Tim Landscheidt: etcd: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270177 [03:31:55] (PS1) Tim Landscheidt: etherpad: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270178 [03:32:28] (PS1) Tim Landscheidt: extdist: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270179 [03:32:52] (PS1) Tim Landscheidt: ganeti: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270180 [03:33:14] (PS1) Tim Landscheidt: gdash: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270181 [03:33:31] (PS1) Tim Landscheidt: gitblit: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270182 [03:33:54] (PS1) Tim Landscheidt: grafana: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270183 [03:34:12] (PS1) Tim Landscheidt: graphoid: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270184 [03:34:29] (PS1) Tim Landscheidt: horizon: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270185 [03:34:45] (PS1) Tim Landscheidt: icinga: Move 
role class to module role [puppet] - https://gerrit.wikimedia.org/r/270186 [03:35:03] (PS1) Tim Landscheidt: iegreview: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270187 [03:35:19] (PS1) Tim Landscheidt: ipmi: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270188 [03:35:35] (PS1) Tim Landscheidt: ipsec: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270189 [03:35:37] (CR) Andrew Bogott: "Thank you for the organizing! I'm interested in these but probably won't have room in my brain to look at them until next week." [puppet] - https://gerrit.wikimedia.org/r/269902 (owner: Tim Landscheidt) [03:35:50] (PS1) Tim Landscheidt: ipv6relay: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270190 [03:36:06] (PS1) Tim Landscheidt: ircyall: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270191 [03:36:21] (PS1) Tim Landscheidt: jobqueue_redis: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270192 [03:36:37] (PS1) Tim Landscheidt: jsbench: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270193 [03:36:52] (PS1) Tim Landscheidt: kibana: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270194 [03:37:11] (PS1) Tim Landscheidt: librenms: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270195 [03:37:25] (PS1) Tim Landscheidt: mathoid: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270196 [03:37:41] (PS1) Tim Landscheidt: memcached: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270197 [03:37:59] (PS1) Tim Landscheidt: mobileapps: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270198 [03:38:16] (PS1) Tim Landscheidt: mw_rc_irc: Move role class to module role [puppet] - 
https://gerrit.wikimedia.org/r/270199 [03:38:37] (PS1) Tim Landscheidt: noc: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270200 [03:39:00] (PS1) Tim Landscheidt: ntp: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270201 [03:39:18] (PS1) Tim Landscheidt: ocg: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270202 [03:39:47] (PS1) Tim Landscheidt: otrs: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270203 [03:40:14] (PS1) Tim Landscheidt: parsoid_rt_client: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270204 [03:40:30] (PS1) Tim Landscheidt: parsoid_rt_server: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270205 [03:40:44] (PS1) Tim Landscheidt: parsoid_vd_client: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270206 [03:41:02] (PS1) Tim Landscheidt: parsoid_vd_server: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270207 [03:41:17] (PS1) Tim Landscheidt: performance: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270208 [03:41:31] (PS1) Tim Landscheidt: phragile: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270209 [03:41:51] (PS1) Tim Landscheidt: piwik: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270210 [03:42:40] (PS1) Tim Landscheidt: planet: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270211 [03:44:15] (PS1) Tim Landscheidt: pmacct: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270212 [03:44:30] (PS1) Tim Landscheidt: poolcounter: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270213 [03:44:46] (PS1) Tim Landscheidt: puppet_compiler: Move role class to module role [puppet] - 
https://gerrit.wikimedia.org/r/270214 [03:45:19] (PS1) Tim Landscheidt: pybal_config: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270215 [03:45:36] (PS1) Tim Landscheidt: racktables: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270216 [03:46:02] (PS1) Tim Landscheidt: rancid: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270217 [03:46:17] (PS1) Tim Landscheidt: sca: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270218 [03:46:32] (PS1) Tim Landscheidt: scb: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270219 [03:47:06] (PS1) Tim Landscheidt: sentry: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270220 [03:48:01] (PS1) Tim Landscheidt: servermon: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270221 [03:48:24] (PS1) Tim Landscheidt: simplelap: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270222 [03:48:41] (PS1) Tim Landscheidt: simplestatic: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270223 [03:48:57] (PS1) Tim Landscheidt: spare: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270224 [03:49:15] (PS1) Tim Landscheidt: statsdlb: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270225 [03:49:58] (PS1) Tim Landscheidt: statsite: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270226 [03:50:14] (PS1) Tim Landscheidt: tcpircbot: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270227 [03:50:54] (PS1) Tim Landscheidt: tendril: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270228 [03:51:14] (PS1) Tim Landscheidt: torrus: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270229 [03:51:59] 
(PS1) Tim Landscheidt: transparency: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270230 [03:52:12] RECOVERY - puppet last run on mw1240 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [03:52:15] (PS1) Tim Landscheidt: url_downloader: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270231 [03:52:47] (PS1) Tim Landscheidt: ve: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270232 [03:53:07] (PS1) Tim Landscheidt: wdqs: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270233 [03:53:22] (PS1) Tim Landscheidt: webperf: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270234 [03:53:37] (PS1) Tim Landscheidt: wikimania_scholarships: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270235 [03:54:05] (PS1) Tim Landscheidt: wikistats: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270236 [03:54:22] (PS1) Tim Landscheidt: xenon: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270237 [03:54:38] (PS1) Tim Landscheidt: yubiauth: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270238 [03:54:58] (PS1) Tim Landscheidt: zotero: Move role class to module role [puppet] - https://gerrit.wikimedia.org/r/270239 [03:58:21] operations, ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, Services, and 2 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2021542 (KartikMistry) @mobrovac, Yes. We will move cxserver first (Scheduled 16th Feb) and me (and Alex) will w... 
[04:15:43] Ops-Access-Requests, operations, Services, Patch-For-Review: Requesting restbase-roots access to RESTBase cluster for Petr Pchelko - https://phabricator.wikimedia.org/T126283#2021553 (Cmjohnson) User added but adding to restbase-roots group will require an approval in ops meeting. [04:16:58] operations: Decom berkelium/curium? - https://phabricator.wikimedia.org/T125962#2021556 (Cmjohnson) Any update on this? [04:31:30] operations, Beta-Cluster-Infrastructure, MediaWiki-extensions-CentralAuth, Wikimedia-Apache-configuration: Special:CentralAutoLogin/checkLoggedIn redirects to wikimediafoundation.org on Beta Cluster - https://phabricator.wikimedia.org/T126697#2021579 (Tgr) Maybe during T124804 some redirect respon... [04:38:01] PROBLEM - puppet last run on db2053 is CRITICAL: CRITICAL: puppet fail [04:50:22] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave] [04:55:32] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 [05:04:12] RECOVERY - puppet last run on db2053 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [05:09:02] RECOVERY - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is OK: TCP OK - 0.003 second response time on port 9042 [05:21:01] (PS1) Tim Landscheidt: analytics: Move role classes to module role [puppet] - https://gerrit.wikimedia.org/r/270242 [05:40:11] PROBLEM - puppet last run on mw1193 is CRITICAL: CRITICAL: Puppet has 1 failures [05:48:58] !log restbase *staging*: started low-concurrency test dump run from ruthenium against xenon [05:49:01] !log re-enabling cr2-knams xe-0/0/2 (Tele2) [05:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:49:05] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:07:43] RECOVERY - puppet last run on mw1193 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:30:11] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:12] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:42] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:02] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:42] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:21] PROBLEM - puppet last run on nobelium is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:22] PROBLEM - puppet last run on mw2035 is CRITICAL: CRITICAL: puppet fail [06:32:31] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 3 failures [06:32:41] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:42] PROBLEM - puppet last run on ms-be1005 is CRITICAL: CRITICAL: puppet fail [06:32:42] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:43] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:43] PROBLEM - puppet last run on mw2088 is CRITICAL: CRITICAL: puppet fail [06:33:02] PROBLEM - puppet last run on db1049 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:12] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:21] PROBLEM - puppet last run on mw1195 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:32] PROBLEM - puppet last run on mw1085 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:51] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:51] PROBLEM - puppet last run on wtp2019 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:52] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: 
Puppet has 1 failures [06:34:05] PROBLEM - puppet last run on mw2198 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:11] PROBLEM - puppet last run on wtp1012 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:11] PROBLEM - puppet last run on mw1157 is CRITICAL: CRITICAL: Puppet has 1 failures [06:38:17] <_joe_> uhm quite a lot today [06:51:02] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [5000000.0] [06:56:11] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:56:24] RECOVERY - puppet last run on nobelium is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:56:33] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:56:43] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:56:52] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:56:52] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:57:21] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:57:41] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:57:51] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:52] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:57:52] RECOVERY - puppet last run on wtp2019 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:58:01] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 43 seconds ago 
with 0 failures [06:58:11] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:58:12] RECOVERY - puppet last run on mw1157 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:58:22] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:32] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:52] RECOVERY - puppet last run on db1049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:02] RECOVERY - puppet last run on mw1195 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:21] RECOVERY - puppet last run on mw1085 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:53] RECOVERY - puppet last run on mw2198 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:00:11] RECOVERY - puppet last run on mw2035 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [07:00:22] RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:00:32] RECOVERY - puppet last run on mw2088 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:01:34] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [07:13:52] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave]BR [07:14:23] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, 
IC-313592, 51ms) {#11372} [10Gbps wave]BR [07:19:41] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [07:20:42] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 [07:55:36] !log repooling elastic1023, hw problem has been fixed [07:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:16:16] (03PS3) 10Ema: Display a message in motd if puppet agent is disabled [puppet] - 10https://gerrit.wikimedia.org/r/268684 [08:20:07] (03CR) 10Ema: Display a message in motd if puppet agent is disabled (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268684 (owner: 10Ema) [08:20:27] (03PS2) 10Muehlenhoff: Add ferm rules for rsyncd used in eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/269983 [08:20:33] (03CR) 10Ema: [C: 032 V: 032] Display a message in motd if puppet agent is disabled [puppet] - 10https://gerrit.wikimedia.org/r/268684 (owner: 10Ema) [08:20:50] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for rsyncd used in eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/269983 (owner: 10Muehlenhoff) [08:20:56] 6operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2021716 (10Peter) Would be great if we could make timeline for when we can start to test performance for the user. Want to know if T125208 will be an issue also for H/2 or if it's only SPDY (hopefully not but... [08:22:56] <_joe_> ema: we should add that to the icinga check too [08:23:09] <_joe_> so i can see from icinga the reason puppet is disabled [08:24:14] _joe_: yes. 
And maybe also add ori's awesome script to puppet [08:24:20] (03PS2) 10Muehlenhoff: Add ferm rules for eventlogging udp receiver [puppet] - 10https://gerrit.wikimedia.org/r/269986 (https://phabricator.wikimedia.org/T113343) [08:24:35] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for eventlogging udp receiver [puppet] - 10https://gerrit.wikimedia.org/r/269986 (https://phabricator.wikimedia.org/T113343) (owner: 10Muehlenhoff) [08:25:51] oh, it is already (though in his $HOME) [08:27:37] <_joe_> yeah or.i's dotfiles are a treasure trove :P [08:30:07] (03CR) 10DCausse: "Awesome!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) (owner: 10Gehel) [08:41:19] (03CR) 10Hashar: [C: 031] contint: Use slave-scripts/bin/php wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) (owner: 10Legoktm) [08:44:09] (03Abandoned) 10Hashar: package_builder: set HOOKDIR only when it exists [puppet] - 10https://gerrit.wikimedia.org/r/269095 (https://phabricator.wikimedia.org/T125999) (owner: 10Hashar) [08:46:31] PROBLEM - puppet last run on mw1146 is CRITICAL: CRITICAL: Puppet has 7 failures [08:47:23] (03CR) 10Ema: [C: 031] VCL: drop default ttl_cap to 21 days [puppet] - 10https://gerrit.wikimedia.org/r/269968 (https://phabricator.wikimedia.org/T124954) (owner: 10BBlack) [09:03:33] (03PS2) 10Muehlenhoff: Don't automatically update openssh-client [puppet] - 10https://gerrit.wikimedia.org/r/264096 [09:06:09] (03CR) 10Muehlenhoff: [C: 032 V: 032] Don't automatically update openssh-client [puppet] - 10https://gerrit.wikimedia.org/r/264096 (owner: 10Muehlenhoff) [09:36:44] !log restbase1002 cleanup nodetool cleanup local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ && stop compactions [09:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:39:25] !log restbase1001 nodetool cleanup && nodetool stop compaction [09:39:28] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:46:19] * godog hugs bd808 https://phabricator.wikimedia.org/T108720#2021623 [09:53:32] (03PS4) 10Ema: Display a message in motd if puppet agent is disabled [puppet] - 10https://gerrit.wikimedia.org/r/268684 [09:53:58] (03CR) 10Ema: [V: 032] Display a message in motd if puppet agent is disabled [puppet] - 10https://gerrit.wikimedia.org/r/268684 (owner: 10Ema) [10:08:13] PROBLEM - puppet last run on strontium is CRITICAL: CRITICAL: puppet fail [10:09:28] (03CR) 10Faidon Liambotis: [C: 031] "No objection, but given the amount of these hostnames and the low popularity of most of them, I do wonder if we'd be better served by addi" [dns] - 10https://gerrit.wikimedia.org/r/270144 (owner: 10BBlack) [10:10:11] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 629 [10:12:25] 10Ops-Access-Requests, 6operations: Add Riccardo to ops mailing lists - https://phabricator.wikimedia.org/T126432#2021854 (10Joe) Riccardo is now subscribed. [10:12:42] 10Ops-Access-Requests, 6operations: Onboarding of Riccardo Coccioli - https://phabricator.wikimedia.org/T126425#2021858 (10Joe) [10:12:44] 10Ops-Access-Requests, 6operations: Add Riccardo to ops mailing lists - https://phabricator.wikimedia.org/T126432#2021855 (10Joe) 5Open>3Resolved a:3Joe [10:19:42] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: puppet fail [10:20:11] RECOVERY - check_mysql on db1008 is OK: Uptime: 2054515 Threads: 3 Questions: 13638521 Slow queries: 13798 Opens: 4972 Flush tables: 2 Open tables: 402 Queries per second avg: 6.638 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:20:51] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [10:25:33] (03PS3) 10Giuseppe Lavagetto: Rationalize services definitions for labs too. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/269955 [10:25:35] (03PS3) 10Giuseppe Lavagetto: Configure redis LockManager in both DCs, use the master everywhere. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266514 [10:25:37] (03PS3) 10Giuseppe Lavagetto: Add references to wmfServices for Cirrusearch. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266512 (https://phabricator.wikimedia.org/T114273) [10:25:39] (03PS3) 10Giuseppe Lavagetto: Use wmfMasterDatacenter for picking the master redis config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266513 (https://phabricator.wikimedia.org/T114273) [10:25:41] (03PS5) 10Giuseppe Lavagetto: Reduce poolcounter configuration complexity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266511 (https://phabricator.wikimedia.org/T114273) [10:25:43] (03PS14) 10Giuseppe Lavagetto: Define service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) [10:26:02] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: puppet fail [10:27:04] most recent puppet commit broke puppet on the masters [10:27:19] (03CR) 10jenkins-bot: [V: 04-1] Configure redis LockManager in both DCs, use the master everywhere. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266514 (owner: 10Giuseppe Lavagetto) [10:27:28] (03CR) 10jenkins-bot: [V: 04-1] Add references to wmfServices for Cirrusearch. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/266512 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [10:27:39] (03CR) 10jenkins-bot: [V: 04-1] Use wmfMasterDatacenter for picking the master redis config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266513 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [10:27:41] https://gerrit.wikimedia.org/r/#/c/268684/ [10:27:43] because [10:27:57] Duplicate declaration: File[/var/lib/puppet] is already declared in file /etc/puppet/modules/puppetmaster/manifests/ssl.pp:38; cannot redeclare at /etc/puppet/modules/base/manifests/puppet.pp:100 [10:30:02] ema: around? ^^ [10:30:07] apergos: oh yes, sorry about that [10:30:46] the goal there was just to change the permissions of /var/lib/puppet [10:31:40] might want to pull out the file declaration to a class that can be included in both [10:31:55] <_joe_> to a place where it makes sense, yes [10:35:11] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 694 [10:35:31] (03PS1) 10Muehlenhoff: Cherrypick 64193c8218540499984cd63cda41f3cd491f3f59 from the 1.0.2 branch to fix spurious log messages if SSL clients quit during the SSL handshake [debs/openssl] - 10https://gerrit.wikimedia.org/r/270257 (https://phabricator.wikimedia.org/T126616) [10:36:52] (03CR) 10Muehlenhoff: [C: 032 V: 032] Cherrypick 64193c8218540499984cd63cda41f3cd491f3f59 from the 1.0.2 branch to fix spurious log messages if SSL clients quit during the SSL ha [debs/openssl] - 10https://gerrit.wikimedia.org/r/270257 (https://phabricator.wikimedia.org/T126616) (owner: 10Muehlenhoff) [10:38:40] _joe_, apergos: or maybe just remove the declaration in modules/puppetmaster/manifests/ssl.pp? 
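For readers following the "Duplicate declaration" error above: Puppet allows a given resource such as File[/var/lib/puppet] to be declared only once per catalog, so two classes that both declare it cannot be applied to the same host. A minimal sketch of the "pull it out to a class that can be included in both" suggestion follows. The class name, owner, group, and mode are assumptions made for illustration; they are not taken from the actual patch (270263).

```puppet
# Hypothetical shared class: the name and all attribute values below are
# illustrative assumptions, not the contents of the real fix.
class base::puppet_libdir {
    file { '/var/lib/puppet':
        ensure => directory,
        owner  => 'puppet',   # assumed
        group  => 'puppet',   # assumed
        mode   => '0751',     # assumed; the log only says "change the permissions"
    }
}

# Both modules/base/manifests/puppet.pp and
# modules/puppetmaster/manifests/ssl.pp would then drop their own
# file { '/var/lib/puppet': ... } block and use this line instead:
include base::puppet_libdir
```

Because `include` is idempotent, the class can be included from any number of places while the file resource itself is still declared exactly once. The alternative raised in the channel, simply removing one of the two declarations, also avoids the error but leaves the remaining declaration as the only place the permissions are managed.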
[10:39:20] <_joe_> ema: sorry, I'm trying to understand what is wrong with my tests :/ [10:40:11] RECOVERY - check_mysql on db1008 is OK: Uptime: 2055715 Threads: 2 Questions: 13666472 Slow queries: 13817 Opens: 4973 Flush tables: 2 Open tables: 403 Queries per second avg: 6.648 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:48:22] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: puppet fail [10:48:39] (03PS4) 10Giuseppe Lavagetto: Configure redis LockManager in both DCs, use the master everywhere. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266514 [10:48:41] (03PS4) 10Giuseppe Lavagetto: Add references to wmfServices for Cirrusearch. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266512 (https://phabricator.wikimedia.org/T114273) [10:48:43] (03PS4) 10Giuseppe Lavagetto: Use wmfMasterDatacenter for picking the master redis config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266513 (https://phabricator.wikimedia.org/T114273) [10:49:35] (03CR) 10jenkins-bot: [V: 04-1] Configure redis LockManager in both DCs, use the master everywhere. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266514 (owner: 10Giuseppe Lavagetto) [10:49:41] (03CR) 10jenkins-bot: [V: 04-1] Add references to wmfServices for Cirrusearch. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/266512 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [10:49:56] (03CR) 10jenkins-bot: [V: 04-1] Use wmfMasterDatacenter for picking the master redis config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266513 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [10:54:21] 6operations, 5Patch-For-Review, 7user-notice: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#2021924 (10elukey) Stopping the work due to a latency regression spotted by Ori: https://phabricator.wikimedia.org/T126700 [11:03:26] 6operations, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, 6Services, and 2 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2021943 (10akosiaris) Well, if you are almost done with syncing all packages to Debian, no. I 'd say it's worth to... [11:12:32] PROBLEM - Apache HTTP on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:12:42] PROBLEM - HHVM rendering on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:15:32] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:16:02] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 499 bytes in 0.185 second response time [11:16:11] RECOVERY - HHVM rendering on mw1146 is OK: HTTP OK: HTTP/1.1 200 OK - 71144 bytes in 0.403 second response time [11:16:50] (03CR) 10Hashar: contint: Use slave-scripts/bin/php wrapper script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) (owner: 10Legoktm) [11:17:12] RECOVERY - puppet last run on mw1146 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [11:21:29] (03PS1) 10Ema: Fix multiple definition of /var/lib/puppet [puppet] - 
10https://gerrit.wikimedia.org/r/270263 [11:23:24] apergos, _joe_: ^ [11:25:16] !log deploying required filtering before adding adywiki to labs [11:28:33] !log updated cp1008 to openssl 1.0.2f-1~wmf3 [11:31:09] 6operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2021997 (10mark) [11:36:16] (03PS5) 10Giuseppe Lavagetto: Configure redis LockManager in both DCs, use the master everywhere. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266514 [11:36:18] (03PS5) 10Giuseppe Lavagetto: Add references to wmfServices for Cirrusearch. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266512 (https://phabricator.wikimedia.org/T114273) [11:36:20] (03PS5) 10Giuseppe Lavagetto: Use wmfMasterDatacenter for picking the master redis config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266513 (https://phabricator.wikimedia.org/T114273) [11:36:52] !log uploaded openssl 1.0.2f-1~wmf3 for jessie-wikimedia to carbon [11:38:47] 6operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2022027 (10Joe) @jcrespo correlation does not mean causation, and in fact I am not sure at all that the two events are causally linked. I proposed to stop the reima... [11:42:27] 6operations, 10Traffic, 5Patch-For-Review: openssl-1.0.2f introduced minor bug with nginx - https://phabricator.wikimedia.org/T126616#2022038 (10MoritzMuehlenhoff) 1.0.2f-1~wmf3 has been built and uploaded to carbon. I've updated cp1008 to that version already. So far seems all fine. But since traffic to 100... 
[11:46:21] PROBLEM - NTP on multatuli is CRITICAL: NTP CRITICAL: No response from NTP server [11:48:05] !log reboot ms-be2003 to pick up new disk in the right order T125200 [11:48:08] 6operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T125200#2022047 (10Stashbot) {nav icon=file, name=Mentioned in SAL, href=https://tools.wmflabs.org/sal/log/AVLVTzl6W8txF7J0uR5v} [2016-02-12T11:48:05Z] reboot ms-be2003 to pick up new disk in the... [11:50:43] I've stopped/started ntp on multatuli, but that means that the fix from https://phabricator.wikimedia.org/rOPUP823499e7f52eb8ab57f58869ed2c630f7a5e4b2d is not sufficient ... [11:51:41] RECOVERY - NTP on multatuli is OK: NTP OK: Offset 0.0001215934753 secs [11:57:50] 6operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T125200#2022064 (10fgiunchedi) 5Open>3Resolved complete [12:07:40] (03CR) 10JanZerebecki: [C: 031] Basic "Identifiers" statement section config for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263046 (https://phabricator.wikimedia.org/T123112) (owner: 10Thiemo Mättig (WMDE)) [12:09:00] 6operations, 10MediaWiki-Interface, 10Traffic: Purge pages cached with mobile editlinks - https://phabricator.wikimedia.org/T125841#2022074 (10Aklapper) p:5Unbreak!>3Normal [12:09:26] 6operations, 5Patch-For-Review, 7user-notice: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#2022082 (10Johan) OK, a couple of questions as this was tagged with user-notice (thanks, Legoktm!): *) Will this affect Wikimedians who won't find out any other way? Who nee... [12:13:06] 6operations, 10hardware-requests, 5Patch-For-Review: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#2022101 (10fgiunchedi) status update, restbase1007-a has been bootstrapped and the old node removed via `nodetool removenode`. note this has involved a... 
[12:21:48] !log ongoing conversion on db1024, expect some lag (depooled, downtimed) [12:24:17] (03PS6) 10Krinkle: [DONT MERGE] Set $wgResourceBasePath to "/w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268715 (https://phabricator.wikimedia.org/T99096) [12:25:41] 6operations: mw2173 mystery install - https://phabricator.wikimedia.org/T126694#2022139 (10fgiunchedi) see also {T124408}, got a new disk recently it seems [12:25:46] 6operations, 5Patch-For-Review, 7user-notice: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#2022141 (10Joe) To be very clear: we don't know for sure (or at all) if the latency spike was due to this change. [12:32:33] (03PS2) 10Ema: Fix multiple definition of /var/lib/puppet [puppet] - 10https://gerrit.wikimedia.org/r/270263 [12:35:17] (03PS1) 10Krinkle: Make $wgLocalStylePath the same as $wgStylePath [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270264 [12:38:41] 6operations: ntp restart sometimes unrealiable - https://phabricator.wikimedia.org/T126733#2022152 (10MoritzMuehlenhoff) [12:42:16] (03PS3) 10Ema: Fix multiple definition of /var/lib/puppet [puppet] - 10https://gerrit.wikimedia.org/r/270263 [12:43:30] (03PS2) 10Filippo Giunchedi: swiftrepl: name-based filter for objects [software] - 10https://gerrit.wikimedia.org/r/269387 (https://phabricator.wikimedia.org/T125791) [12:43:38] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swiftrepl: name-based filter for objects [software] - 10https://gerrit.wikimedia.org/r/269387 (https://phabricator.wikimedia.org/T125791) (owner: 10Filippo Giunchedi) [12:44:35] (03PS7) 10Krinkle: [DONT MERGE] Set $wgResourceBasePath to "/w" for group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268715 (https://phabricator.wikimedia.org/T99096) [12:44:37] (03PS2) 10Krinkle: Make $wgLocalStylePath the same as $wgStylePath [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270264 [12:44:55] (03CR) 10ArielGlenn: [C: 031] Fix multiple 
definition of /var/lib/puppet [puppet] - 10https://gerrit.wikimedia.org/r/270263 (owner: 10Ema) [12:48:15] 6operations, 6Services, 10Trebuchet: `git deploy service restart` asked for sudo password - https://phabricator.wikimedia.org/T126359#2022177 (10hashar) [12:51:18] jynus: also 503's? :O [12:53:22] 7Blocked-on-Operations, 10Beta-Cluster-Infrastructure, 6Discovery, 6Release-Engineering-Team, and 2 others: Beta: submodule update reverts new portals commits - https://phabricator.wikimedia.org/T126061#2022187 (10hashar) 5Open>3Resolved I am not sure why beta cluster is being used for live hack. The p... [12:53:26] (03CR) 10Ema: [C: 032 V: 032] Fix multiple definition of /var/lib/puppet [puppet] - 10https://gerrit.wikimedia.org/r/270263 (owner: 10Ema) [12:54:07] 6operations, 10Salt, 6Services, 10Trebuchet: `git deploy service restart` asked for sudo password - https://phabricator.wikimedia.org/T126359#2022190 (10ArielGlenn) p:5Triage>3Normal a:3ArielGlenn [12:54:15] meta down? [12:54:28] Still getting 503's everywhere [12:54:32] No. But I got a few 503s, though not consistently here. 
[12:55:13] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [12:55:22] yikes [12:55:22] PROBLEM - Apache HTTP on mw1237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:55:30] yikes [12:55:31] PROBLEM - Apache HTTP on mw1252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:55:32] PROBLEM - Apache HTTP on mw1251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:55:32] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [12:55:41] PROBLEM - HHVM rendering on mw1252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:57:04] RECOVERY - Apache HTTP on mw1237 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 498 bytes in 0.028 second response time [12:57:11] RECOVERY - Apache HTTP on mw1252 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 498 bytes in 0.028 second response time [12:57:11] RECOVERY - Apache HTTP on mw1251 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 498 bytes in 0.056 second response time [12:57:12] RECOVERY - HHVM rendering on mw1252 is OK: HTTP OK: HTTP/1.1 200 OK - 70654 bytes in 0.118 second response time [12:57:16] 7Blocked-on-Operations, 10Beta-Cluster-Infrastructure, 6Discovery, 6Release-Engineering-Team, and 2 others: Beta: submodule update reverts new portals commits - https://phabricator.wikimedia.org/T126061#2022201 (10hashar) [12:57:33] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [12:58:41] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [12:59:23] RECOVERY - puppet last run on strontium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:00:19] (03PS2) 10Krinkle: redis: declare /var/run/redis [puppet] - 10https://gerrit.wikimedia.org/r/268598 (owner: 10Ori.livneh) 
[13:00:43] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [13:01:28] !log restart thumbs swiftrepl, auth token expired T125791 [13:01:32] 6operations, 7Availability, 5Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: swiftrepl replication pass for thumbnails eqiad -> codfw - https://phabricator.wikimedia.org/T125791#2022209 (10Stashbot) {nav icon=file, name=Mentioned in SAL, href=https://tools.wmflabs.org/sal/log/AVLVkmsehQaf1... [13:02:23] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [13:04:21] hi [13:06:08] https://grafana.wikimedia.org/dashboard/db/varnish-http-errors - shows the 5xx spike (second graph) [13:07:38] A lot of 503's. :O [13:07:52] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:07:53] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:08:01] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:08:03] There's also some turbulence here: http://graphite.wikimedia.org/render/?width=925&height=556&_salt=1455230924.675&from=-3hours&target=timeShift%28varnish.esams.text.frontend.request.client.method.get.sum%2C%221d%22%29&target=varnish.esams.text.frontend.request.client.method.get.sum [13:11:42] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:12:02] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:16:32] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [13:22:51] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [13:27:40] (03PS1) 
10Alexandros Kosiaris: otrs: Replace logo_bg_wmf.png with transparent one [puppet] - 10https://gerrit.wikimedia.org/r/270273 (https://phabricator.wikimedia.org/T125912) [13:27:42] (03PS1) 10Alexandros Kosiaris: otrs: Update loginlogo [puppet] - 10https://gerrit.wikimedia.org/r/270274 (https://phabricator.wikimedia.org/T125911) [13:34:16] 6operations: setup/deploy oresrdb1001-oresrdb1002 - https://phabricator.wikimedia.org/T125562#2022267 (10Cmjohnson) [13:34:18] 6operations, 10ops-eqiad: Update Label for oresrdb1001 (WMF4577) & relocate and update label for oresrdb1002 (WMF4578) - https://phabricator.wikimedia.org/T125565#2022265 (10Cmjohnson) 5Open>3Resolved Moved oresrdb1002 to A4/u26. Added to asw-a ge-4/0/32 removed from asw-d ge-3/0/12. racktables updated. [13:34:30] (03PS3) 10Alexandros Kosiaris: otrs: Remove the otrs cron file [puppet] - 10https://gerrit.wikimedia.org/r/269703 [13:35:15] things slow again? [13:36:22] (03PS5) 10Cmjohnson: admin: add shell users for frack pentest [puppet] - 10https://gerrit.wikimedia.org/r/268722 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [13:37:24] (03CR) 10Alexandros Kosiaris: [C: 032] otrs: Remove the otrs cron file [puppet] - 10https://gerrit.wikimedia.org/r/269703 (owner: 10Alexandros Kosiaris) [13:37:55] (03PS2) 10Alexandros Kosiaris: otrs: Replace logo_bg_wmf.png with transparent one [puppet] - 10https://gerrit.wikimedia.org/r/270273 (https://phabricator.wikimedia.org/T125912) [13:38:02] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] otrs: Replace logo_bg_wmf.png with transparent one [puppet] - 10https://gerrit.wikimedia.org/r/270273 (https://phabricator.wikimedia.org/T125912) (owner: 10Alexandros Kosiaris) [13:38:14] (03CR) 10jenkins-bot: [V: 04-1] admin: add shell users for frack pentest [puppet] - 10https://gerrit.wikimedia.org/r/268722 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [13:38:16] (03PS2) 10Alexandros Kosiaris: otrs: Update loginlogo [puppet] - 
10https://gerrit.wikimedia.org/r/270274 (https://phabricator.wikimedia.org/T125911) [13:38:23] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] otrs: Update loginlogo [puppet] - 10https://gerrit.wikimedia.org/r/270274 (https://phabricator.wikimedia.org/T125911) (owner: 10Alexandros Kosiaris) [13:43:52] PROBLEM - puppet last run on mw2029 is CRITICAL: CRITICAL: Puppet has 1 failures [13:44:56] 6operations, 10OTRS, 5Patch-For-Review: Upload AgentLogo file to OTRS skins directory - https://phabricator.wikimedia.org/T125912#2022291 (10akosiaris) 5Open>3Resolved File uploaded, width changed to 700px, definitely looks better. Thanks. Re-resolving. [13:45:18] yay [13:45:27] 6operations, 10OTRS, 5Patch-For-Review: Upload AgentLoginLogo file to OTRS skins directory - https://phabricator.wikimedia.org/T125911#2022293 (10akosiaris) File uploaded, height set to 100px, definitely looks better. Thanks! Re-resolving [13:45:40] 6operations, 10OTRS, 5Patch-For-Review: Upload AgentLoginLogo file to OTRS skins directory - https://phabricator.wikimedia.org/T125911#2022294 (10akosiaris) 5Open>3Resolved [13:51:22] (03CR) 10Thcipriani: [C: 031] make scap::target use the scap3 package provider [puppet] - 10https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [13:54:27] (03PS1) 10Jgreen: set rDNS for civi service IPs to civicrm.wm.o to match cert for SMTP/TLS [dns] - 10https://gerrit.wikimedia.org/r/270277 [13:56:03] !log ms-be1008 replacing /dev/sdd slot 3 T126627 [13:56:05] 6operations, 10ops-eqiad: ms-be1008.eqiad.wmnet: slot=3 dev=sdd failed - https://phabricator.wikimedia.org/T126627#2022301 (10Stashbot) {nav icon=file, name=Mentioned in SAL, href=https://tools.wmflabs.org/sal/log/AVLVxGHWhQaf1CQcCdyP} [2016-02-12T13:56:03Z] ms-be1008 replacing /dev/sdd slot 3 T12... 
[14:02:44] 6operations, 10ops-eqiad: ms-be1008.eqiad.wmnet: slot=3 dev=sdd failed - https://phabricator.wikimedia.org/T126627#2022306 (10Cmjohnson) Replaced the disk, but when I attempted to add LD 3 back, VD 13 was created. Command i used megacli -CfgLdAdd -r0[32:3] -a0 [14:04:13] (03CR) 10BBlack: [C: 032] misc-web: reduce TTLs from 1H to 600 [dns] - 10https://gerrit.wikimedia.org/r/270144 (owner: 10BBlack) [14:05:34] !log replaced failed disk db1021 [14:05:41] RECOVERY - puppet last run on ms-be1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:06:57] (03PS1) 10Muehlenhoff: Add ferm rules for maps/cassandra [puppet] - 10https://gerrit.wikimedia.org/r/270280 [14:07:51] (03PS1) 10BBlack: All DYNA standardized to TTL=600 [dns] - 10https://gerrit.wikimedia.org/r/270281 [14:08:05] 6operations, 10ops-eqiad: db1021 degraded RAID - https://phabricator.wikimedia.org/T126451#2022308 (10Cmjohnson) Disk is rebuilding [14:10:51] RECOVERY - puppet last run on mw2029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:09] (03CR) 10Yurik: [C: 04-1] "We do use thrift protocol, but not in production:" [puppet] - 10https://gerrit.wikimedia.org/r/270280 (owner: 10Muehlenhoff) [14:28:53] (03PS1) 10BBlack: move most of esams to standard layout [dns] - 10https://gerrit.wikimedia.org/r/270285 [14:31:02] !log upgrading openssl on cp1068 [14:33:54] 6operations, 10Traffic, 5Patch-For-Review: openssl-1.0.2f introduced minor bug with nginx - https://phabricator.wikimedia.org/T126616#2022318 (10BBlack) I put this on one real text server as a canary. All it's really done in practice is replace one spammy log message with another: Now it's: ``` 2016/02/12... 
[14:37:54] (03PS1) 10Jcrespo: Reducing db1072 weight a bit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270287 [14:39:02] (03CR) 10Jcrespo: [C: 032] Reducing db1072 weight a bit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270287 (owner: 10Jcrespo) [14:41:16] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Reducing db1072 weight a bit (duration: 01m 16s) [14:42:19] morebots died a couple of hours ago [14:44:45] ah [14:45:02] will restart it https://wikitech.wikimedia.org/wiki/Morebots [14:45:44] restarted [14:46:33] (03PS2) 10Muehlenhoff: Add ferm rules for maps/cassandra [puppet] - 10https://gerrit.wikimedia.org/r/270280 [14:46:47] 6operations, 5Patch-For-Review, 7user-notice: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#2022336 (10elukey) Hi @Johan, I am going to try to answer your questions: > *) Will this affect Wikimedians who won't find out any other way? Who need to know, and what is... [14:48:27] (03PS6) 10Cmjohnson: admin: add shell users for frack pentest [puppet] - 10https://gerrit.wikimedia.org/r/268722 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [14:50:45] (03PS7) 10Cmjohnson: admin: add shell users for frack pentest [puppet] - 10https://gerrit.wikimedia.org/r/268722 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [14:52:49] (03CR) 10Cmjohnson: [C: 032] admin: add shell users for frack pentest [puppet] - 10https://gerrit.wikimedia.org/r/268722 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [14:59:22] 6operations, 10Traffic, 5Patch-For-Review: openssl-1.0.2f introduced minor bug with nginx - https://phabricator.wikimedia.org/T126616#2022389 (10MoritzMuehlenhoff) Hmm, I think this needs to be raised with nginx upstream: ngx_ssl_shutdown() has the following comment: /* SSL_shutdown() never returns -1, on e... 
[14:59:28] (03PS3) 10Cmjohnson: admin: add akumar, mnoushad to pentesters, bastionly [puppet] - 10https://gerrit.wikimedia.org/r/268823 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [14:59:34] (03PS2) 10Jgreen: set rDNS for civi service IPs to civicrm.wm.o to match cert for SMTP/TLS [dns] - 10https://gerrit.wikimedia.org/r/270277 [15:00:49] (03PS1) 10Filippo Giunchedi: swiftrepl: minimal README with instructions [software] - 10https://gerrit.wikimedia.org/r/270296 [15:01:30] (03PS4) 10Cmjohnson: admin: add akumar, mnoushad to pentesters, bastionly [puppet] - 10https://gerrit.wikimedia.org/r/268823 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [15:02:53] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swiftrepl: minimal README with instructions [software] - 10https://gerrit.wikimedia.org/r/270296 (owner: 10Filippo Giunchedi) [15:03:36] (03PS1) 10Filippo Giunchedi: swiftrepl: example configuration file [software] - 10https://gerrit.wikimedia.org/r/270297 [15:03:44] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swiftrepl: example configuration file [software] - 10https://gerrit.wikimedia.org/r/270297 (owner: 10Filippo Giunchedi) [15:04:12] (03CR) 10Jgreen: [C: 032 V: 031] set rDNS for civi service IPs to civicrm.wm.o to match cert for SMTP/TLS [dns] - 10https://gerrit.wikimedia.org/r/270277 (owner: 10Jgreen) [15:06:34] !log flip barium rdns and mta hostname from barium.wm.o to civicrm.wm.o [15:07:14] (03CR) 10jenkins-bot: [V: 04-1] admin: add akumar, mnoushad to pentesters, bastionly [puppet] - 10https://gerrit.wikimedia.org/r/268823 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [15:10:26] (03PS1) 10Jgreen: change civicrm.wm.o from a cname to an A record [dns] - 10https://gerrit.wikimedia.org/r/270298 [15:11:14] sjoerddebruin unrelated. 
Depooled == not part of production [15:11:23] jynus: ah nvm then [15:12:58] (03CR) 10Jgreen: [C: 032 V: 031] change civicrm.wm.o from a cname to an A record [dns] - 10https://gerrit.wikimedia.org/r/270298 (owner: 10Jgreen) [15:13:59] (03PS3) 10Muehlenhoff: Add ferm rules for maps/cassandra [puppet] - 10https://gerrit.wikimedia.org/r/270280 [15:14:33] 7Blocked-on-Operations, 10OTRS: Upgrade otrs.TicketExport2Mbox.pl up to OTRS 4 and 5 standards. - https://phabricator.wikimedia.org/T126745#2022413 (10akosiaris) 3NEW [15:20:19] (03PS1) 10Alexandros Kosiaris: otrs.TicketExport2Mbox.pl up to OTRS 4/5 standards [puppet] - 10https://gerrit.wikimedia.org/r/270299 (https://phabricator.wikimedia.org/T126745) [15:27:57] For debugging my puppet patch, I wrote 2 tests (in the elasticsearch module). They are awful and ugly, but they exist. Should I commit them with the module? I did not see many puppet modules with tests, so not sure if there is a standard test structure / framework / ... [15:31:14] !log restoring compactor thread count to 10 on restbase1002.eqiad [15:31:53] gehel: I'd say yes, commit even if ugly, better than no tests [15:32:38] 12.24 -!- morebots [tools.more@208.80.155.213] has quit [Ping timeout: 260 seconds] [15:32:42] still dead [15:33:28] godog: As far as I can see, rspecs of individual modules are not run during CI, so I should not break anything by adding my tests. Correct?
[15:34:22] gehel: I believe so yeah, jenkins will -1 if something breaks anyways [15:36:45] 6operations, 10Traffic, 5Patch-For-Review: openssl-1.0.2f introduced minor bug with nginx - https://phabricator.wikimedia.org/T126616#2022504 (10MoritzMuehlenhoff) Reported upstream at https://trac.nginx.org/nginx/ticket/901 [15:39:32] 6operations, 7Availability, 5Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: swiftrepl replication pass for thumbnails eqiad -> codfw - https://phabricator.wikimedia.org/T125791#2022509 (10fgiunchedi) one week later that's ~119M files in, the top 100 requested sizes look like this ``` 18... [15:40:00] (03PS10) 10Gehel: Ship Elasticsearch logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) [15:41:44] (03CR) 10jenkins-bot: [V: 04-1] Ship Elasticsearch logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) (owner: 10Gehel) [15:48:47] (03PS11) 10Gehel: Ship Elasticsearch logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) [15:52:08] (03CR) 10Yurik: [C: 031] Add ferm rules for maps/cassandra [puppet] - 10https://gerrit.wikimedia.org/r/270280 (owner: 10Muehlenhoff) [15:53:58] !log disabling puppet on labcontrol1001 [15:56:18] (03PS4) 10Andrew Bogott: Switch keystone to mysql assignment from ldap. [puppet] - 10https://gerrit.wikimedia.org/r/268325 (https://phabricator.wikimedia.org/T115029) [15:59:36] \O/ [15:59:45] (03CR) 10Andrew Bogott: [C: 032] Switch keystone to mysql assignment from ldap. 
[puppet] - 10https://gerrit.wikimedia.org/r/268325 (https://phabricator.wikimedia.org/T115029) (owner: 10Andrew Bogott) [15:59:51] everything is merged andrewbogott [16:00:01] andrewbogott: so since labs / openstack will not be responding, the operations/puppet CI jobs will not report back [16:00:04] andrewbogott krenair jynus: Respected human, time to deploy Labs/Wikitech maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160212T1600). Please do the needful. [16:00:14] so I guess you will want to force merge your changes [16:00:25] hashar: yeah, I think we have it down to just one patch that needs that. [16:00:44] andrewbogott: you can manually check the status of other jobs on https://integration.wikimedia.org/zuul/ though [16:01:34] once the api comes back up, Nodepool will/should magically work again :D [16:01:54] (03PS1) 10Muehlenhoff: Add ferm rule for graphite/labs web service [puppet] - 10https://gerrit.wikimedia.org/r/270306 [16:03:20] Krenair: I’ve disabled logins and I’m running the job to log everyone out now. [16:03:48] I’ll stop keystone meanwhile so Jynus can do the backup. [16:03:57] jynus: keystone is stopped, go ahead and run the backup [16:04:10] keystone only? [16:04:29] jynus: that’s the only one I’m worried about [16:04:37] but it can’t hurt to do everything openstack if that’s easy [16:04:59] keystone is done [16:05:06] ok [16:05:10] last backup of all is from 2 days ago [16:05:26] and I can roll it forward easily at will [16:05:31] that’s fine [16:05:56] Logging everyone out will take 5-10 minutes, I think we should maybe just pause until that finishes [16:06:30] please ping me *in the future* for that ticket you recently created about HA [16:06:45] jynus: ok! 
[16:09:19] hm, maybe I will fix the logbot while waiting [16:09:47] (03PS7) 10Giuseppe Lavagetto: mediawiki: Rewrite /w/{skins,resources,extensions} to /w/static.php [puppet] - 10https://gerrit.wikimedia.org/r/268802 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [16:10:34] for your tranquility: https://phabricator.wikimedia.org/P2610 [16:10:40] morebots, you ok? [16:10:40] I am a logbot running on tools-exec-1210. [16:10:41] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [16:10:41] To log a message, type !log . [16:11:30] I restarted morebots a few hours ago [16:12:28] hashar: I think irc had a split-brain yesterday? Maybe it’s ongoing. [16:13:14] (03CR) 10Giuseppe Lavagetto: [C: 032] "LGTM, should have almost no impact." [puppet] - 10https://gerrit.wikimedia.org/r/268802 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [16:13:41] morebots, you ok? [16:13:41] I am a logbot running on tools-exec-1210. [16:13:41] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [16:13:42] To log a message, type !log . [16:13:45] Wo, fancy [16:14:19] It’s not as fancy as it seems, it has the same answer no matter what you ask it :) [16:14:28] andrewbogott: yeah on netsplit the bot tends to lose track :/ [16:14:29] morebots [16:14:29] I am a logbot running on tools-exec-1210. [16:14:29] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [16:14:29] To log a message, type !log . [16:14:31] andrewbogott, how's that logout going? [16:14:37] Ah, not even comma linked [16:14:48] Krenair: still running. It’s slooooow [16:15:02] ok, no more jenkins.... [16:15:11] aude: yeah labs maintenance [16:15:19] Krenair: also the reporting isn’t sorted so I can’t tell how far along it is.
[16:15:23] it was announced, so ok [16:15:26] !log the pool of CI slaves is exhausted, no more jobs running (scheduled labs maintenance) [16:15:32] it's friday anyway :) [16:15:43] yeah nothing can go wrong [16:15:44] hashar: to log, morebots has to log in to wikitech [16:15:50] oh man [16:16:14] luckily we have a backup/second sal logger https://tools.wmflabs.org/sal/production [16:16:17] you can always go to wikitech static, and hard-code an sql query to edit it [16:16:20] and that go in. So you can !log as needed [16:16:35] (03PS12) 10Gehel: Ship Elasticsearch logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) [16:18:13] (03PS1) 10Alexandros Kosiaris: Update otrs.TicketExport2Mbox.pl help message [puppet] - 10https://gerrit.wikimedia.org/r/270310 [16:18:17] (03CR) 10Alexandros Kosiaris: [C: 032] "Did a couple of test runs, works fine. merging" [puppet] - 10https://gerrit.wikimedia.org/r/270299 (https://phabricator.wikimedia.org/T126745) (owner: 10Alexandros Kosiaris) [16:18:33] (03PS2) 10Alexandros Kosiaris: otrs.TicketExport2Mbox.pl up to OTRS 4/5 standards [puppet] - 10https://gerrit.wikimedia.org/r/270299 (https://phabricator.wikimedia.org/T126745) [16:18:38] (03CR) 10Alexandros Kosiaris: [V: 032] otrs.TicketExport2Mbox.pl up to OTRS 4/5 standards [puppet] - 10https://gerrit.wikimedia.org/r/270299 (https://phabricator.wikimedia.org/T126745) (owner: 10Alexandros Kosiaris) [16:20:55] when we moved from tin to mira, how did the git remotes for trebuchet-deployed repos get updated? [16:21:29] having that problem now in beta since we moved to a new deployment server. [16:22:00] deployment_server pillar value has been updated, but the remotes don't seem to update their git remote.origin.url before trying to run fetch.
[16:22:15] this isn’t the most exciting part [16:22:46] (03PS5) 10Alex Monk: admin: add akumar, mnoushad to pentesters, bastiononly [puppet] - 10https://gerrit.wikimedia.org/r/268823 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [16:25:29] (03PS2) 10Alexandros Kosiaris: Update otrs.TicketExport2Mbox.pl help message [puppet] - 10https://gerrit.wikimedia.org/r/270310 [16:26:24] thcipriani: in prod they fetch from local scap proxies, right? [16:26:30] 6operations, 6Performance-Team, 10Traffic, 5Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2022642 (10Gilles) firstPaint geometric mean per country, comparing Feb 11th and Feb 4th: https://docs.google.com/spreadsheets/d/1oZuFk152g-CRdVnw2aBaIAN3Z-nRciizt... [16:26:44] ok, finally finished. Krenair, I’m going to enable puppet on labcontrol1001 which will start keystone and switch over to the new backend. [16:27:35] and I’m running sync-common on silver [16:29:12] Krinkle: no I meant for things like -oids that are deployed via trebuchet, they all fetch directly from deployment server. On the first setup it clones directly from whatever the salt pillar deployment_server is set to, on subsequent fetches trebuchet just runs git fetch. [16:29:41] andrewbogott, did you pull the changes onto mira? [16:29:52] Krenair: no, forgot [16:29:53] <_joe_> thcipriani: that's why I had to write salt deploy.fixurl [16:30:00] Krenair: will you? [16:30:04] and then I’ll sync again on silver [16:30:19] _joe_: ah, nice, I'll check that out. Thanks! [16:30:40] 6operations, 6Performance-Team, 10Traffic, 5Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2022656 (10BBlack) Random horrible idea of the day: we could do some crazy hack in nginx code where we measure RTT during the initial part of the handshake and the... 
[16:30:41] <_joe_> thcipriani: I'll document it after the switch back to tin on monday [16:30:42] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures [16:31:23] andrewbogott, done [16:31:33] thanks [16:33:06] (03PS2) 10Andrew Bogott: Update admin_project_id in the keystone config [puppet] - 10https://gerrit.wikimedia.org/r/270169 (https://phabricator.wikimedia.org/T115029) [16:33:46] (03CR) 10Andrew Bogott: [C: 032 V: 032] Update admin_project_id in the keystone config [puppet] - 10https://gerrit.wikimedia.org/r/270169 (https://phabricator.wikimedia.org/T115029) (owner: 10Andrew Bogott) [16:33:52] (03PS3) 10Andrew Bogott: Update admin_project_id in the keystone config [puppet] - 10https://gerrit.wikimedia.org/r/270169 (https://phabricator.wikimedia.org/T115029) [16:34:00] (03CR) 10Andrew Bogott: [V: 032] Update admin_project_id in the keystone config [puppet] - 10https://gerrit.wikimedia.org/r/270169 (https://phabricator.wikimedia.org/T115029) (owner: 10Andrew Bogott) [16:34:24] 7Blocked-on-Operations, 10OTRS, 5Patch-For-Review: Upgrade otrs.TicketExport2Mbox.pl up to OTRS 4 and 5 standards. - https://phabricator.wikimedia.org/T126745#2022671 (10akosiaris) 5Open>3Resolved a:3akosiaris Did a number of tests and all seems to work just fine. Starting today, our spamassasin databa... [16:37:03] 6operations, 10ops-codfw: es2011-es2020 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2022675 (10RobH) [16:38:15] Krenair: I’m running the ldap->mysql migration script now.
So far so good [16:39:05] !log purging rows from analytics-slave as requested (eventlogging database) [16:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:40:35] (03PS1) 10RobH: setting oresrdb1002.eqiad.wmnet dns entry [dns] - 10https://gerrit.wikimedia.org/r/270313 [16:41:36] (03CR) 10RobH: [C: 032] setting oresrdb1002.eqiad.wmnet dns entry [dns] - 10https://gerrit.wikimedia.org/r/270313 (owner: 10RobH) [16:44:41] 6operations: setup/deploy oresrdb1001-oresrdb1002 - https://phabricator.wikimedia.org/T125562#2022688 (10RobH) [16:48:05] RECOVERY - RAID on db1021 is OK: OK: optimal, 1 logical, 2 physical [16:49:55] Krenair: this stage /is/ alphabetized, and it’s still doing project ‘bastion' [16:50:05] of course, bastion takes about as long as everything else put together [16:50:05] :/ [16:50:15] yeah [16:51:29] Re: RAID, thanks, cmjohnson1 [16:59:32] (03CR) 10Dzahn: [C: 04-1] "http://puppet-compiler.wmflabs.org/1749/rcs1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/270107 (owner: 10Tim Landscheidt) [17:01:30] (03CR) 10Dzahn: [C: 04-1] "http://puppet-compiler.wmflabs.org/1750/ruthenium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/270105 (owner: 10Tim Landscheidt) [17:03:16] (03CR) 10Dzahn: "Hey, it's been a couple weeks again." [puppet] - 10https://gerrit.wikimedia.org/r/182141 (owner: 10AndyRussG) [17:03:17] Krenair: it’s doing ‘tools’ now, the other big one [17:03:31] \O/ [17:03:52] <_joe_> !log soft-reloading apache on half of appservers [17:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:05:54] andrewbogott: somehow nodepool manage to create instances again [17:06:10] hashar: yep, its project was imported. [17:06:13] as of like 2 minutes ago [17:06:17] it’s a good sign that it’s working :) [17:06:45] !log CI is processing jobs again. 
Nodepool instances are spawning [17:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:07:03] Krenair: import finished, I’m going to re-enable wikitech things and we can do some testing. [17:08:32] 7Puppet, 6operations: 'role' function doesn't find classess in autoload layout in manifests/role - https://phabricator.wikimedia.org/T119042#2022812 (10Dzahn) >>! In T119042#2017204, @scfc wrote: > Could you point out the patch that failed? https://gerrit.wikimedia.org/r/#/c/270105/ http://puppet-compiler.wmf... [17:10:30] and morebots is back [17:10:33] morebots: ping [17:10:33] I am a logbot running on tools-exec-1210. [17:10:33] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [17:10:33] To log a message, type !log . [17:10:45] (03CR) 10jenkins-bot: [V: 04-1] admin: add akumar, mnoushad to pentesters, bastiononly [puppet] - 10https://gerrit.wikimedia.org/r/268823 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [17:11:31] I see keystone activity on s5 [17:11:53] m5 I mean [17:12:03] thcipriani: Ah I misread. trebuchet, not scap. [17:13:05] <_joe_> !log reloading apache on all the appservers [17:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:16:12] Krenair: something is horrifyingly slow in prod that wasn’t happening on labtest. I’m going to wait until the caches are warmed up and see if it’s tolerable... [17:16:23] but it’s pretty brutal [17:16:36] :/ [17:17:26] blames SemanticMediaWiki [17:17:26] or php5.5 [17:18:29] <_joe_> andrewbogott: what is slow? [17:19:02] _joe_: wikitech page loads [17:19:09] presumably the process of determining user rights [17:19:11] <_joe_> since when? 
[17:19:28] since… the migration that I’m in the middle of [17:19:35] <_joe_> ahah ok [17:19:37] it’s not page loads per se, it’s only for logged in users [17:19:51] 6operations, 6Labs, 10Labs-Infrastructure: labservices1001 ran out of disk space - https://phabricator.wikimedia.org/T126572#2022935 (10Dzahn) a:3Andrew [17:20:06] <_joe_> I'll let you manage this then, sorry for the interruption :) [17:21:47] I am not seeing things that bad, even logged in and loading labs-management pages [17:23:11] 6operations, 6Security-Team: Use user-specific passwords for accessing EventLogging database - https://phabricator.wikimedia.org/T120532#2022961 (10Ottomata) +1 for this, we've wanted it for a while. [17:23:28] 6operations, 6Security-Team: Use user-specific passwords for accessing EventLogging database - https://phabricator.wikimedia.org/T120532#2022964 (10Ottomata) Not just for EventLogging DB, but all research/analytics MySQL DBs. [17:24:21] ottomata, I want that too, we are just facing technical and social challenges [17:24:51] :) [17:24:58] andrewbogott, seems okay to me... [17:25:04] really? [17:25:16] we are 2 here saying that [17:25:31] hm, maybe it’s just because I’m in a million projects [17:25:32] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:25:32] but I have not done any fancy stuff [17:25:35] but, I’m glad it’s working ok [17:25:46] can you two do some testing while I figure out why it’s gummed up for me? 
[17:26:09] please point of some specific url that is slow foryou [17:26:47] my English is getting worse and worse everyday [17:26:53] 7Blocked-on-Operations, 10RESTBase: Separate metrics & logs between staging and production - https://phabricator.wikimedia.org/T103124#2022990 (10GWicke) [17:27:23] jynus: simply logging in is hanging for me, forever [17:27:25] so I can’t load any page [17:27:28] oh [17:27:53] jynus: but I’m going to do some commandline tests, ignore me for the moment [17:27:58] ok [17:28:23] 7Blocked-on-Operations, 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2022993 (10GWicke) [17:30:43] 6operations, 5Patch-For-Review, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2023004 (10Dzahn) [17:30:57] ^ down to 70 [17:31:38] (03PS1) 10Hashar: mediawiki: add texlive-generic-extra [puppet] - 10https://gerrit.wikimedia.org/r/270322 (https://phabricator.wikimedia.org/T126422) [17:33:17] (03PS5) 10Dzahn: parsoid: create module, move files and templates there [puppet] - 10https://gerrit.wikimedia.org/r/269602 [17:35:46] Krenair, jynus, I’m finding a couple of things that need tweaking, but I still think we can move forward. Things looking reasonable to you? [17:36:26] andrewbogott, I noticed that I no longer appear to be able to add people as members of projects [17:36:27] not now [17:36:31] (03CR) 10Dzahn: [C: 032] "only moves files and templates around. and compiler shows no other changes http://puppet-compiler.wmflabs.org/1707/" [puppet] - 10https://gerrit.wikimedia.org/r/269602 (owner: 10Dzahn) [17:36:38] 6operations, 10RESTBase, 6Services, 3Mobile-Content-Service, 7Varnish: Enable caching for the Mobile Content Service's RESTBase public endpoints - https://phabricator.wikimedia.org/T113591#2023030 (10GWicke) This was deployed yesterday afternoon SF time. 
Even with a short TTL of one hour & beta use only,... [17:36:42] I see some php code on https://wikitech.wikimedia.org/wiki/Special:NovaSudoer [17:36:46] Krenair: the links are missing, or it fails when you try? [17:36:57] andrewbogott, actually, as members of roles [17:36:59] 6operations, 10ops-codfw: es2011-es2019 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2023038 (10Papaul) [17:36:59] it fails [17:37:23] Krenair: so, adding a projectadmin? [17:37:26] https://wikitech.wikimedia.org/w/index.php?title=Special:NovaRole&action=addmember&projectid=deployment-prep&roleid=4d8cad783d6342efa8414d7d36fbc034&returnto=Special%3ANovaProject [17:37:31] jynus: that’s probably me debugging, is it gone now? [17:37:36] You do not have permission to manage OpenStack projects and roles, for the following reason: [17:37:36] The action you have requested is limited to users in the group: cloudadmin. [17:37:38] (03CR) 10Hashar: [C: 031] "Cherry picked on integration puppetmaster." [puppet] - 10https://gerrit.wikimedia.org/r/270322 (https://phabricator.wikimedia.org/T126422) (owner: 10Hashar) [17:37:42] (03PS1) 10Papaul: ADD mgmt DNS entries for es201[1-9] Bug:T126006 [dns] - 10https://gerrit.wikimedia.org/r/270325 (https://phabricator.wikimedia.org/T126006) [17:37:57] andrewbogott, yes [17:38:01] (03CR) 10Physikerwelt: [C: 031] mediawiki: add texlive-generic-extra [puppet] - 10https://gerrit.wikimedia.org/r/270322 (https://phabricator.wikimedia.org/T126422) (owner: 10Hashar) [17:38:12] (03CR) 10Dzahn: "noop confirmed - wtp1001, wtp2001, ...." 
[puppet] - 10https://gerrit.wikimedia.org/r/269602 (owner: 10Dzahn) [17:38:19] Krenair: ok, I’ll get to that once I can… see the ui myself [17:38:31] I am not a heavy labs user, so sorry if I cannot see much aside from sanity checks [17:38:56] (03PS9) 10Dzahn: parsoid: one file per role, move to module/role [puppet] - 10https://gerrit.wikimedia.org/r/269603 [17:39:33] jynus: sanity checks should get us through most use cases. Things like changing rights and such can stay broken for a few hours. [17:39:34] !log wikibugs broken in operations and other channels [17:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:39:58] ok, then [17:42:12] PROBLEM - Host pay-lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% [17:42:14] I cannot render a book of wikitech, but I suppose I can live without that [17:42:51] Jeff_Green: ?? [17:42:53] looking [17:43:03] ask for help if needed [17:43:05] heh, i was about to ask. [17:43:22] PROBLEM - Host alnitak is DOWN: PING CRITICAL - Packet loss = 100% [17:43:41] and a pattern starts to emerge... [17:43:54] pfws ? [17:44:01] could be [17:44:13] jynus, existing issue [17:44:14] PROBLEM - Host betelgeuse is DOWN: PING CRITICAL - Packet loss = 100% [17:44:21] PROBLEM - Host bellatrix is DOWN: PING CRITICAL - Packet loss = 100% [17:44:28] seems likely [17:45:10] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 1 failures [17:45:19] RECOVERY - Host pay-lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 37.32 ms [17:45:21] is this all codfw then?
[17:45:24] it is [17:45:26] yes [17:45:26] yeah [17:45:27] RECOVERY - Host betelgeuse is UP: PING OK - Packet loss = 0%, RTA = 36.88 ms [17:45:35] all one side [17:45:35] RECOVERY - Host alnitak is UP: PING OK - Packet loss = 0%, RTA = 37.26 ms [17:45:35] node1 out of the 2 nodes has an uptime of 5 mins [17:45:43] RECOVERY - Host bellatrix is UP: PING OK - Packet loss = 0%, RTA = 36.78 ms [17:45:45] user facing seems up from here [17:45:46] pfw2 fell over I guess [17:45:58] none of this is user-facing luckily [17:46:11] it's the failover site [17:46:32] yes, I just wanted to check! [17:46:55] is there an update for the SRX that fixes this yet, I wonder? [17:46:56] JSRPD_RG_STATE_CHANGE: Redundancy-group 0 transitioned from 'secondary' to 'primary' state due to Control & Fabric links down [17:47:24] (03CR) 10Paladox: [C: 031] mediawiki: add texlive-generic-extra [puppet] - 10https://gerrit.wikimedia.org/r/270322 (https://phabricator.wikimedia.org/T126422) (owner: 10Hashar) [17:47:30] so /kernel: rdp retransmit error: No route to host (65) src 0x00000000:1155 dest 0x02100001:14088 [17:48:21] 6operations, 10ops-codfw: es2011-es2019 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2023132 (10Papaul) Switch ports information es2011 ge-1/0/9 rack B1 es2012 ge-1/0/0 rack C1 es2013 ge-1/0/4 rack D1 es2014 ge-1/0/5 rack A1 es2015 ge-1/0/1 rack C1 es2016... [17:49:02] 6operations, 10ops-codfw: es2011-es2019 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2023133 (10Papaul) [17:49:07] so the second pfw decided to commit harakiri I think after failing to talk to the first 1 [17:49:14] (03PS1) 10Alex Monk: Update interwiki.php for adywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270327 (https://phabricator.wikimedia.org/T125501) [17:49:24] Krenair: can you create/delete instances? [17:49:47] Krenair: and most importantly, can you create a new instance and then log in to it?
[17:49:53] akosiaris: that's a peculiar high availability implementation [17:49:59] (03CR) 10Alex Monk: [C: 032] Update interwiki.php for adywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270327 (https://phabricator.wikimedia.org/T125501) (owner: 10Alex Monk) [17:50:17] i'm not surprised that we lost contact with the hosts that are direct connected to the one that initially fell over, that at least is expected [17:50:20] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 1 failures [17:50:34] (03Merged) 10jenkins-bot: Update interwiki.php for adywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270327 (https://phabricator.wikimedia.org/T125501) (owner: 10Alex Monk) [17:51:00] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Puppet has 1 failures [17:51:28] greg-g: we have a patch to wmf.13 to add some logging that I would like to deploy sometime today to start gathering data -- https://gerrit.wikimedia.org/r/#/c/270240 [17:52:00] * greg-g looks [17:52:18] greg-g: we don't know what the baseline for this new measurement is so it won't be immediately actionable but it will give us some details if there are reports of more session mixups [17:52:25] bd808: interesting, cool, yeah [17:53:02] bd808: I guess after andrewbogott is done [17:53:11] *nod* and thanks [17:55:08] Feb 12 17:38:42 LCC: Transition from Secondary to Primary, Restarting [17:55:10] RECOVERY - check_puppetrun on payments2002 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:55:36] Jeff_Green: ^ so, seems like it makes sense to kill "the service" when taking over ... [17:55:45] !log krenair@mira Synchronized wmf-config/interwiki.php: https://gerrit.wikimedia.org/r/#/c/270327/ (duration: 01m 18s) [17:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:56:00] well makes sense to the pfws... 
not sure how these things work tbg [17:56:05] tbh* [17:56:34] akosiaris: maybe "the service" is the watcher that's polling the active pfw to see if it's still active [17:57:14] no by "the service" I was referring to the connectivity of connected clients (that is boxes like pay-lvs2002) [17:57:50] ah maybe boxes that are connected to the pfw of the pair that just died [17:58:10] so, the watcher for some reason failed to connect to the primary and decided to promote the pfw [17:58:14] stashbot: refresh [17:58:28] but why the transition requires a restart... that I do not know [17:58:53] note that primary to secondary requires a restart as well. at least that's what the logs tell me [17:59:12] jouncebot: refresh [17:59:14] I refreshed my knowledge about deployments. [17:59:19] jouncebot: next [17:59:19] In 0 hour(s) and 30 minute(s): Debug logging enhancements (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160212T1830) [18:00:03] bd808: does stashbot need a refresh or was that a typo? [18:00:10] typo [18:00:13] * greg-g nods [18:00:23] E_TOOMANYBOTS [18:00:25] akosiaris: I'll put in a phabricator ticket to check for an OS fix [18:00:38] ok [18:00:39] bd808: you're too blame! [18:00:46] too? ugh, moar coffee [18:00:57] that too ;) [18:03:55] (03CR) 10Tim Landscheidt: "The relevant puppet-compiler diff seems to be:" [puppet] - 10https://gerrit.wikimedia.org/r/270107 (owner: 10Tim Landscheidt) [18:04:34] andre__: are you by chance one of the folks who triage adding folks to #project-creators? [18:04:49] I see that you have added folks in the past, and there are a couple of requests (one of them pinged me) [18:05:05] robh, aren't you a phabricator admin? you should be processing them too [18:05:08] While I'm in that group, it is only for operations specific project creation.
[18:05:27] It wasn't given to me to admin the entirety of the phabricator instance as it has policies that I may not be aware of [18:05:34] I gave it to myself for ops stuff ;] [18:05:37] (03CR) 10Filippo Giunchedi: [C: 031] "a little naming bikeshed but LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/270306 (owner: 10Muehlenhoff) [18:05:44] Krenair: and im not an admin [18:05:55] I'm in the #project creators group but that is not the same. [18:06:13] yes you are: https://phabricator.wikimedia.org/project/profile/5/#9318 [18:06:15] robh: I think any admin can handle that... I'm aware of the requests but haven't gotten there yet. thanks for the ping [18:06:15] and i often use the admin console (via terminal recovery) to do operations specific admin tasks, but I dont go into how they run things... [18:06:33] andre__: oh, if its cool for me to then i can add them both but i wasnt sure if there was an official process [18:06:57] I don't like to assume I can do things just because the software will let me. [18:07:20] Krenair: ha, well... i wasnt aware that chad had made me an admin [18:07:24] =P [18:07:46] robh: when adding I normally just point out the rules again as shown on top of https://phabricator.wikimedia.org/T706 - and it's up to your judgement if the usecase sounds good enough :D [18:07:56] thanks :) [18:08:04] cool, ill review them both now then, thanks for the clarification [18:08:38] ostriches: damn you, you just made it so i have more work, curse youuuuuuu ;p [18:08:52] though i guess i dont have to go dropping to admin console anymore... thatll save time.
[18:09:21] only members of that phabricator group can manage project-creators membership [18:09:44] 6operations, 6Labs: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2023269 (10dduvall) [18:09:49] !log krenair@mira Synchronized php-1.27.0-wmf.13/extensions/WikimediaMaintenance/dumpInterwiki.php: https://gerrit.wikimedia.org/r/#/c/270328/ (duration: 01m 16s) [18:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:10:03] (03CR) 10Alexandros Kosiaris: "nope. unfortunately you just stumbled upon a known issue. It's namespace collision. Having role:: namespace defined by both the module and" [puppet] - 10https://gerrit.wikimedia.org/r/270107 (owner: 10Tim Landscheidt) [18:10:09] robh: Hmm? Iron? [18:10:12] Krenair: re https://phabricator.wikimedia.org/T125501 and https://wikitech.wikimedia.org/wiki/Add_a_wiki, are you updating the docs with your findings? [18:10:45] ostriches: just you added me to a group to admin project creators was joking ;] [18:10:55] ah lol [18:11:03] greg-g, so I'm thinking we should maybe have a separate page from the interwiki cache stuff [18:11:04] maybe [18:11:06] idk [18:11:18] I might just fix up add_a_wiki and leave it at that [18:11:22] robh: On that subject....we got approval yesterday for iron :) [18:11:35] approval for wha? [18:12:07] (I have so many open requests I can no longer keep them all in my head, I have to use workboards) [18:12:15] (03PS6) 10Cmjohnson: admin: add akumar, mnoushad to pentesters, bastiononly [puppet] - 10https://gerrit.wikimedia.org/r/268823 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [18:12:36] robh: https://phabricator.wikimedia.org/T123132 [18:12:46] Krenair, jynus: I sent a tepid ‘all clear’ email. I’m going to keep bug-hunting for the rest of the day, but so far it feels like we can go forward rather than reverting the migration. [18:12:48] Thank you for your help! 
[18:13:11] andrewbogott, sorry, forgot about your messages earlier re: creating/deleting instances [18:13:22] (03CR) 10jenkins-bot: [V: 04-1] admin: add akumar, mnoushad to pentesters, bastiononly [puppet] - 10https://gerrit.wikimedia.org/r/268823 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [18:13:23] Krenair: yeah, documentation organization is hard, yo [18:13:29] was busy with other things [18:13:48] Krenair: btw, the testlabs project is a unique case, for reasons that I’m looking at. [18:13:57] So don’t sweat it if you see bad behavior in that project in particular. [18:14:18] ‘testlabs’ is the canary case where project id != project name [18:14:29] I'm not actually in testlabs andrewbogott [18:14:40] greg-g, fixed add_a_wiki [18:14:46] ty [18:14:50] I didn’t think so, just warning you that it is even more broken [18:15:03] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Requests for addition to the #project-creators group (in comments) - https://phabricator.wikimedia.org/T706#2023301 (10RobH) >>! In T706#1984525, @Danny_B wrote: > Please add me to #project-creators group. I would like to cleanup the #tracking bugs b... [18:15:35] ostriches: lead, not iron! [18:15:40] i was wondering what you were talking about [18:15:47] 'why is chad talking about the ops bastion?' [18:16:06] I'll get this spun up for you today =] [18:16:16] 6operations, 10Gerrit, 10hardware-requests: Need spare server to upgrade/migrate gerrit - https://phabricator.wikimedia.org/T123132#2023303 (10RobH) a:5mark>3RobH [18:17:12] 6operations: setup/deploy oresrdb1001-oresrdb1002 - https://phabricator.wikimedia.org/T125562#2023304 (10RobH) [18:17:19] or "iron" as in "metal" as in "a box" [18:18:03] 6operations: setup/deploy oresrdb1001-oresrdb1002 - https://phabricator.wikimedia.org/T125562#2023306 (10RobH) a:5RobH>3akosiaris Both systems are now installed and ready for service implementation. 
Since Alex was the initial hw-request author, I'm assuming he would handle the service implementation. If t... [18:18:21] oh [18:18:32] well, they are steel not iron.... [18:18:40] im gonna be pedantic, my name is robh ;] [18:18:53] either way yay approved and you'll have it shortly [18:18:56] yay new gerrit box [18:19:26] That's the problem with using the common elements :p [18:19:30] Imma miss ytterbium [18:19:36] * bd808 sends news to Twitter that Hell has indeed frozen over [18:19:53] greg-g, so I think the remaining item from adywiki creation is the search issue [18:20:08] 6operations, 10Gerrit, 10hardware-requests: Need spare server to upgrade/migrate gerrit - https://phabricator.wikimedia.org/T123132#2023310 (10Paladox) Would gerrit be upgraded in the process of migrating to a new server. [18:20:14] eiximenis was the best server name ever. [18:20:52] (03PS1) 10Andrew Bogott: Designate needs to refer to 'testlabs' by id rather than by name. [puppet] - 10https://gerrit.wikimedia.org/r/270329 [18:21:11] (03PS2) 10Muehlenhoff: Add ferm rule for graphite/labs web service [puppet] - 10https://gerrit.wikimedia.org/r/270306 [18:21:28] (03CR) 10Andrew Bogott: [C: 032 V: 032] Designate needs to refer to 'testlabs' by id rather than by name. [puppet] - 10https://gerrit.wikimedia.org/r/270329 (owner: 10Andrew Bogott) [18:22:27] ugh [18:22:31] * Krenair is still not caught up with email for today [18:22:42] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago [18:22:45] andre__: Currently here? 
[18:24:32] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:27:32] 6operations, 10Gerrit, 10hardware-requests: setup/deploy server lead as jessie gerrit server - https://phabricator.wikimedia.org/T126794#2023323 (10RobH) 3NEW a:3RobH [18:27:39] (03PS7) 10Cmjohnson: admin: add akumar, mnoushad to pentesters, bastiononly [puppet] - 10https://gerrit.wikimedia.org/r/268823 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [18:27:39] 6operations, 10Gerrit: setup/deploy server lead as jessie gerrit server - https://phabricator.wikimedia.org/T126794#2023323 (10RobH) [18:27:43] (03PS1) 10Andrew Bogott: That last patch overcorrected... nova still calls testlabs 'testlabs' apparently. [puppet] - 10https://gerrit.wikimedia.org/r/270330 [18:29:00] (03CR) 10jenkins-bot: [V: 04-1] admin: add akumar, mnoushad to pentesters, bastiononly [puppet] - 10https://gerrit.wikimedia.org/r/268823 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [18:30:05] bd808 tgr anomie: Dear anthropoid, the time has come. Please deploy Debug logging enhancements (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160212T1830). [18:30:42] (03CR) 10Alex Monk: [C: 032] Add CirrusSearch-production.php to noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270143 (owner: 10Reedy) [18:30:53] oh, a window [18:30:55] * Krenair will be quick [18:31:17] 6operations, 10Gerrit, 10hardware-requests: Need spare server to upgrade/migrate gerrit - https://phabricator.wikimedia.org/T123132#2023341 (10demon) Yes, that's what the "upgrade" part of the task title means. [18:31:34] (03Merged) 10jenkins-bot: Add CirrusSearch-production.php to noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270143 (owner: 10Reedy) [18:31:49] hmm, i can't manage the workboard of project creators. 
i even logged out and in [18:32:08] (03PS2) 10Andrew Bogott: Use hiera values rather than hard-coded names for project in designate.conf. [puppet] - 10https://gerrit.wikimedia.org/r/270330 [18:32:09] 6operations, 10Gerrit, 10hardware-requests: Need spare server to upgrade/migrate gerrit - https://phabricator.wikimedia.org/T123132#2023342 (10Paladox) Ok. [18:32:15] (03CR) 10Cmjohnson: [C: 032 V: 032] admin: add akumar, mnoushad to pentesters, bastiononly [puppet] - 10https://gerrit.wikimedia.org/r/268823 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [18:32:41] Danny_B, only phabricator admins can do that [18:32:56] Krenair: no rush. I jsut wanted to "official" [18:32:58] why? no idea [18:33:31] (03CR) 10Andrew Bogott: [C: 032] Use hiera values rather than hard-coded names for project in designate.conf. [puppet] - 10https://gerrit.wikimedia.org/r/270330 (owner: 10Andrew Bogott) [18:33:36] Krenair: Members of #phabricator should be allowed to do this [18:33:45] (03PS1) 10RobH: adding lead dns entries [dns] - 10https://gerrit.wikimedia.org/r/270331 [18:33:51] Luke081515, yep, they are [18:33:52] because the edit policys says, that there are able to edit this project [18:33:59] ah, that's a bit inconsistent, that project members can't manage the workboard of their own project... ;-) [18:34:22] Danny_B: Which kind of clomuns do you wanted to add? [18:34:24] Danny_B, a few projects have this [18:34:33] !log krenair@mira Synchronized docroot/noc: https://gerrit.wikimedia.org/r/#/c/270143/ (duration: 01m 15s) [18:34:34] I think a "at discussion" column will be useful ;) [18:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:34:59] (03CR) 10Tim Landscheidt: "Same here; the change doesn't touch role::ntp." 
[puppet] - 10https://gerrit.wikimedia.org/r/270105 (owner: 10Tim Landscheidt) [18:35:25] Luke081515: first of all i'd like to split to three - backlog, new projects, converting tracking bugs [btw check your mailbox] [18:35:45] (03CR) 10RobH: [C: 032] adding lead dns entries [dns] - 10https://gerrit.wikimedia.org/r/270331 (owner: 10RobH) [18:36:30] andre__: whenever you'll have a bit of time, please consider allowing project creators to manage their workboard. thanks [18:37:20] Krenair: all clear on mira? [18:37:22] bd808, yep [18:37:27] coolio [18:37:45] had some fun with that .git/objects permission error :( [18:37:48] do we have a task about that? [18:37:51] (03PS1) 10RobH: Revert "adding lead dns entries" [dns] - 10https://gerrit.wikimedia.org/r/270332 [18:37:56] Danny_B: The problem: If members can edit their own projects, they can add members too [18:38:15] Krenair: not that I've seen but it's been mentioned on irc several times now [18:38:23] At this is not wanted, I guess [18:38:25] * Krenair will open one [18:38:45] (03CR) 10RobH: [C: 032] Revert "adding lead dns entries" [dns] - 10https://gerrit.wikimedia.org/r/270332 (owner: 10RobH) [18:38:45] bd808, err... at least, I would if I still had the console log showing the issue [18:38:50] oops [18:38:59] damn it [18:39:03] 7Blocked-on-Operations, 10Beta-Cluster-Infrastructure, 6Discovery, 6Release-Engineering-Team, and 2 others: Beta: submodule update reverts new portals commits - https://phabricator.wikimedia.org/T126061#2023357 (10ksmith) @hashar: I'm a bit out of the technical loop, but my understanding is that the portal... [18:39:06] sorry [18:39:15] has jenkins/zuul woken up yet? 
[18:39:20] yes [18:39:34] *nod* I was just being impatient [18:39:35] (03PS8) 10Cmjohnson: admin: add akumar, mnoushad to pentesters, bastionly [puppet] - 10https://gerrit.wikimedia.org/r/268823 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [18:40:19] Luke081515: ah, thought such rights are separated [18:40:48] Danny_b: At the moment not. If you can edit a project, you can add and remove members and change policy [18:40:54] otoh there could be policy created such as "you are technically able to add members, but don't do that, only foo, bar and baz can do it" [18:41:11] I guess we should discuss new columns before [18:41:20] because you never can delete a column [18:41:29] (03PS1) 10RobH: setting lead production dns entries [dns] - 10https://gerrit.wikimedia.org/r/270333 [18:42:26] (03CR) 10RobH: [C: 032] setting lead production dns entries [dns] - 10https://gerrit.wikimedia.org/r/270333 (owner: 10RobH) [18:42:42] Luke081515: repurpose (rename) or hide... ;-) [18:43:12] (03Abandoned) 10Cmjohnson: admin: add akumar, mnoushad to pentesters, bastionly [puppet] - 10https://gerrit.wikimedia.org/r/268823 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [18:43:30] Krenair: I’ve verified that fundamental access is correct (if I add someone to a project they can ssh, if I remove them they can’t) [18:43:37] So we’re in a safe, if lousy, state. [18:43:46] Which means I’m going to lunch, and then will hack on this more. [18:43:58] I’ll add you to whatever patches shake out of this later. It’ll probably be caching fixes mostly. [18:44:07] And project name/id confusion [18:44:45] ok [18:44:47] grrr... 
qunit failed randomly for my merge [18:46:25] (03CR) 10Filippo Giunchedi: [C: 031] Add ferm rule for graphite/labs web service [puppet] - 10https://gerrit.wikimedia.org/r/270306 (owner: 10Muehlenhoff) [18:47:21] (03PS1) 10Cmjohnson: Adding akumar, mnoushad to bastion only and perf-roots group [puppet] - 10https://gerrit.wikimedia.org/r/270337 [18:48:17] (03PS1) 10RobH: setting lead install parameters [puppet] - 10https://gerrit.wikimedia.org/r/270338 [18:48:27] (03CR) 10jenkins-bot: [V: 04-1] Adding akumar, mnoushad to bastion only and perf-roots group [puppet] - 10https://gerrit.wikimedia.org/r/270337 (owner: 10Cmjohnson) [18:48:47] Luke081515: still here for a few minutes. What's up? [18:49:22] (03CR) 10RobH: [C: 032] setting lead install parameters [puppet] - 10https://gerrit.wikimedia.org/r/270338 (owner: 10RobH) [18:49:31] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 60.87% of data above the critical threshold [5000000.0] [18:49:50] andre__: I want to ask you, if it's ok for you, if I would clean up the backlog of subtasks at T555. Since I got the permissions to create the herald rules for that too, I can do it, but first I want to ask you, if this is ok for you. [18:50:14] Luke081515, would be awesome and very welcome! [18:50:34] Luke081515, (though not sure if everybody asked for Herald rules, would have to check that) [18:51:38] (03PS2) 10Cmjohnson: Adding akumar, mnoushad to bastion only and perf-roots group [puppet] - 10https://gerrit.wikimedia.org/r/270337 [18:51:57] andre__: Ok, I will to that. (But I will create a herald rule for my project, if it's ok) [18:52:00] *do [18:52:56] Thanks for the help! [18:53:18] where do I put files to publish them on people.wikimedia.org these days? [18:53:44] I remember that moved to a VM somewhere, but I don't remember where [18:53:49] mutante: ^ ? 
[18:53:56] bd808: it's called rutherfordium.eqiad.wmnet [18:54:01] thank [18:54:03] *s [18:54:04] yw [18:54:45] (03PS3) 10Cmjohnson: Adding akumar, mnoushad to bastion only and perf-roots group [puppet] - 10https://gerrit.wikimedia.org/r/270337 [18:56:30] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: puppet fail [18:57:26] (03CR) 10Cmjohnson: [C: 032] "commit message says perf-roots...supposed to be pentesters GID 768" [puppet] - 10https://gerrit.wikimedia.org/r/270337 (owner: 10Cmjohnson) [19:00:40] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [19:03:06] (03CR) 10Tim Landscheidt: "Ah, I missed @Dzahn's parallel reply at T119042. If your explanation is correct, I personally won't author that big patch because it woul" [puppet] - 10https://gerrit.wikimedia.org/r/270107 (owner: 10Tim Landscheidt) [19:03:41] * bd808 is still waiting on Jenkins for his backport [19:05:43] 6operations: Can't access analytics-store.eqiad.wmnet from stat1003 - https://phabricator.wikimedia.org/T126800#2023554 (10Halfak) 3NEW [19:09:11] merged! [19:11:57] ostriches: while I'm here on mira should I backport that undef index fix? [19:12:07] fatalmonitor is spammy with it [19:12:12] That'd be great. [19:12:15] And thx for the fix on that [19:12:34] !log bd808@mira Synchronized php-1.27.0-wmf.13/includes/DefaultSettings.php: Log multiple IPs using the same session or the same user account (4d8b8ca) (duration: 01m 16s) [19:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:14:09] !log bd808@mira Synchronized php-1.27.0-wmf.13/includes/Setup.php: Log multiple IPs using the same session or the same user account (4d8b8ca) (duration: 01m 18s) [19:14:34] wikipedia has gone down for me https://en.wikipedia.org/ [19:14:36] shit. 
synced in wrong order [19:14:41] Request from 10.20.0.104 via cp1065 cp1065 ([10.64.0.102]:3128), Varnish XID 1730353932 [19:14:41] Forwarded for: 81.140.246.2, 10.20.0.104, 10.20.0.104, 10.20.0.104 [19:14:41] Error: 503, Service Unavailable at Fri, 12 Feb 2016 19:14:22 GMT [19:14:44] 503's yep [19:14:47] wikitech empty main page. er? [19:14:48] will be fixed in 2 minutes [19:14:49] anyways [19:15:04] uh oh, api is throwing lots of 503s [19:15:12] !log Synced files for T125455 in wrong order; broke all wikis [19:15:26] the fix is syncing now :/ [19:15:31] 7Blocked-on-Operations, 3Scap3: include refreshCdbJsonFiles in scap's debian package - https://phabricator.wikimedia.org/T126660#2023621 (10mmodell) [19:15:44] !log bd808@mira Synchronized php-1.27.0-wmf.13/includes/session/SessionManager.php: Log multiple IPs using the same session or the same user account (4d8b8ca) (T125455) (duration: 01m 17s) [19:15:47] better? [19:15:49] twentyafterfour: ready for a merge of the phab deploy change? [19:15:52] 7Blocked-on-Operations, 3Scap3: include refreshCdbJsonFiles in scap's debian package - https://phabricator.wikimedia.org/T126660#2023624 (10mmodell) a:3fgiunchedi [19:15:58] bd808: back for me [19:16:17] its back up now. [19:16:26] Thanks for fixing the problem. [19:16:28] sorry everyone. brain fart from me [19:16:29] 7Blocked-on-Operations, 3Scap3: rebuild scap debian package (we forgot to include refreshCdbJsonFiles) - https://phabricator.wikimedia.org/T126660#2023626 (10mmodell) [19:16:35] woah [19:16:39] we really ought to stop breaking everything at once [19:16:40] was that tested yet in beta, mutante / twentyafterfour? 
[19:16:51] thanks for bringing wikibugs back, whoever did it [19:16:51] wait, which patch is it [19:16:55] !log Wikis back up thankfully [19:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:16:59] apergos: yes tested [19:17:05] okay, I'm late :) [19:17:10] mutante: yes I would love to get that stuff merged [19:17:22] is this the scap -> apt patch ? [19:17:25] anyone still seeing broking wikis? [19:17:32] *broken [19:17:32] nope [19:17:41] twentyafterfour: let's do it now.. but i'll wait because i saw the broken wiki log [19:17:47] mutante, twentyafterfour [19:17:57] nothing like breaking the internet to get your heart going in the morning [19:18:05] apergos: yeah we tested scap's apt package in beta quite thoroughly, we found one problem, the package doesn't include the rebuildCdbJson script [19:18:08] That's the second spike of today. :) https://grafana.wikimedia.org/dashboard/db/varnish-http-errors?panelId=17&fullscreen [19:18:12] apergos: yes [19:18:23] ok, why not wait for folks to finish their deploys first [19:18:24] mutante: https://phabricator.wikimedia.org/T126660 [19:18:25] just a thought [19:18:42] before changing the deploy mechanism out form under them. just in case [19:18:50] bd808: should we have wikibugs ignore stashbot messages? [19:18:54] * twentyafterfour agrees with waiting for deploys to finish. really we need to rebuild the scap package with the latest merged change [19:18:58] since they're already on IRC... [19:18:59] apergos: yes, agree [19:19:01] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [19:19:07] +1 rebuild it [19:19:16] and retest just cause. due diligence even if it's tedious [19:19:30] (03PS1) 10RobH: changing lead's partitioning [puppet] - 10https://gerrit.wikimedia.org/r/270342 [19:19:32] I assigned that to filippo but I guess anyone in ops can build that package? 
[19:19:40] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [19:19:51] errors are still coming into fatalmontitor but I think they are rsylog buffering [19:19:52] anyone who knows how to build packages, yep [19:19:54] (03CR) 10RobH: [C: 032] changing lead's partitioning [puppet] - 10https://gerrit.wikimedia.org/r/270342 (owner: 10RobH) [19:20:57] (03PS1) 10Thcipriani: Beta: Move bastion server [puppet] - 10https://gerrit.wikimedia.org/r/270343 (https://phabricator.wikimedia.org/T126377) [19:21:54] 47672ms to load main page of wikitech :(( [19:22:02] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [19:22:05] twentyafterfour: if he doesn't, see if I'm around tomorrow and I'll poke at it [19:22:10] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [1000.0] [19:22:16] did you assign it to him after checking with him? 
[19:22:32] fatalmonitor is not seeing more errors now [19:22:46] apergos: I just assigned it to him because he built it for us last time [19:22:46] I will write an incident report certainly [19:22:54] ah [19:23:03] eyah unassign it then, leave him as a subscriber if you like [19:23:45] just make sure the new deb files or whatever are in the repo for the build [19:23:52] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:24:04] apergos: I merged the change and pushed it to scap's repo [19:24:10] ok awesome [19:24:18] so yeah unassign him, asdd me as subscriber too if you don't mind [19:24:25] I'll see about it this weekend prolly [19:24:29] or monday worst case [19:24:29] 7Blocked-on-Operations, 3Scap3: rebuild scap debian package (we forgot to include refreshCdbJsonFiles) - https://phabricator.wikimedia.org/T126660#2023662 (10mmodell) a:5fgiunchedi>3None [19:25:34] (03PS2) 10Alex Monk: Beta: Move deployment server [puppet] - 10https://gerrit.wikimedia.org/r/270343 (https://phabricator.wikimedia.org/T126377) (owner: 10Thcipriani) [19:25:45] thanks, twentyafterfour [19:25:50] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [19:26:03] (03PS3) 10Alex Monk: Beta: Move deployment server [puppet] - 10https://gerrit.wikimedia.org/r/270343 (https://phabricator.wikimedia.org/T126377) (owner: 10Thcipriani) [19:26:05] shoudl probably get a deploy window for this change actually. greg-g? [19:26:11] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 65.22% of data above the critical threshold [5000000.0] [19:26:50] (03CR) 10Jcrespo: "Lag detention is broken in mediawiki, as throughout the code there are several custom methods created to detect lag. 
I need to expand on t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: 10Aaron Schulz) [19:26:51] !log bd808@mira Synchronized php-1.27.0-wmf.13/extensions/Disambiguator/Disambiguator.hooks.php: Check for array index existence (7b5f87f) (T126651) (duration: 01m 15s) [19:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:27:27] apergos: I'm done as soon as I confirm that fatalmonitor looks good [19:27:28] 7Blocked-on-Operations, 3Scap3: rebuild scap debian package (we forgot to include refreshCdbJsonFiles) - https://phabricator.wikimedia.org/T126660#2023690 (10mmodell) The package got tested fairly extensively on beta: we replaced deployment-bastion with a fresh deployment instance, deployment-tin, and went th... [19:27:43] bd808: s'ok we need to do other stuff first [19:27:55] new package, testing... [19:29:30] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:29:31] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:29:31] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:29:37] so it will be monday or later. just don't know how to get on the calendar [19:29:46] yay recovery [19:30:03] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:30:20] read-only slave lag lock on enwitionary [19:30:35] known issue? jynus ^ [19:30:41] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:30:49] yes [19:30:57] cool, thanks [19:31:04] ignore for now, I am investigaating it/writing about it [19:31:09] no user impact [19:31:45] 6operations: Can't access analytics-store.eqiad.wmnet from stat1003 - https://phabricator.wikimedia.org/T126800#2023701 (10jcrespo) Is this recent? 
If it is (I have not touched analytics slave recently), it could be related to firewall deployment. "Lost connection to MySQL server at 'reading authorization packet... [19:32:30] the new Gerrit is in Flint's water supply [19:32:47] lol [19:33:08] don't drink the gerrit kool-aide [19:34:49] 6operations: Can't access analytics-store.eqiad.wmnet from stat1003 - https://phabricator.wikimedia.org/T126800#2023728 (10jcrespo) I cannot reproduce: ``` root@stat1003:~$ mysql -h analytics-store.eqiad.wmnet -e "SELECT 1" +---+ | 1 | +---+ | 1 | +---+ ``` [19:35:46] twentyafterfour: i guess we don't have to if there is enough _diffusion_ to water it down [19:35:53] lol [19:37:11] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [19:38:24] (03CR) 10Dzahn: [C: 04-1] "grmbl .. http://puppet-compiler.wmflabs.org/1751/" [puppet] - 10https://gerrit.wikimedia.org/r/269603 (owner: 10Dzahn) [19:40:00] (03CR) 10Dzahn: "what the hell "Error: Role class role::parsoid::production not found" it's right there!" 
[puppet] - 10https://gerrit.wikimedia.org/r/269603 (owner: 10Dzahn) [19:42:41] (03PS1) 10Jhobs: Enable survey at reduced sample rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270344 (https://phabricator.wikimedia.org/T125946) [19:43:51] (03PS1) 10Jforrester: VisualEditor: Don't over-use one config variable for two uses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270345 [19:43:53] (03CR) 10Dzahn: "oh, right /modules/role/_manifests_/parsoid/ is where it's at" [puppet] - 10https://gerrit.wikimedia.org/r/269603 (owner: 10Dzahn) [19:43:55] (03PS1) 10Jforrester: VisualEditor: Switch to Single Edit Tab mode on Hungarian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270346 (https://phabricator.wikimedia.org/T126801) [19:45:06] yeah you just found it mutante [19:45:27] apergos: :) [19:45:58] always compile, especially when it looks "trivial" :p [19:46:00] (03CR) 10Jforrester: [C: 04-2] "Not now. :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270346 (https://phabricator.wikimedia.org/T126801) (owner: 10Jforrester) [19:46:39] 6operations: Can't access analytics-store.eqiad.wmnet from stat1003 - https://phabricator.wikimedia.org/T126800#2023824 (10jcrespo) Can you try again? I have not changed anything myself, but there seemed to be a network problem before. [19:46:47] (03PS10) 10Dzahn: parsoid: one file per role, move to module/role [puppet] - 10https://gerrit.wikimedia.org/r/269603 [19:47:45] Krenair: https://phabricator.wikimedia.org/project/board/178/query/W7R1jcY6kEL7/ is rather telling. :-) [19:48:59] remember "shell" ? glad it's gone [19:49:32] James_F, so this board is mainly to process requests that otherwise noone would take care of [19:49:47] Yeah. 
[19:49:59] (03PS11) 10Dzahn: parsoid: one file per role, move to module/role [puppet] - 10https://gerrit.wikimedia.org/r/269603 [19:50:26] James_F, luckily, stuff assigned to you is generally pretty safe in that respect [19:51:13] when moving stuff to that column I look at whether deployers are involved, and some product managers like you [19:51:19] (03CR) 10Dzahn: [C: 032] "compiles fine now http://puppet-compiler.wmflabs.org/1752/" [puppet] - 10https://gerrit.wikimedia.org/r/269603 (owner: 10Dzahn) [19:54:11] (03CR) 10Dzahn: "confirmed noop on wtp1001, wtp2001,.." [puppet] - 10https://gerrit.wikimedia.org/r/269603 (owner: 10Dzahn) [19:54:37] Krenair: Yup. [19:54:52] Krenair: Of course, I need to not get hit by a truck. :-) [19:55:11] James_F: No dying [19:55:37] RoanKattouw: Yeah yeah. [19:56:34] well, unassignment of tasks from people departing is something that needs improvement [19:56:36] :p [19:57:10] Krenair: i though tthat unassignment is the right thing. but others disagree on that [19:57:22] and asked to please not do it [19:57:29] right, it doesn't make sense in every case [19:57:38] someone has to manually go through and review all the assigned tasks [19:58:39] review, reassign where it make sense. it can't be an automatic thing [19:58:40] my point would be that "unassigned" makes sure nobody keeps expecting the former person to work on it [19:58:41] sadl [19:58:44] y [19:59:27] whre's that status "woopsgone" ? [20:00:15] apergos: i guess.. 
technically it's "stalled" [20:00:27] yeah but stalled is for a bunch of other things [20:00:30] stalled (user gone) [20:00:42] you want someone to see the woopsgone status and go through and parcel them out [20:00:44] stalled (but maybe they come back as volunteer, you never know) [20:00:57] they won't do that on a global list of stalled tickets [20:01:17] well, the status "unassigned" should achieve the same thing [20:01:28] somebody eventually will look at it when trigaing [20:01:44] because that's what it is, not assigned to anyone right now [20:01:46] except we have tickets in ops that stay unassigned for quite some time because there is activity on them and no one owner [20:01:48] * mutante stops bikeshedding [20:02:07] shoulda woulda [20:02:26] apergos: yes, i also agree that it's _not_ a goal to get every single task assigned to a single "owner" [20:02:34] team work is better [20:02:39] sure [20:06:26] _joe_: your comments to CTO position are right on, please edit away. I only disagree on the open source vs free software [20:07:02] _joe_: not to start a flamewar here but I think that doesn't matter as much as the other things you mentioned [20:07:14] I'm with _joe_ on that one, can I ask you to explain why you prefer 'open source'? [20:07:28] just going to listen, not argue [20:07:55] apergos: because is more broad terminology, if you ask me before joining wmf what free software was i had no idea [20:08:00] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [20:08:27] apergos: not that i represent the whole world ... but i bet i am not the only one [20:09:08] I'm thinking as this is targetted at someone who comes from a deep tech background, they should know the terms and their context. 
but I hear your point [20:09:19] and I will stop there, I said I wasn't gonna argue :-) [20:09:29] ;) [20:11:47] 6operations, 10ops-codfw: es2011-es2019 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2023941 (10RobH) ``` es2011 ge-1/0/9 rack B1 es2012 ge-1/0/0 rack C1 es2013 ge-1/0/4 rack D1 es2014 ge-1/0/5 rack A1 es2015 ge-1/0/1 rack C1 es2016 ge-1/0/5 rack D1... [20:11:49] 6operations, 10ops-codfw: es2011-es2019 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2023942 (10jcrespo) I would like @Volans to at least do one full install. [20:12:22] 6operations, 10ops-codfw: es2011-es2019 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2023944 (10RobH) [20:15:21] 6operations, 10Gerrit: setup/deploy server lead as jessie gerrit server - https://phabricator.wikimedia.org/T126794#2023946 (10RobH) [20:15:26] 6operations, 10ops-codfw: es2011-es2019 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2023947 (10jcrespo) The installation recipe db currently has one bug and it is not fully unattended, here I comment the fix: https://gerrit.wikimedia.org/r/#/c/267328/ (at the bottom). [20:15:43] 6operations, 10Gerrit: setup/deploy server lead as jessie gerrit server - https://phabricator.wikimedia.org/T126794#2023948 (10RobH) a:5RobH>3demon Reassigned to Chad for service implementation. [20:17:15] finally got the flint gerrit jokes [20:18:08] (03PS4) 10Dzahn: apache: rotate logs daily, default to 30d [puppet] - 10https://gerrit.wikimedia.org/r/266480 (owner: 10ArielGlenn) [20:19:14] legoktm: just saw you question about having wikibugs ignore Stashbot. Seems reasonable to me. 
[20:20:40] (03CR) 10Dzahn: [C: 032] "only affects misc apaches where there is no specific setting in hiera already" [puppet] - 10https://gerrit.wikimedia.org/r/266480 (owner: 10ArielGlenn) [20:30:40] (03CR) 10ArielGlenn: [C: 04-1] "needs to be updated for new nova api and thoroughly tested" [puppet] - 10https://gerrit.wikimedia.org/r/211432 (owner: 10ArielGlenn) [20:30:51] (03CR) 10Dzahn: [C: 04-1] "talked with apergos, unfortunately the nova api has changed, needs major updates because of that" [puppet] - 10https://gerrit.wikimedia.org/r/211432 (owner: 10ArielGlenn) [20:30:58] hahaha [20:33:45] apergos: hey, sorry, was at an early lunch, what was the context? [20:34:04] (03CR) 10ArielGlenn: "still in the backlog but I have some other trebuchet work going on so I might get to it soon(ish)" [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/219845 (owner: 10ArielGlenn) [20:34:33] oh greg-g there's a patch to switch scap from the /srv/deployment/scap/scap copy to scap3 deb package [20:34:47] the package needs rebuilt one last time and tested in beta [20:35:00] but after that I guess we want a deployment window and a way to double check it in prod [20:35:04] so sometime next week [20:35:14] given it affects all deployment eh? [20:35:15] ah, right, yeah, good plan [20:36:47] who wins the rsyncd battle? dataset or stat? :) [20:36:54] dataset [20:37:04] i mean [20:37:09] dataset will push to stat1002 [20:37:33] ok! 
[20:37:45] I have that on my current list actually [20:38:36] current are: that one, labs to nfs, trebuchet more readable output, remove minions from redis [20:39:04] might be a couple more [20:39:20] wanna close em in the next few days [20:39:52] (03PS3) 10Dzahn: send web server logs from dataset hosts to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) (owner: 10ArielGlenn) [20:40:19] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [20:40:43] (03CR) 10Dzahn: "PS3: added dataset1001 to list in hiera to allow it to rsync to stat1002, per ottomata's comment" [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) (owner: 10ArielGlenn) [20:40:50] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 61.54% of data above the critical threshold [5000000.0] [20:42:22] (03CR) 10Dzahn: "please still see inline comments on PS2. i agree with the part to avoid the hardcoded hostname" [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) (owner: 10ArielGlenn) [20:48:01] (03CR) 10ArielGlenn: "this is still needed, I see stuff in e.g. /a/log/webrequest/archive/zero that's mch older than 90 days." [puppet] - 10https://gerrit.wikimedia.org/r/197081 (https://phabricator.wikimedia.org/T83531) (owner: 10ArielGlenn) [20:49:15] 6operations, 6Performance-Team, 10Traffic, 5Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2024033 (10Nemo_bis) > Could it be that those countries have a higher proportion of people on slow connections in the country side? There was major network disrup... 
[20:50:46] 6operations, 7Monitoring: [RFC] Alert about *when* partitions will run out of space, not a percentage/absolute number - https://phabricator.wikimedia.org/T126158#2024036 (10Dzahn) Oh yes, I'm definitely ok with the idea, i just wanted to share the experience so we could avoid these issues and improve on it. Y... [20:50:49] (03CR) 10ArielGlenn: "yes but those should be set up by doing" [puppet] - 10https://gerrit.wikimedia.org/r/219372 (owner: 10ArielGlenn) [20:51:19] (03CR) 10Dzahn: "also see https://phabricator.wikimedia.org/T126158" [puppet] - 10https://gerrit.wikimedia.org/r/193834 (owner: 10ArielGlenn) [20:51:21] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [20:52:03] (03CR) 10Dzahn: "but doesn't it need it each time a new repo is added?" [puppet] - 10https://gerrit.wikimedia.org/r/219372 (owner: 10ArielGlenn) [20:54:59] (03CR) 10Dzahn: "adding relevant eyeballs" [puppet] - 10https://gerrit.wikimedia.org/r/172700 (owner: 10ArielGlenn) [20:57:33] (03PS1) 10Gergő Tisza: Send log messages from session-ip channel to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270359 (https://phabricator.wikimedia.org/T125455) [20:59:56] (03CR) 10ArielGlenn: "yes, and when someone sets up a new repo on mira they can run that command. once. and be done." [puppet] - 10https://gerrit.wikimedia.org/r/219372 (owner: 10ArielGlenn) [21:00:39] 6operations, 10Wikimedia-Mailing-lists: import old staff list archives ? - https://phabricator.wikimedia.org/T109395#2024042 (10Nemo_bis) >>! In T109395#2019869, @Dzahn wrote: > All of them? :o And since this is an ancient staff list, what i expect is a bunch of wikimedia.org email addresses that don't exist a... 
[21:00:43] (03CR) 10ArielGlenn: [C: 04-1] "needs updating for nova version" [puppet] - 10https://gerrit.wikimedia.org/r/172700 (owner: 10ArielGlenn) [21:02:30] (03PS2) 10Gergő Tisza: Send log messages from session-ip channel to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270359 (https://phabricator.wikimedia.org/T125455) [21:03:00] (03CR) 10Dzahn: [C: 031] "makes sense, changing to +1" [puppet] - 10https://gerrit.wikimedia.org/r/219372 (owner: 10ArielGlenn) [21:08:03] (03CR) 10Anomie: [C: 031] Send log messages from session-ip channel to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270359 (https://phabricator.wikimedia.org/T125455) (owner: 10Gergő Tisza) [21:10:11] so greg-g, what will twentyafterfour need to do to get a deployment slot next week? [21:10:50] apergos: edit a wiki page? :) [21:11:23] (03PS1) 10Ema: Rename vcl_recv_purge into recv_purge [puppet] - 10https://gerrit.wikimedia.org/r/270392 (https://phabricator.wikimedia.org/T124279) [21:11:32] (03CR) 10Dzahn: "can we ask CI to add it as non-voting?" [puppet] - 10https://gerrit.wikimedia.org/r/175442 (owner: 10ArielGlenn) [21:11:58] that simple? awesome [21:12:53] 6operations, 6Labs, 5Patch-For-Review: audit labs versus production ssh keys - https://phabricator.wikimedia.org/T108078#2024073 (10Dzahn) maybe we can get this merged and ask CI team to add it as a non-voting check on operations/puppet and see how it works. then if we like it , just change non-voting to voting [21:13:14] (03CR) 10ArielGlenn: "I haven't even tested this well yet. So, please not :-)" [puppet] - 10https://gerrit.wikimedia.org/r/175442 (owner: 10ArielGlenn) [21:14:05] (03CR) 10Dzahn: "alright, so if i try to sum this up: analytics says _not_ to do this because legal and ops says to please _do_ this because legal. 
maybe n" [puppet] - 10https://gerrit.wikimedia.org/r/197081 (https://phabricator.wikimedia.org/T83531) (owner: 10ArielGlenn) [21:16:09] (03CR) 10BryanDavis: [C: 031] Send log messages from session-ip channel to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270359 (https://phabricator.wikimedia.org/T125455) (owner: 10Gergő Tisza) [21:16:22] twentyafterfour: you two have my trust to do this [21:16:26] (03CR) 10ArielGlenn: "analytics says don't do because legal says they need some of it. so... legal needs to give some info, yes." [puppet] - 10https://gerrit.wikimedia.org/r/197081 (https://phabricator.wikimedia.org/T83531) (owner: 10ArielGlenn) [21:16:28] (03CR) 10BBlack: [C: 031] Omit thread_pool_add_delay on Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269686 (https://phabricator.wikimedia.org/T126206) (owner: 10Ema) [21:16:38] (03CR) 10BBlack: [C: 031] Rename vcl_recv_purge into recv_purge [puppet] - 10https://gerrit.wikimedia.org/r/270392 (https://phabricator.wikimedia.org/T124279) (owner: 10Ema) [21:17:23] andre__: I need you again ;). Still here? [21:17:57] (03CR) 10Alex Monk: "non-voting is much less disruptive than voting for broken tests :)" [puppet] - 10https://gerrit.wikimedia.org/r/175442 (owner: 10ArielGlenn) [21:19:14] (03PS1) 10Ori.livneh: Speed trials: add no-srcset variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270407 [21:19:27] (03PS2) 10Ori.livneh: Speed trials: add no-srcset variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270407 [21:19:28] greg-g: can I do a followup to bd808's previous backport? 
[21:19:35] config needs to be updated [21:19:38] tgr: sure [21:19:38] (03CR) 10Ori.livneh: [C: 032] Speed trials: add no-srcset variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270407 (owner: 10Ori.livneh) [21:20:26] (03Merged) 10jenkins-bot: Speed trials: add no-srcset variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270407 (owner: 10Ori.livneh) [21:20:58] csteipp: Can you take a look? T122624 waits for your input ;) [21:22:16] !log ori@mira Synchronized docroot and w: Ifc5b02cba4: Speed trials: add no-srcset variant (duration: 01m 16s) [21:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:22:46] ori: need to sync a config change, am I conflicting with you? [21:22:51] tgr: nope [21:23:11] (03CR) 10Aaron Schulz: [C: 04-1] Rationalize services definitions for labs too. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269955 (owner: 10Giuseppe Lavagetto) [21:24:01] (03CR) 10Gergő Tisza: [C: 032] Send log messages from session-ip channel to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270359 (https://phabricator.wikimedia.org/T125455) (owner: 10Gergő Tisza) [21:26:24] (03Merged) 10jenkins-bot: Send log messages from session-ip channel to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270359 (https://phabricator.wikimedia.org/T125455) (owner: 10Gergő Tisza) [21:30:15] (03CR) 10Dzahn: "ok, in that case i don't think it's gonna happen and i'll remove myself" [puppet] - 10https://gerrit.wikimedia.org/r/197081 (https://phabricator.wikimedia.org/T83531) (owner: 10ArielGlenn) [21:30:42] !log tgr@mira Synchronized wmf-config/InitialiseSettings.php: T125455: log session-ip channel to logstash (duration: 01m 17s) [21:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:34:58] apergos: there is a weekly phabricator deployment on Thursday 01:00 UTC but it would be nice to deploy the puppet stuff sooner than that 
[21:35:35] well it's up to you, I might not be awake during your slot [21:35:49] but I can have the package built and handed off to you for testing well before your slot [21:36:05] Luke081515: Which part? [21:36:06] thenit's a matter of coordinating with whichever ops person, make sure they know the history [21:36:13] including the premature deployment etc [21:36:33] (03PS2) 10Dzahn: parsoid: move rt/vd roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/269707 [21:37:15] (03CR) 10jenkins-bot: [V: 04-1] parsoid: move rt/vd roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/269707 (owner: 10Dzahn) [21:39:18] csteipp: The rename part of security, I think. ACL or not [21:39:31] apergos: ok thanks [21:39:56] Luke081515: {{done}} [21:40:02] thanks :) [21:40:17] apergos: this was menat for you: 21:16 < greg-g> twentyafterfour: you two have my trust to do this ;) [21:40:25] ah [21:40:31] sorry missed that. ok! [21:40:33] * Luke081515 tries to clean up the backlog of project creators, so this makes it easier, if there is progress ;) [21:40:44] apergos: no, I mis-pinged :) [21:40:49] heh [21:41:06] 6operations, 10Analytics, 6Security, 6Zero, 7audits-data-retention: Purge > 90 days stat1002:/a/squid/archive/zero - https://phabricator.wikimedia.org/T92343#2024150 (10Dzahn) [21:41:25] 6operations, 10Analytics, 6Security, 6Zero, 7audits-data-retention: Purge > 90 days stat1002:/a/squid/archive/sampled - https://phabricator.wikimedia.org/T92342#2024152 (10Dzahn) [21:41:34] 6operations, 10Analytics, 6Security, 6Zero, and 2 others: Purge > 90 days stat1002:/a/squid/archive/mobile - https://phabricator.wikimedia.org/T92341#2024153 (10Dzahn) [21:41:46] 6operations, 10Analytics, 6Security, 7audits-data-retention: Purge > 90 days stat1002:/a/squid/archive/glam_nara - https://phabricator.wikimedia.org/T92340#2024154 (10Dzahn) [21:41:54] 6operations, 5Patch-For-Review, 7audits-data-retention: Purge > 90 days stat1002:/a/squid/archive/edits - 
https://phabricator.wikimedia.org/T92339#2024155 (10Dzahn) [21:42:06] 6operations, 10Analytics, 6Security, 7audits-data-retention: Purge > 90 days stat1002:/a/squid/archive/api - https://phabricator.wikimedia.org/T92338#2024156 (10Dzahn) [21:42:22] 6operations, 5Patch-For-Review, 7audits-data-retention: Delete gadolinium:/a/log/nginx/ - https://phabricator.wikimedia.org/T92337#2024157 (10Dzahn) [21:42:33] 6operations, 10Fundraising-Backlog, 6Security, 10fundraising-tech-ops, 7audits-data-retention: Delete gadolinium:/a/log/fundraising/ - https://phabricator.wikimedia.org/T92336#2024159 (10Dzahn) [21:42:42] 6operations, 7audits-data-retention: Delete stat1002:/a/squid/archive/teahouse - https://phabricator.wikimedia.org/T92335#2024160 (10Dzahn) [21:42:51] 6operations, 7audits-data-retention: Delete stat1002:/a/squid/archive/sampled-geocoded - https://phabricator.wikimedia.org/T92334#2024162 (10Dzahn) [21:43:03] 6operations, 7audits-data-retention: Delete stat1002:/a/squid/archive/mobile-geocoded - https://phabricator.wikimedia.org/T92333#2024165 (10Dzahn) [21:43:16] 6operations, 7audits-data-retention: Delete stat1002:/a/squid/archive/edits-geocoded - https://phabricator.wikimedia.org/T92332#2024167 (10Dzahn) [21:43:42] 6operations, 10Wikimedia-Blog, 7audits-data-retention: Delete stat1002:/a/squid/archive/blog - https://phabricator.wikimedia.org/T92331#2024169 (10Dzahn) [21:43:52] 6operations, 7audits-data-retention: Delete stat1002:/a/squid/archive/bannerImpressions - https://phabricator.wikimedia.org/T92330#2024170 (10Dzahn) [21:44:00] 6operations, 7audits-data-retention: Delete stat1002:/a/squid/archive/arabic-banner - https://phabricator.wikimedia.org/T92329#2024171 (10Dzahn) [21:44:43] 6operations, 10Analytics-EventLogging, 7audits-data-retention: Delete vanadium:/srv/eventlogging - https://phabricator.wikimedia.org/T75084#2024173 (10Dzahn) [21:44:54] 6operations, 7audits-data-retention: Delete stat1002:/a/squid/archive/sopa - 
https://phabricator.wikimedia.org/T92344#2024175 (10Dzahn) [21:46:37] (03PS4) 10Dzahn: purge webrequest logs after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/197081 (https://phabricator.wikimedia.org/T83531) (owner: 10ArielGlenn) [21:47:27] (03CR) 10Dzahn: [C: 04-2] "given the history of this, i suggest to abandon the change in gerrit for now, continue the discussion on the linked T84618 and reopen this" [puppet] - 10https://gerrit.wikimedia.org/r/197081 (https://phabricator.wikimedia.org/T83531) (owner: 10ArielGlenn) [21:47:51] (03CR) 10jenkins-bot: [V: 04-1] purge webrequest logs after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/197081 (https://phabricator.wikimedia.org/T83531) (owner: 10ArielGlenn) [21:48:06] (03CR) 10Dzahn: "because hitting "restore" is cheap ...etc" [puppet] - 10https://gerrit.wikimedia.org/r/197081 (https://phabricator.wikimedia.org/T83531) (owner: 10ArielGlenn) [22:05:35] 6operations: Can't access analytics-store.eqiad.wmnet from stat1003 - https://phabricator.wikimedia.org/T126800#2024214 (10Halfak) Just checked and I can connect now. Looks like the problem might be solved or intermittent. I'll 'resolve' this task and reopen it if the problem returns. [22:05:48] 6operations: Can't access analytics-store.eqiad.wmnet from stat1003 - https://phabricator.wikimedia.org/T126800#2024219 (10Halfak) 5Open>3Resolved [22:06:10] (03PS1) 10Chad: phabricator-admins: Also let them manage object policies [puppet] - 10https://gerrit.wikimedia.org/r/270414 [22:07:06] mutante: Got a sec for a super easy sudoers patch for phab? ^ [22:07:26] 6operations, 10Wikimedia-Mailing-lists: import old staff list archives ? - https://phabricator.wikimedia.org/T109395#2024235 (10Dzahn) Now if we had a current list of staff we might be able to compare how many of them ever posted back on this list, but i'm afraid we don't have one. 
[22:08:28] he's doing stuff I should be doing, ostriches [22:08:34] so I'll do this :-D [22:09:08] (03CR) 10ArielGlenn: [C: 032] phabricator-admins: Also let them manage object policies [puppet] - 10https://gerrit.wikimedia.org/r/270414 (owner: 10Chad) [22:10:01] apergos: Thanks :) [22:10:24] where does this need to run to go live, ostriches? [22:10:34] please don't tell me "iridium" [22:10:38] Oh, yes. [22:10:43] Is that a problem? [22:11:05] wasn't puppet disabled there? [22:11:09] Oh phooey [22:11:17] with a scary warning message? [22:11:41] The last Puppet run was at Thu Feb 4 07:22:46 UTC 2016 (11936 minutes ago). [22:11:44] it is and it still is and it's going to be that way unless [22:11:45] Otherwise nothing in motd :p [22:11:55] ostriches: you mean iridium? [22:11:58] Yes. [22:12:03] twentyafterfour has a phab deployment slot and takes a few minutes potential downtime [22:12:04] it's been in limbo [22:12:09] and does this: [22:12:16] ostriches: unless sudoers means access request?:) [22:12:18] make sure all puppet lockfiles are there [22:12:26] shutdown down apache and phd [22:12:32] enable puppet, puppet run [22:12:35] check all the phab repos [22:12:40] really check em again [22:12:47] mutante: No, this is granting another bin/ script from phab for an existing group with existing members. [22:12:51] then start apache and phd back up [22:12:53] (we already have several whitelisted) [22:13:00] ostriches: yea, that's an access request :p [22:13:17] https://gerrit.wikimedia.org/r/#/c/270414/1/modules/admin/data/data.yaml - for this? since when? 
[22:13:31] 'ALL = NOPASSWD: /srv/phab/phabricator/bin/policy' this is phab-specific [22:13:48] if it's out of policy (I don't think it is) I'll take the heat [22:14:01] I thought access requests only apply to new users [22:14:03] it's a 'fix phab config fsckups' [22:14:05] Not adding permissions to existing groups [22:14:10] no, they apply to existing users too [22:14:19] apergos: I could manually edit the sudoers file. [22:14:24] which I'm pretty sure has been done before without going through the ops meeting process [22:14:26] So when puppet does finally run, it'll just be 0 diff [22:14:35] ostriches: ifyou did that I would have to come over there and slap you [22:14:43] and the airplane tickets are waaay too expensive [22:14:54] I know [22:15:07] twentyafterfour: can you schedule a time to do the update I just mentioned? [22:15:10] does adding groups to another server (directly or via roles) require access review? [22:15:22] 5 to 15 mins outage? so we can catch up on puppet on that host? [22:15:53] apergos: yes but we need to merge https://gerrit.wikimedia.org/r/#/c/268351/ first [22:16:00] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 53.57% of data above the critical threshold [5000000.0] [22:16:00] why [22:16:14] all you're doing is getting puppet caught up then disabling again [22:16:18] (03CR) 10Chad: [C: 031] phabricator: forward the old tag system to current release/2015-11-18/1 [puppet] - 10https://gerrit.wikimedia.org/r/268351 (owner: 10Rush) [22:16:22] (03PS1) 10Rush: grub: allow setting ioscheduler [puppet] - 10https://gerrit.wikimedia.org/r/270421 (https://phabricator.wikimedia.org/T126090) [22:16:25] because I don't trust the lock files? 
[22:16:31] apergos: oh [22:16:34] that's why you have the widow [22:16:37] *window [22:16:47] so you can check tht after the pupet run all the repos are right, or fix em [22:16:49] then disable again [22:16:54] we're 7 days behind on that host [22:17:26] but you should still schedule/announce [22:17:59] apergos: I think we could leave puppet enabled if we merged https://gerrit.wikimedia.org/r/#/c/268351/ [22:18:18] but yeah you are right we could disable it again and not merge that [22:18:29] (03CR) 10jenkins-bot: [V: 04-1] grub: allow setting ioscheduler [puppet] - 10https://gerrit.wikimedia.org/r/270421 (https://phabricator.wikimedia.org/T126090) (owner: 10Rush) [22:18:29] not going to merge that right now [22:18:43] * twentyafterfour wonders where is the appropriate venue to announce [22:19:01] that I don't know [22:19:11] 6operations, 10Wikimedia-Mailing-lists: import old staff list archives ? - https://phabricator.wikimedia.org/T109395#2024283 (10Jalexander) >>! In T109395#2024235, @Dzahn wrote: > Now if we had a current list of staff we might be able to compare how many of them ever posted back on this list, but i'm afraid we... [22:19:21] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 60.87% of data above the critical threshold [5000000.0] [22:19:35] ehh... ask greg-g? :-D [22:21:05] (03PS2) 10Rush: grub: allow setting ioscheduler [puppet] - 10https://gerrit.wikimedia.org/r/270421 (https://phabricator.wikimedia.org/T126090) [22:21:56] (03PS3) 10Rush: grub: allow setting ioscheduler [puppet] - 10https://gerrit.wikimedia.org/r/270421 (https://phabricator.wikimedia.org/T126090) [22:25:09] apergos: I will do it monday? It seems major enough not to do on a friday evening? [22:25:20] yes agree [22:25:25] chad, sorry about that [22:25:28] er ostriches [22:25:32] can you live with monday ? 
[22:28:19] apergos: Yeah no rush on this at all [22:30:51] ok great [22:31:49] (03CR) 10Rush: "no change" [puppet] - 10https://gerrit.wikimedia.org/r/270421 (https://phabricator.wikimedia.org/T126090) (owner: 10Rush) [22:33:23] (03CR) 10Andrew Bogott: [C: 031] grub: allow setting ioscheduler [puppet] - 10https://gerrit.wikimedia.org/r/270421 (https://phabricator.wikimedia.org/T126090) (owner: 10Rush) [22:33:54] !log ori@mira Synchronized php-1.27.0-wmf.13/extensions/MobileFrontend/extension.json: I315628aef3: Don't use 'qlow' for NetSpeed=B (duration: 01m 16s) [22:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:35:05] (03CR) 10Rush: [C: 032] grub: allow setting ioscheduler [puppet] - 10https://gerrit.wikimedia.org/r/270421 (https://phabricator.wikimedia.org/T126090) (owner: 10Rush) [22:35:54] 6operations, 6Services: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#2024357 (10Krinkle) Node 4.x ships with its own npm. One shouldn't install "npm" from Debian as its own package. Maybe we're shadowing it somehow, due to the separate npm package overwriting the bin link? 
[22:38:41] (03CR) 10Rush: "post merge test fyi:" [puppet] - 10https://gerrit.wikimedia.org/r/270421 (https://phabricator.wikimedia.org/T126090) (owner: 10Rush) [22:39:26] and the airplane tickets are waaay too expensive heh [22:39:38] Although I'm not sure if the Friendly Spae policy will approve [22:39:41] *Space [22:40:59] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [22:43:42] well he didn't edit the file so we're all good :-D [22:44:39] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [22:48:00] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0] [22:49:11] (03PS1) 10Rush: diamond: fixup nfsd collector [puppet] - 10https://gerrit.wikimedia.org/r/270430 [22:49:13] (03PS1) 10Rush: labstore1001: persist cfq ioscheduler [puppet] - 10https://gerrit.wikimedia.org/r/270431 (https://phabricator.wikimedia.org/T126090) [22:49:33] (03Abandoned) 10Rush: diamond: fixup nfsd collector [puppet] - 10https://gerrit.wikimedia.org/r/270430 (owner: 10Rush) [22:49:45] (03Abandoned) 10Rush: labstore1001: persist cfq ioscheduler [puppet] - 10https://gerrit.wikimedia.org/r/270431 (https://phabricator.wikimedia.org/T126090) (owner: 10Rush) [22:50:47] (03PS1) 10Rush: labstore1001: persist cfq ioscheduler [puppet] - 10https://gerrit.wikimedia.org/r/270432 (https://phabricator.wikimedia.org/T126090) [22:51:30] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [23:09:17] 6operations, 10Deployment-Systems, 6Performance-Team, 10Traffic, 5Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2024490 (10Krinkle) [23:10:03] hashar: I wanted to give you a heads up, after just three days we have a couple trusty integration 
instances at already over 60% [23:10:13] disk use on that /mnt/ partition [23:10:28] so it might be worth it to have a find job that runs once a day til things settle [23:10:58] apergos: the CI slaves you mean ? [23:11:17] the old ones are usually at 85-90% disk usage yeah [23:11:36] integration-slave-trusty-1011.integration.eqiad.wmflabs for example [23:11:47] there is some work going on to migrate to generic jobs so we get less different workspaces [23:11:56] yeah 101x are old ones [23:12:05] the 100x are the one I have added this week [23:12:24] we get some monitoring alarms send to #wikimedia-releng and qa-alerts mailing list [23:12:35] three are no 100x trusty ones [23:12:39] *there [23:12:42] long term solution: move all the stuff to disposable instances [23:13:06] https://integration.wikimedia.org/ci/ on the left [23:13:21] or https://integration.wikimedia.org/ci/label/UbuntuTrusty/ [23:13:29] 6operations, 10Deployment-Systems, 6Performance-Team, 10Traffic, 5Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2024524 (10Krinkle) [23:13:37] oh they weren't there a day ago I could swear [23:13:47] all right, I'll add them to my check list for the next few days [23:13:48] thanks [23:14:04] I have added then on Thursday around 8pm UTC [23:14:08] hm [23:14:13] they will fill up [23:14:30] well I'll report back in another 3 days then :-D [23:14:51] thanks [23:15:23] I usually look at them via salt on integration-saltmaster (something like salt -v '*trusty*' cmd.run 'df -h' [23:15:35] oh and salt is very reliable nowadays [23:16:20] :-) music to my ears! [23:16:59] thx for that! 
[23:19:14] I've been running an ssh loop cause i couldn't get onto the saltmaster for whatever reason [23:19:27] only downside is it requires me to know the names :-D [23:20:01] (03PS6) 10Dzahn: parsoid::testing: use /srv instead /usr/lib [puppet] - 10https://gerrit.wikimedia.org/r/269606 [23:23:04] apergos: you have to login from labs bastion-restricted machine iirc [23:23:15] huh my loop says it can't resolve integration-slave-trusty-1005.integration.eqiad.wmflabs or integration-slave-trusty-1002.integration.eqiad.wmflabs, weird [23:23:19] labs ops folk would know [23:23:22] I was on bastion-restricted, always am [23:23:35] in fact I run my loop from there :-D [23:23:38] maybe dns is borked [23:23:46] it finds the rest [23:23:51] oh yeah dns issues recently [23:23:53] the Jenkins master reach them using the IP address [23:23:55] urgh [23:24:02] in case DNS has failure [23:24:05] smart [23:24:07] very smart [23:24:11] and apparently the DHCP always give the same ip for some reason [23:24:14] not smart [23:24:16] prudent :D [23:24:19] err [23:24:45] 6 of one half a dozen of the other, as they say [23:24:48] wise / cautious [23:25:04] (03PS7) 10Dzahn: parsoid::testing: use /srv instead /usr/lib [puppet] - 10https://gerrit.wikimedia.org/r/269606 [23:25:28] apergos: oh also one can just go to https://integration.wikimedia.org/ci/computer/ login with your ldap account and you will get disk space usage [23:25:39] on integration i always notice how i have to re-login so quickly [23:25:46] (already logged in) [23:25:54] I've had the intergration page open for days now :-D [23:25:58] most services remember me for a while, but integration have to re-login a couple times per day [23:26:02] hmm [23:26:04] not me hm [23:26:42] added to my list of "monitoring" [23:26:44] thank you [23:28:28] 6operations, 10RESTBase-Cassandra, 6Services: Evaluate Brotli compression for Cassandra - https://phabricator.wikimedia.org/T125906#2024563 (10mobrovac) [23:28:39] PROBLEM - puppet last run 
on mw2091 is CRITICAL: CRITICAL: Puppet has 1 failures [23:32:53] (03CR) 10Dzahn: "only affects the test server, http://puppet-compiler.wmflabs.org/1755/" [puppet] - 10https://gerrit.wikimedia.org/r/269606 (owner: 10Dzahn) [23:33:05] (03CR) 10Dzahn: [C: 032] "only affects the test server, http://puppet-compiler.wmflabs.org/1755/" [puppet] - 10https://gerrit.wikimedia.org/r/269606 (owner: 10Dzahn) [23:37:49] !log ruthenium - moving parsoid path, cleaning up old resources [23:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:38:40] (03PS3) 10Dzahn: parsoid: move rt/vd roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/269707 [23:39:50] arr, an nginx::site configured directly in site.pp [23:40:06] which is now causing an issue because it wasnt in a role.. [23:40:30] fixing it, affects ruthenium only [23:47:01] (03PS1) 10Yurik: Enable Kartographer ext in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270441 [23:47:57] (03PS1) 10Dzahn: parsoid-testing: fix nginx site config [puppet] - 10https://gerrit.wikimedia.org/r/270442 [23:50:43] (03PS2) 10Dzahn: parsoid-testing: fix nginx site config [puppet] - 10https://gerrit.wikimedia.org/r/270442 [23:54:27] (03PS2) 10Yurik: Enable Kartographer ext in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270441 [23:54:56] (03CR) 10Dzahn: "so.. actually it did affect all mediawiki installations, unlike originally expected. 
but nevertheless it's an ok change and luckily the re" [puppet] - 10https://gerrit.wikimedia.org/r/266480 (owner: 10ArielGlenn) [23:55:20] RECOVERY - puppet last run on mw2091 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:55:43] (03CR) 10Dzahn: [C: 032] parsoid-testing: fix nginx site config [puppet] - 10https://gerrit.wikimedia.org/r/270442 (owner: 10Dzahn) [23:56:55] (03CR) 10MaxSem: [C: 031] Enable Kartographer ext in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270441 (owner: 10Yurik) [23:59:30] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures