[00:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Evening SWAT (Max 8 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171214T0000). [00:00:05] Jhs: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:29] I'm here! [00:02:06] (03PS3) 10Dzahn: Add Icelandic dictionary for ORES on iswiki [puppet] - 10https://gerrit.wikimedia.org/r/398078 (https://phabricator.wikimedia.org/T181099) (owner: 10Awight) [00:02:18] (03CR) 10Dzahn: [C: 032] Add Icelandic dictionary for ORES on iswiki [puppet] - 10https://gerrit.wikimedia.org/r/398078 (https://phabricator.wikimedia.org/T181099) (owner: 10Awight) [00:04:46] (03CR) 10Dzahn: "on ores1001: Notice: /Stage[main]/Packages::Aspell_is/Package[aspell-is]/ensure: created" [puppet] - 10https://gerrit.wikimedia.org/r/398078 (https://phabricator.wikimedia.org/T181099) (owner: 10Awight) [00:05:44] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: ping [00:05:53] I can SWAT [00:06:02] Sorry that nobody took it [00:06:27] SWAT is also best-effort. All 12 of those people were already pinged -- please don't ping them all a second time [00:06:42] (4 of whom I know for a fact are disconnected and/or asleep right now :)) [00:06:50] there are probably too many people in that list, heh [00:06:51] <3 [00:06:57] Yeah, I shouldn't be on any of them [00:07:00] I never SWAT :p [00:07:05] (03PS3) 10Catrope: Add upload_by_url to extended uploaders on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397129 (https://phabricator.wikimedia.org/T182534) (owner: 10Jon Harald Søby) [00:07:06] MatmaRex: they all volunteered at some point! 
:) [00:07:08] (03CR) 10Catrope: [C: 032] Add upload_by_url to extended uploaders on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397129 (https://phabricator.wikimedia.org/T182534) (owner: 10Jon Harald Søby) [00:07:09] especially if they're known to be asleep at this time… ;) [00:07:13] Who needs SWAT when you can deploy WHENEVER YOU WANT [00:07:18] Reedy: shush now [00:07:39] :( [00:08:10] no_justification, i have no idea who might be awake or not :) and whoever slept through the first ping would probably sleep through the second ;) [00:08:22] Reedy: So what you're saying really is everyone needs deploy rights?! [00:08:27] That's totally how I read it :) [00:08:44] (03Merged) 10jenkins-bot: Add upload_by_url to extended uploaders on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397129 (https://phabricator.wikimedia.org/T182534) (owner: 10Jon Harald Søby) [00:12:13] RoanKattouw, looks right on wmdebug1002 [00:12:13] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: T182534 (duration: 01m 08s) [00:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:26] T182534: Add upload_by_url to Extended uploaders (Wikimedia Commons) - https://phabricator.wikimedia.org/T182534 [00:12:42] no_justification: +1. Let's just make zuul trigger a deploy after the a green merge and be done with it [00:12:44] Jhs: Thanks for testing there. I also checked there and deployed it to the entire cluster already :) [00:13:16] * bd808 went to a bunch of CI/CD talks at KubeCon last week [00:13:20] sweet :) very efficient [00:13:53] question: should i remove Patch-for-Review when something is deployed? [00:14:08] if you'd like/have a bit of OCD :) [00:14:20] greg-g, just a tiny bit ;) [00:14:39] bd808: Actually, I don't want anyone to have deploy rights. Or to deploy stuff. 
Or even write code anymore [00:14:42] I don't trust developers :p [00:14:49] re: patch-for-review https://phabricator.wikimedia.org/T95309 [00:15:08] no_justification: so you are ready to rise into management? I'll trade you straight across [00:15:32] * bd808 has had 17 meetings so far this week [00:15:46] I've had 2. That's 2 more than I wanted ;-) [00:16:47] no_justification secretly loves meetings. don't tell him I told you though [00:17:01] It's so secret I don't even know [00:17:05] hahah [00:17:46] no_justification says: "There is no justification for meetings. I love meetings, especially the ones that I don't go to." [00:18:31] suggests a meeting on how to reduce the number of meetings [00:18:41] twentyafterfour: Best meetings are canceled meetings :p [00:19:29] mutante: I've been to that meeting ;) [00:19:58] Did we ever get the ribbons with "I survived another meeting that should have been an email"? [00:21:19] * twentyafterfour needs to order "This meeting should have been an email" t-shirts for all tech managers [00:22:07] I'll take 5 of them so I then have a uniform. [00:22:58] (03CR) 10jenkins-bot: Add upload_by_url to extended uploaders on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397129 (https://phabricator.wikimedia.org/T182534) (owner: 10Jon Harald Søby) [00:28:13] bd808: hehehe [00:28:43] what were the ideas.. mandatory meeting notes checked into git ?:) [00:34:51] mutante: I think there was a suggestion of a working group to study the problem... [00:34:57] twentyafterfour: s/email/public ticket :) [00:35:58] bd808: ooh..heh! 
well i hope it's succesful [00:37:20] (03PS3) 10Dzahn: mariadb::parsercache: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/397990 (https://phabricator.wikimedia.org/T177225) [00:42:12] (03CR) 10Dzahn: "it turns out this doesn't affect pc* servers in any way, it doesn't break anything but also doesn't remove ganglia and the reason is that " [puppet] - 10https://gerrit.wikimedia.org/r/397990 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [00:45:55] (03PS1) 10Dzahn: parsercache: remove ganglia from parsercache nodes [puppet] - 10https://gerrit.wikimedia.org/r/398186 (https://phabricator.wikimedia.org/T177225) [00:53:23] is toollabs having the same issues that it had about this time yesterday? [00:53:56] I am seeing similar weird responsiveness when using Magnus's wikidata tools [00:54:33] 10Operations, 10Mail, 10Toolforge, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812#3836318 (10Legoktm) ``` From: legoktm@ To: security@tools.wmflabs.org Subject: Test - T182812 Please comment on https://phabricator.wikimedia.org/T182812... [00:55:47] (03PS2) 10Dzahn: parsercache: remove ganglia from parsercache nodes [puppet] - 10https://gerrit.wikimedia.org/r/398186 (https://phabricator.wikimedia.org/T177225) [00:57:51] 10Operations, 10Mail, 10Toolforge, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812#3836322 (10Reedy) {F11816615 size=full} I already got emails to that address... [00:58:27] 10Operations, 10Mail, 10Toolforge, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812#3836323 (10bd808) > Please comment on https://phabricator.wikimedia.org/T182812 if this email was successfully received. I think I was already getting mai... [01:00:05] twentyafterfour: #bothumor I � Unicode. All rise for Phabricator update deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171214T0100). [01:00:05] No GERRIT patches in the queue for this window AFAICS. [01:00:05] (03CR) 10Dzahn: [C: 032] parsercache: remove ganglia from parsercache nodes [puppet] - 10https://gerrit.wikimedia.org/r/398186 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [01:00:57] 10Operations, 10Mail, 10Toolforge, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812#3836326 (10bd808) ``` tools-mail.tools:/etc/exim4 bd808$ grep security /etc/aliases security: root ``` [01:09:20] (03CR) 10Dzahn: "so i had to do it like this for each hostname https://gerrit.wikimedia.org/r/#/c/398186/ but no issues with the puppet runs there.. it's" [puppet] - 10https://gerrit.wikimedia.org/r/397990 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [01:19:04] seems to me that Magnus's tools are having difficulties pulling information from the dataservers in toollabs; web interface is fine ... example https://tools.wmflabs.org/mix-n-match/#/list/61/auto [01:19:07] haaaaaangs [01:30:53] (03PS1) 10Gergő Tisza: Add PUT to list of allowed characters [puppet] - 10https://gerrit.wikimedia.org/r/398197 (https://phabricator.wikimedia.org/T182825) [01:43:41] (03PS1) 10Andrew Bogott: labsdb.zone: typo fix to the tools.db cname [puppet] - 10https://gerrit.wikimedia.org/r/398201 [01:44:29] (03CR) 10Andrew Bogott: [C: 032] labsdb.zone: typo fix to the tools.db cname [puppet] - 10https://gerrit.wikimedia.org/r/398201 (owner: 10Andrew Bogott) [01:46:08] (03CR) 10Krinkle: labsdb: Point DNS at equivalent web.db.svc.eqiad.wmflabs hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/397256 (https://phabricator.wikimedia.org/T142807) (owner: 10BryanDavis) [01:46:46] !log no phabricator deployment tonight, not enough time to prepare and test the update due to a short outage earlier this evening. 
[01:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:52:07] 10Operations, 10Reading List Service, 10Reading-Infrastructure-Team-Backlog, 10Traffic, 10Patch-For-Review: PUT blocked by Varnish - https://phabricator.wikimedia.org/T182825#3835819 (10Dzahn) related ticket that allowed PUT on the "misc" cluster: T62350 [02:05:28] (03PS2) 10Gergő Tisza: Add PUT to list of allowed methods for text varnish [puppet] - 10https://gerrit.wikimedia.org/r/398197 (https://phabricator.wikimedia.org/T182825) [02:26:18] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.11) (duration: 08m 05s) [02:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:11:33] (03CR) 10Krinkle: [C: 031] mwlog/xenon: access should be based on role, not host names [puppet] - 10https://gerrit.wikimedia.org/r/393994 (owner: 10Dzahn) [03:24:41] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 726.03 seconds [03:46:50] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 290.27 seconds [04:23:40] PROBLEM - MariaDB Slave SQL: s5 on db1097 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table dewiki.archive: Cant find record in archive, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1070-bin.001641, end_log_pos 334455268 [04:23:41] PROBLEM - MariaDB Slave SQL: s8 on db1101 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table dewiki.archive: Cant find record in archive, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1071-bin.005983, end_log_pos 728987568 [04:24:20] PROBLEM - MariaDB Slave SQL: s5 on db1096 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table 
dewiki.archive: Cant find record in archive, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1070-bin.001641, end_log_pos 334455268 [04:31:41] PROBLEM - MariaDB Slave Lag: s8 on db1101 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 626.96 seconds [04:31:41] PROBLEM - MariaDB Slave Lag: s5 on db1097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 626.96 seconds [04:32:00] PROBLEM - MariaDB Slave Lag: s5 on db1096 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 641.13 seconds [04:32:30] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1949 bytes in 0.142 second response time [04:33:31] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [04:37:14] Due to high database server lag, changes newer than 907 seconds may not be shown in this list. @ wikidata [04:37:32] (Watchlist) [04:43:00] RECOVERY - MariaDB Slave Lag: s5 on db1096 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [04:43:20] RECOVERY - MariaDB Slave SQL: s5 on db1096 is OK: OK slave_sql_state Slave_SQL_Running: Yes [04:43:43] revi, it is known, it is being fixed right now [04:43:54] kk! [04:44:02] thanks! 
[04:44:30] in fact it should be fixed already, but it may be still issues, thanks for the report [04:48:41] RECOVERY - MariaDB Slave SQL: s5 on db1097 is OK: OK slave_sql_state Slave_SQL_Running: Yes [04:48:50] RECOVERY - MariaDB Slave Lag: s5 on db1097 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [04:51:20] PROBLEM - MariaDB Slave SQL: s5 on db1096 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table dewiki.archive: Cant find record in archive, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1070-bin.001641, end_log_pos 349362942 [04:51:30] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table dewiki.archive: Cant find record in archive, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1070-bin.001641, end_log_pos 349362942 [04:51:50] PROBLEM - MariaDB Slave SQL: s5 on db1097 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table dewiki.archive: Cant find record in archive, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1070-bin.001641, end_log_pos 349362942 [04:54:50] PROBLEM - MariaDB Slave Lag: s8 on db1101 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 344.68 seconds [04:58:21] RECOVERY - MariaDB Slave SQL: s5 on db1096 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:01:20] PROBLEM - MariaDB Slave SQL: s5 on db1096 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table dewiki.archive: Cant find record in archive, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1070-bin.001641, end_log_pos 352396187 [05:03:50] PROBLEM - MariaDB Slave Lag: s8 on db1101 is CRITICAL: CRITICAL 
slave_sql_lag Replication lag: 307.23 seconds [05:04:00] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 895.52 seconds [05:04:00] anomie, your script is killing s7 [05:08:50] PROBLEM - MariaDB Slave Lag: s5 on db1097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 608.45 seconds [05:09:10] PROBLEM - MariaDB Slave Lag: s5 on db1096 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 623.03 seconds [05:09:14] jynus_: I doubt he will be online right now, it's probably safe to kill if it's causing problems [05:09:23] I am trying [05:09:52] he has like 20 self-reproducing scripts [05:11:05] hmm [05:11:09] /bin/bash tmp/cleanupUsersWithNoIds.sh s1 [05:12:17] oh, you moved it [05:12:37] who runs 20 scripts without a screen? [05:12:56] worst case you can just move the MW maintenance script I guess [05:13:20] it is ok now, as long as it doesn't run on s5 [05:14:50] RECOVERY - MariaDB Slave SQL: s5 on db1097 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:14:50] RECOVERY - MariaDB Slave Lag: s5 on db1097 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [05:15:10] RECOVERY - MariaDB Slave Lag: s5 on db1096 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [05:15:21] RECOVERY - MariaDB Slave SQL: s5 on db1096 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:15:50] RECOVERY - MariaDB Slave Lag: s8 on db1101 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [05:15:51] RECOVERY - MariaDB Slave SQL: s8 on db1101 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:17:30] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1951 bytes in 0.069 second response time [05:21:40] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [05:34:38] I happened to look at my email as I was 
heading to bed. [05:34:46] jynus_: What exactly was blowing up? Just s5? [05:35:23] "just" :-D [05:35:54] s5 is currently more sensitive because it is in ROW [05:36:47] this is actually relevant for mediawiki core, because it is a direct result of unsafe statements in the past [05:36:56] I was told by Aaron those were corrected [05:37:03] (please confirm) [05:37:15] but probably older rows are out of sync [05:37:26] the ones that failed here were from 2014 [05:38:20] it wouldn't hurt to double check unsafe queries are no longer happening and establish a timestamp of when those happened for its fix [05:42:50] jynus_: While I have you online, on db1061 I see "show explain for 528158583" is saying the query filesorts, but explaining the identical query says no filesort as I'd expect. Any ideas? [05:43:41] explain is not reliable pre-query, it is just a guide of what it could do [05:43:52] show explain for is reliable [05:44:23] that is why I focus on handlers more than explain usually, it doesn't tell the whole story [05:48:07] jynus_: Anyway, any objection to me restarting the script for enwiki? [05:48:09] I guess we'll have to wait for s5 until January when the rest of the s5 weirdness is sorted out. [05:49:37] ok [05:49:53] do not run it on s5 [05:52:29] also please use screen [05:52:48] or something I can just kill with a single command [05:53:17] also, something that will not make the queries fail if you disconnect [05:59:16] I didn't use screen because I wanted to be able to check the whole log file for errors, not just whatever scrollback is kept by screen. If you need to kill the current run for enwiki, "kill -- -10745" should do it by killing the whole process group. For s2, s6, and s7 that are still running, "kill -- -8867". I don't know what you mean by "something that will not make the queries fail if you disconnect". [06:00:57] I don't understand- if you are logging to a file, why the need for a buffer? 
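(Editorial aside on the "kill -- -10745" suggestion above: a negative PID tells kill to signal a whole process group, so a script and all of its children stop together. Below is a minimal Python sketch of the same idea, assuming a POSIX system. The helper names `start_in_own_group` and `kill_group` and the `sleep` stand-in are hypothetical, not anything used in the channel.)

```python
import os
import signal
import subprocess


def start_in_own_group(cmd):
    """Start cmd as the leader of a new session, and so of a new process group."""
    # start_new_session=True makes the child call setsid() before exec,
    # so its process group id equals its own pid.
    return subprocess.Popen(cmd, start_new_session=True)


def kill_group(proc, sig=signal.SIGTERM):
    """Signal every process in proc's group, like `kill -- -PGID` in a shell."""
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, sig)
    # Reap the group leader and return its exit status.
    return proc.wait(timeout=10)


if __name__ == "__main__":
    # Hypothetical stand-in for a long-running script that spawns children.
    worker = start_in_own_group(["sh", "-c", "sleep 60 & sleep 60"])
    print("process group id:", os.getpgid(worker.pid))
    print("exit status:", kill_group(worker))
```

(Because the group id is known up front, someone else with sufficient privileges can stop the whole run with a single command, which is the property asked for in the discussion.)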
[06:01:09] if you need a larger buffer, just increase it [06:02:04] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/admin/files/home/jynus/.screenrc [06:05:50] "something that will not make the queries fail if you disconnect" you run your command, then you detach the session and can disconnect [06:06:09] Anyway, I'm going to bed now. [06:06:17] if you lose connection mid-query, screen doesn't care, it keeps running [06:06:33] also other people can connect and manage it if needed [06:06:46] very handy when working with other people [06:07:23] let me convince you tomorrow of the goodness of screen :-) [06:11:40] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:14:17] Just read the backlog :-( [06:21:11] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 291.86 seconds [06:22:47] (03PS1) 10Marostegui: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398211 (https://phabricator.wikimedia.org/T174569) [06:24:22] you would have come in handy to research the source while someone puts out the fires, by luck I was able to identify it quickly because I realized the edits were recent [06:25:10] Yeah, I didn't get any pages so I didn't know what was going on :( [06:25:50] if you can have a look at the ticket, that would help a lot [06:26:09] the pt table checksum for s5? 
[06:26:12] leave the others as it is now, do s5 now [06:26:28] dewiki specifically [06:26:33] Yeah, will do it today [06:32:16] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398211 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:33:42] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398211 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:35:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1091 - T174569 (duration: 01m 08s) [06:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:30] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [06:35:36] !log Deploy schema change on db1091 (s4) - T174569 [06:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:51] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398211 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:49:56] (03PS1) 10Marostegui: db-eqiad.php: Depool db1101:3318 and db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398214 (https://phabricator.wikimedia.org/T161294) [06:52:22] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1101:3318 and db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398214 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [06:53:47] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1101:3318 and db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398214 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [06:55:37] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1101:3318 and db1109 - T161294 (duration: 01m 09s) [06:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:48] T161294: run 
pt-tablechecksum on s5 - https://phabricator.wikimedia.org/T161294 [06:56:38] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1101:3318 and db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398215 [06:56:40] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1101:3318 and db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398214 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [06:58:22] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1101:3318 and db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398215 (owner: 10Marostegui) [07:00:53] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1101:3318 and db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398215 (owner: 10Marostegui) [07:00:53] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1101:3318 and db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398215 (owner: 10Marostegui) [07:02:04] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1101:3318 and db1109 - T161294 (duration: 01m 08s) [07:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:13] T161294: run pt-tablechecksum on s5 - https://phabricator.wikimedia.org/T161294 [07:02:29] (03PS1) 10Marostegui: db-eqiad.php: Depool db1109 and db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398216 (https://phabricator.wikimedia.org/T161294) [07:03:51] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [07:04:50] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [07:05:19] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1109 and db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398216 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [07:06:50] (03Merged) 10jenkins-bot: 
db-eqiad.php: Depool db1109 and db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398216 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [07:07:37] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1109 and db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398216 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [07:08:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1101:3318 and db1109 - T161294 (duration: 01m 07s) [07:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:21] T161294: run pt-tablechecksum on s5 - https://phabricator.wikimedia.org/T161294 [07:08:49] !log Stop replication in sync on db1109 and db1101:3318 - T161294 [07:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:15] !log Stop replication and set read-only on labsdb1003 - T142807 [07:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:26] T142807: Migrate all users to new Wiki Replica cluster and decommission old hardware - https://phabricator.wikimedia.org/T142807 [07:25:24] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3836626 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1111.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-rei... [07:27:34] 10Operations, 10Mail, 10Toolforge, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812#3836628 (10faidon) You can use `exim4 -bt foo@example.org` to test how/where exim4 would deliver a specific address (if at all). Also, note that even with... 
[07:31:28] 10Operations, 10Puppet, 10Patch-For-Review: custom fact interface_primary breaks under newer versions of facter - https://phabricator.wikimedia.org/T182819#3835611 (10faidon) Facter 3 is quite different than Facter 2, and we're not ready to use this -- that would be a transition of its own, I think. For what... [07:33:48] (03CR) 10Faidon Liambotis: [C: 04-1] "What Riccardo said. For instance, you may have:" [puppet] - 10https://gerrit.wikimedia.org/r/398120 (https://phabricator.wikimedia.org/T182819) (owner: 10Herron) [07:38:58] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 [07:56:16] (03PS1) 10Marostegui: site.pp: Failover labsdb1011 to labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/398222 (https://phabricator.wikimedia.org/T174569) [07:56:37] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 [07:59:09] (03PS16) 10TerraCodes: Remove single editor tab for plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393121 (https://phabricator.wikimedia.org/T181045) [08:00:11] (03PS30) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) [08:05:35] PROBLEM - configured eth on db1111 is CRITICAL: Return code of 255 is out of bounds [08:07:35] RECOVERY - configured eth on db1111 is OK: OK - interfaces up [08:11:26] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3836666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1111.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['db1111.eqiad.wmnet'] ``` [08:13:24] 10Operations: Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image - https://phabricator.wikimedia.org/T182702#3836669 (10Marostegui) My last reimaged failed 
with: ``` 08:11:20 | db1111.eqiad.wmnet | Unable to run wmf-auto-reimage-host: could not convert string to float: Warning: Setting... [08:15:55] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [08:16:16] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [08:21:15] 10Operations, 10Release Pipeline, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), and 2 others: Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454#3836675 (10hashar) \o/ [08:32:13] (03CR) 10Jcrespo: [C: 031] site.pp: Failover labsdb1011 to labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/398222 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [08:32:42] (03CR) 10Marostegui: [C: 032] site.pp: Failover labsdb1011 to labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/398222 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [08:38:21] (03PS1) 10Hashar: Migrate contint::worker_localhost to a profile [puppet] - 10https://gerrit.wikimedia.org/r/398227 [08:39:06] !log Reload dbproxy1011 config [08:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:09] PROBLEM - MegaRAID on db1059 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [08:45:19] ACKNOWLEDGEMENT - MegaRAID on db1059 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T182853 [08:45:22] 10Operations, 10ops-eqiad: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T182853#3836729 (10ops-monitoring-bot) [08:50:11] (03CR) 10Hashar: [C: 04-1] "I first need to make sure the catalog compiles properly on the nodepool image https://gerrit.wikimedia.org/r/#/c/398228/" [puppet] - 10https://gerrit.wikimedia.org/r/398227 (owner: 10Hashar) [08:50:35] 
10Operations: Puppet: Setting configtimeout is deprecated - https://phabricator.wikimedia.org/T182585#3827993 (10Marostegui) This is affecting new installs: T182702#3836669 [08:51:25] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T182853#3836737 (10Marostegui) p:05Triage>03Normal [08:52:40] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T182853#3836729 (10Marostegui) a:03Cmjohnson @Cmjohnson please get this disk replaced whenever you can Thanks! [09:08:41] (03PS3) 10Elukey: modules::varnishkafka: update to latest sha [puppet] - 10https://gerrit.wikimedia.org/r/398057 [09:12:47] (03CR) 10Alexandros Kosiaris: "FYI, http://tools.wmflabs.org/sal/log/AU8UAgVc6snAnmqnLhxJ" [puppet] - 10https://gerrit.wikimedia.org/r/398078 (https://phabricator.wikimedia.org/T181099) (owner: 10Awight) [09:16:55] 10Operations, 10Release Pipeline, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), and 2 others: Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454#3836778 (10akosiaris) A more interesting question would be why lawrencium has d... [09:17:43] (03CR) 10Hashar: [C: 031] "The Nodepool manifest works now :)" [puppet] - 10https://gerrit.wikimedia.org/r/398227 (owner: 10Hashar) [09:21:08] (03CR) 10Jcrespo: "It doesn't matter what mysql client does; it should matter what mediawiki/WMF deployment does by default, which is the 3306. Revert if you" [puppet] - 10https://gerrit.wikimedia.org/r/397912 (https://phabricator.wikimedia.org/T182713) (owner: 10Anomie) [09:22:14] (03CR) 10Jcrespo: [C: 032] "> What happened with this patch?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/398058 (owner: 10Jcrespo) [09:22:27] (03PS3) 10Jcrespo: Revert "mariadb: Depool db1067 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398058 [09:22:59] (03CR) 10Marostegui: "I guess he means that despite the +2 it never got merged?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398058 (owner: 10Jcrespo) [09:25:11] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/394552 (https://phabricator.wikimedia.org/T181794) (owner: 10Filippo Giunchedi) [09:25:56] (03CR) 10jerkins-bot: [V: 04-1] tox: run mtail tests [puppet] - 10https://gerrit.wikimedia.org/r/394552 (https://phabricator.wikimedia.org/T181794) (owner: 10Filippo Giunchedi) [09:26:42] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1067 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398058 (owner: 10Jcrespo) [09:27:04] (03PS3) 10Filippo Giunchedi: tox: run mtail tests [puppet] - 10https://gerrit.wikimedia.org/r/394552 (https://phabricator.wikimedia.org/T181794) [09:27:06] (03PS4) 10Ema: tox: run mtail tests [puppet] - 10https://gerrit.wikimedia.org/r/394552 (https://phabricator.wikimedia.org/T181794) (owner: 10Filippo Giunchedi) [09:27:09] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1067 (duration: 01m 08s) [09:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:37] godog: sorry for stepping on your toes with the rebase :) [09:27:54] (03CR) 10jerkins-bot: [V: 04-1] tox: run mtail tests [puppet] - 10https://gerrit.wikimedia.org/r/394552 (https://phabricator.wikimedia.org/T181794) (owner: 10Filippo Giunchedi) [09:27:56] (03CR) 10Jcrespo: "This is nonsense- can we please remove them on all of codfw, and then, if it does not create problems, on all of eqiad. This adds lots of " [puppet] - 10https://gerrit.wikimedia.org/r/398186 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [09:28:13] ema: no worries! 
not that it changed the result [09:28:48] (03CR) 10Jcrespo: "with "all of codfw" I mean the cluster mysql_codfw. other hosts that are not part of a role can be deleted, too, as not important." [puppet] - 10https://gerrit.wikimedia.org/r/398186 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [09:28:50] godog: that's because varnishmtail tests are going to be fixed merging https://gerrit.wikimedia.org/r/#/c/397876/ [09:29:28] indeed, so working as intended! [09:29:41] yup! [09:30:19] (03CR) 10Filippo Giunchedi: [C: 031] Add postgresql::prometheus class [puppet] - 10https://gerrit.wikimedia.org/r/392438 (https://phabricator.wikimedia.org/T177196) (owner: 10Alexandros Kosiaris) [09:32:12] (03CR) 10Elukey: [C: 032] modules::varnishkafka: update to latest sha [puppet] - 10https://gerrit.wikimedia.org/r/398057 (owner: 10Elukey) [09:34:37] (03PS3) 10Ema: Add PUT to list of allowed methods for text varnish [puppet] - 10https://gerrit.wikimedia.org/r/398197 (https://phabricator.wikimedia.org/T182825) (owner: 10Gergő Tisza) [09:34:49] (03CR) 10Ema: [V: 032 C: 032] Add PUT to list of allowed methods for text varnish [puppet] - 10https://gerrit.wikimedia.org/r/398197 (https://phabricator.wikimedia.org/T182825) (owner: 10Gergő Tisza) [09:39:16] (03CR) 10Filippo Giunchedi: "See inline, we'll also need to prefix metrics with 'rabbitmq_', this can be done from the metric name itself or with the 'namespace' attri" (031 comment) [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/398003 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [09:44:43] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3836798 (10fgiunchedi) [10:01:54] (03PS1) 10Elukey: profile::manifests::hadoop::master: fix dashboard_links parameter [puppet] - 10https://gerrit.wikimedia.org/r/398232 [10:02:24] (03CR) 10Elukey: [C: 032] 
profile::manifests::hadoop::master: fix dashboard_links parameter [puppet] - 10https://gerrit.wikimedia.org/r/398232 (owner: 10Elukey) [10:06:11] (03PS1) 10Elukey: profile::hadoop::master: fix dashboard_links param - part2 [puppet] - 10https://gerrit.wikimedia.org/r/398233 [10:06:42] (03CR) 10Elukey: [C: 032] profile::hadoop::master: fix dashboard_links param - part2 [puppet] - 10https://gerrit.wikimedia.org/r/398233 (owner: 10Elukey) [10:06:48] 10Operations, 10Discovery-Search (Current work), 10Goal, 10Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3836829 (10Gehel) Now that I understand a bit better how prometheus works, the jmx_exporter starts to be scary. If I understand cor... [10:08:30] (03PS3) 10Muehlenhoff: Add Prometheus exporter for RabbitMQ [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/398003 (https://phabricator.wikimedia.org/T181802) [10:08:59] (03CR) 10Muehlenhoff: Add Prometheus exporter for RabbitMQ (031 comment) [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/398003 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [10:11:59] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:13:16] 10Operations, 10Discovery-Search (Current work), 10Goal, 10Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3836838 (10Gehel) While updating the [[ https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles-prometheus?orgId=1&pan... 
[10:15:15] (03PS1) 10Volans: wmf-auto-reimage: fix check puppet run [puppet] - 10https://gerrit.wikimedia.org/r/398234 [10:21:28] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3450565 (10Verdy_p) Moldavian: about the renaming/redirect from "mo.wiki(pedia|dictionary).org" to "ro.wiki(pedia|dictionary).org", this sho... [10:23:04] (03PS1) 10Volans: puppet common: make check more resilient [puppet] - 10https://gerrit.wikimedia.org/r/398236 [10:24:33] 10Operations, 10Goal, 10User-fgiunchedi, 10cloud-services-team (Kanban): Create Prometheus exporter for Blazegraph - https://phabricator.wikimedia.org/T182857#3836850 (10MoritzMuehlenhoff) p:05Triage>03High [10:25:23] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3836865 (10Strainu) The transliterator is not active on the Romanian projects. [10:26:06] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3836866 (10Verdy_p) Note also that "uselang=mo" really displays a Romanian-Cyrillic UI, but "uselang=ro-latn" just renders as an English UI... [10:31:48] (03PS2) 10Volans: wmf-auto-reimage: fix check puppet run [puppet] - 10https://gerrit.wikimedia.org/r/398234 [10:31:50] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3836873 (10Strainu) @Verdy_p , where is ro-latn defined and why do you expect it to work? 
[10:32:36] (03CR) 10Volans: [C: 032] wmf-auto-reimage: fix check puppet run [puppet] - 10https://gerrit.wikimedia.org/r/398234 (owner: 10Volans) [10:32:53] 10Operations, 10Cloud-Services: Rename @thiemowmde's account in LDAP, Wikitech, and Gerrit - https://phabricator.wikimedia.org/T181130#3836875 (10akosiaris) 05Open>03Resolved a:03akosiaris Gerrit change done and wikitech page renamed. Closing this, feel free to reopen if something is wrong. [10:32:56] (03PS2) 10Volans: puppet common: make check more resilient [puppet] - 10https://gerrit.wikimedia.org/r/398236 [10:33:52] (03CR) 10Volans: [C: 032] puppet common: make check more resilient [puppet] - 10https://gerrit.wikimedia.org/r/398236 (owner: 10Volans) [10:34:22] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3836880 (10Verdy_p) So now with the redirect being active, you've dropped completely the capability of contributing and displaying the UI in... [10:35:57] (03PS8) 10Elukey: role::cache::canary: add a test Varnishkafka instance [puppet] - 10https://gerrit.wikimedia.org/r/397765 [10:38:23] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3836884 (10Strainu) You seem to be behind on the news: the language committee has decided that mo.wp should be deleted, not just closed, fol... [10:39:06] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3836885 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1111.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-rei... 
[10:39:16] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3836886 (10Verdy_p) "ro-latn" and "ro" are the same. I don't want this addition as a separate locale (these can be simply aliases), but "ro... [10:41:19] (03CR) 10Elukey: "Last change removed $conf_template = 'varnishkafka/varnishkafka_v4.conf.erb' after https://gerrit.wikimedia.org/r/#/c/398057/" [puppet] - 10https://gerrit.wikimedia.org/r/397765 (owner: 10Elukey) [10:42:42] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3836899 (10Verdy_p) The comity can decide what they want. It was NOT requested to *delete* Moldavian but merge it to Romanian. Dropping all... [10:57:27] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3836917 (10Strainu) >>! In T169450#3836899, @Verdy_p wrote: > The comity can decide what they want. It was NOT requested to *delete* Moldavi... [11:10:21] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3836937 (10Verdy_p) The consensus was only to join the two communities so they create a joint content without dividing it, Blocking all acce... 
[11:12:07] !log akosiaris@tin Started deploy [ores/deploy@b67bba7]: T181661 [11:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:18] T181661: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661 [11:13:02] !log akosiaris@tin Finished deploy [ores/deploy@b67bba7]: T181661 (duration: 00m 55s) [11:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:59] !log akosiaris@tin Started deploy [ores/deploy@b67bba7]: T181661 [11:14:02] !log akosiaris@tin Finished deploy [ores/deploy@b67bba7]: T181661 (duration: 00m 03s) [11:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:52] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3836942 (10akosiaris) >>! In T181661#3834939, @awight wrote: > Looks like I'm getting the same error. > >> commit b67bba77acb7c0ffc678201c9f3f... [11:16:12] !log akosiaris@tin Started deploy [ores/deploy@b67bba7]: T181661 [11:16:16] !log akosiaris@tin Finished deploy [ores/deploy@b67bba7]: T181661 (duration: 00m 03s) [11:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:12] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3836955 (10EddieGP) >>! In T169450#3836846, @Verdy_p wrote: > Moldavian: about the renaming/redirect from "mo.wiki(pedia|dictionary).org" to... 
[11:23:31] !log awight@tin Started deploy [ores/deploy@b67bba7]: (non-production) Update ORES on new cluster [11:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:43] !log awight@tin Finished deploy [ores/deploy@b67bba7]: (non-production) Update ORES on new cluster (duration: 00m 11s) [11:23:52] !log awight@tin Started deploy [ores/deploy@b67bba7]: (non-production) Update ORES on new cluster [11:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:57] !log awight@tin Finished deploy [ores/deploy@b67bba7]: (non-production) Update ORES on new cluster (duration: 00m 05s) [11:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:08] !log awight@tin Started deploy [ores/deploy@b67bba7]: (non-production) Update ORES on new cluster [11:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:28] !log awight@tin Finished deploy [ores/deploy@b67bba7]: (non-production) Update ORES on new cluster (duration: 00m 20s) [11:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:36] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3836975 (10awight) Not sure if this is related, but now I'm seeing a deploy-local failure with no diagnostics other than error code 70: {P6463} [11:31:13] 10Operations, 10Cloud-Services: Rename @thiemowmde's account in LDAP, Wikitech, and Gerrit - https://phabricator.wikimedia.org/T181130#3836988 (10thiemowmde) 05Resolved>03Open I had to delete and recreate the link from my Phabricator account to LDAP, but now everything looks fine. Thanks a lot! Unfortunat... 
[11:36:12] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3837011 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1111.eqiad.wmnet'] ``` and were **ALL** successful. [11:44:50] 10Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): Allow contint-admins to interact with docker on CI hosts - https://phabricator.wikimedia.org/T182860#3837027 (10hashar) [11:44:55] (03PS1) 10Hashar: contint: allow releng to interact with Docker [puppet] - 10https://gerrit.wikimedia.org/r/398240 (https://phabricator.wikimedia.org/T182860) [11:45:28] (03CR) 10Hashar: "An example use is to deploy new containers with docker-pkg which is being worked on https://gerrit.wikimedia.org/r/#/c/398086/" [puppet] - 10https://gerrit.wikimedia.org/r/398240 (https://phabricator.wikimedia.org/T182860) (owner: 10Hashar) [11:50:53] 10Operations, 10Cloud-Services: Rename @thiemowmde's account in LDAP, Wikitech, and Gerrit - https://phabricator.wikimedia.org/T181130#3837045 (10akosiaris) Ah indeed. Sorry about that. However, I guess we should delete the new user create when you logged in because I can no longer rename `Thiemo Kreuz` to `Th... [11:52:06] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398241 [11:52:23] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398241 [11:53:59] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3837050 (10Marostegui) [11:54:47] 10Operations: Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image - https://phabricator.wikimedia.org/T182702#3837051 (10Marostegui) db1111 got installed fine. 
But @Volans and myself noticed that we are no longer generating the puppet cert on the server, so the installation gets stuck u... [11:55:39] (03PS19) 10Jcrespo: mariadb: Remove mariadb.pp and move some old roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T150850) [11:57:22] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398241 (owner: 10Marostegui) [11:58:52] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398241 (owner: 10Marostegui) [11:59:02] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398241 (owner: 10Marostegui) [11:59:42] (03PS1) 10Volans: base: fix dependency relationship [puppet] - 10https://gerrit.wikimedia.org/r/398244 (https://phabricator.wikimedia.org/T182702) [12:00:06] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3837059 (10mmodell) `scap deploy-log -v` reveals more: ``` 11:27:10 [ores1001.eqiad.wmnet] Unhandled error: Traceback (most recent call last):... [12:00:22] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1091 - T174569 (duration: 01m 08s) [12:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:33] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [12:05:59] 10Operations, 10Cloud-Services: Rename @thiemowmde's account in LDAP, Wikitech, and Gerrit - https://phabricator.wikimedia.org/T181130#3837080 (10thiemowmde) Huh? https://wikitech.wikimedia.org/wiki/Special:Log/Thiemo_Kreuz_(WMDE). Yea, sure. This is just an unintentional artifact. 
[12:06:09] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3837081 (10awight) Those revisions aren't in gerrit. I think the github -> gerrit mirroring broke when we were messing around with pointing to... [12:06:56] (03CR) 10Muehlenhoff: [C: 031] base: fix dependency relationship [puppet] - 10https://gerrit.wikimedia.org/r/398244 (https://phabricator.wikimedia.org/T182702) (owner: 10Volans) [12:08:55] (03PS1) 10Marostegui: Revert "site.pp: Failover labsdb1011 to labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/398246 [12:09:05] (03PS2) 10Marostegui: Revert "site.pp: Failover labsdb1011 to labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/398246 [12:10:19] marostegui: wait [12:10:23] did you deploy already? [12:10:24] waiting :) [12:10:25] no [12:10:28] only rebased [12:10:36] can you wait so I can upgrade it? [12:10:41] of course :) [12:10:50] if it already finished [12:10:51] good point [12:10:54] yeah, it is done [12:11:03] I can do that later [12:11:21] ok! 
thanks :) [12:11:47] !log disable puppet on all databases to deploy safely https://gerrit.wikimedia.org/r/398246 [12:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:02] marostegui: please send me a comment on gerrit so I remember [12:12:10] haha doing it now [12:12:20] otherwise I will gorget [12:12:22] *forget [12:12:27] (03CR) 10Volans: "compiler result: https://puppet-compiler.wmflabs.org/compiler02/9349/db1098.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/398244 (https://phabricator.wikimedia.org/T182702) (owner: 10Volans) [12:12:33] (03CR) 10Marostegui: "@jcrespo: please merge this once you have upgraded labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/398246 (owner: 10Marostegui) [12:12:38] I need all labsdbs upgraded to upgrade also their masters [12:15:42] (03CR) 10Jcrespo: [C: 032] mariadb: Remove mariadb.pp and move some old roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [12:16:21] my biggest fear is that the patch will enable the firewall on some host where we forgot to enable it [12:16:29] blocking all connections [12:16:39] in theory, it should have been enabled everywhere [12:17:14] 10Operations, 10Cloud-Services: Rename @thiemowmde's account in LDAP, Wikitech, and Gerrit - https://phabricator.wikimedia.org/T181130#3837096 (10akosiaris) And here we are https://wikitech.wikimedia.org/w/index.php?title=Special%3AContributions&contribs=user&target=Thiemo+Kreuz+%28WMDE%29&namespace=&tagfilt...
[12:18:23] !log re-initialize cassandra on maps-test2001 - T182583 [12:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:33] T182583: maps-test2001 is low on disk space - https://phabricator.wikimedia.org/T182583 [12:18:38] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3837099 (10awight) The Phabricator control panels look happy, https://phabricator.wikimedia.org/source/editquality/manage/uris/ shows that we'r... [12:19:18] wikitech seems ok [12:19:35] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3837101 (10Verdy_p) About redirects keeping the title name: as the moldavian past page is most probably written in Cyrillic, forwarding to t... [12:24:53] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3797363 (10awight) Oops—we aren't expecting this repo to be mirrored to gerrit. So the surprise is that the revision exists in Phabricator but... [12:31:09] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3837143 (10mmodell) @awight: yeah, I'm getting to the bottom of it now. The issue is that the commit does not exist on a local branch on tin, i... 
[12:33:25] (03CR) 10Jon Harald Søby: "As part of deploying this, the deployer needs to run the following script:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396381 (https://phabricator.wikimedia.org/T181503) (owner: 10Jon Harald Søby) [12:39:55] !log awight@tin Started deploy [ores/deploy@b67bba7]: (non-production) Update ORES on new cluster [12:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:39] !log awight@tin Finished deploy [ores/deploy@b67bba7]: (non-production) Update ORES on new cluster (duration: 00m 45s) [12:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:04] !log awight@tin Started deploy [ores/deploy@b67bba7]: (non-production) Update ORES on new cluster [12:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:02] !log awight@tin Finished deploy [ores/deploy@b67bba7]: (non-production) Update ORES on new cluster (duration: 00m 59s) [12:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:41] (03CR) 10Elukey: [C: 031] "I got the same weird issue and I was wondering the root cause, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/398244 (https://phabricator.wikimedia.org/T182702) (owner: 10Volans) [12:45:33] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3837171 (10awight) 05Open>03Resolved a:03awight Using a workaround for T182865, where we go into submodules and checkout master before su... [12:48:35] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3837176 (10Aklapper) Please contact the [[ https://meta.wikimedia.org/wiki/Language_committee | Language Committee ]] when it comes to poten... 
[12:50:31] 10Operations, 10Puppet, 10Epic, 10Need-volunteer, 10Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#3837180 (10jcrespo) [12:51:43] jouncebot: next [12:51:45] In 1 hour(s) and 8 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171214T1400) [12:52:48] (03PS1) 10Marostegui: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398249 (https://phabricator.wikimedia.org/T174569) [12:56:03] !log stress testing on ores1*.eqiad.wmnet cluster, T182249 [12:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:14] T182249: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249 [12:58:24] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398249 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [12:59:49] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398249 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [13:00:00] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398249 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [13:01:22] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1089 - T174569 (duration: 01m 07s) [13:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:34] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [13:02:34] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3837238 (10EddieGP) >>! 
In T169450#3837101, @Verdy_p wrote: > So I suggest redirecting user pages and Wiktionary without change of tiltle, a... [13:10:41] !log Deploy schema change on db1089 (s1) - T174569 [13:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:50] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [13:12:29] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3837258 (10Verdy_p) It's not complicate to add "mo-x-old" to direct to the readonly archive of the "mo" wiki, now that "mo" redirects to "ro... [13:18:44] (03PS12) 10Rush: cloud: setup for attended upgrade process [puppet] - 10https://gerrit.wikimedia.org/r/394200 (https://phabricator.wikimedia.org/T181647) [13:19:15] (03CR) 10jerkins-bot: [V: 04-1] cloud: setup for attended upgrade process [puppet] - 10https://gerrit.wikimedia.org/r/394200 (https://phabricator.wikimedia.org/T181647) (owner: 10Rush) [13:19:26] (03PS5) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add reporter script [puppet] - 10https://gerrit.wikimedia.org/r/394572 (https://phabricator.wikimedia.org/T181647) [13:19:56] (03CR) 10jerkins-bot: [V: 04-1] apt: unattended-upgrades: add reporter script [puppet] - 10https://gerrit.wikimedia.org/r/394572 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [13:22:32] (03PS13) 10Rush: cloud: setup for attended upgrade process [puppet] - 10https://gerrit.wikimedia.org/r/394200 (https://phabricator.wikimedia.org/T181647) [13:26:38] !log mobrovac@tin Started deploy [restbase/deploy@187d8ba]: Remove Trending Edits end point and stop storing feed results in Cassandra - T180384 T179412 [13:26:45] (03CR) 10Rush: [C: 032] cloud: setup for attended upgrade process [puppet] - 10https://gerrit.wikimedia.org/r/394200 (https://phabricator.wikimedia.org/T181647) (owner: 10Rush) [13:26:48] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:49] T179412: Stop storing feeds in Cassandra - https://phabricator.wikimedia.org/T179412 [13:26:49] T180384: Turn off Trending Service - https://phabricator.wikimedia.org/T180384 [13:27:16] jynus: I picked up Jcrespo: mariadb: Remove mariadb.pp and move some old roles to profiles (c698af4) via puppet-merge is that gtg? [13:28:23] marostegui: do you know^ [13:28:37] chasemp: I thought he already merged it [13:28:53] ok I'm going to pull it down w/ my change then [13:29:27] He disabled puppet on eqiad just in case it breaks something [13:29:29] (03CR) 10Filippo Giunchedi: "LGTM" [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/398003 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [13:29:31] it only breaks codfw :) [13:29:31] (03PS6) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add reporter script [puppet] - 10https://gerrit.wikimedia.org/r/394572 (https://phabricator.wikimedia.org/T181647) [13:29:43] 10Operations, 10Cloud-Services: Rename @thiemowmde's account in LDAP, Wikitech, and Gerrit - https://phabricator.wikimedia.org/T181130#3837373 (10thiemowmde) 05Open>03Resolved Thanks! [13:29:55] (03CR) 10jerkins-bot: [V: 04-1] apt: unattended-upgrades: add reporter script [puppet] - 10https://gerrit.wikimedia.org/r/394572 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [13:29:55] :) no worries, fyi tho it didn't land until ...now :) [13:30:11] then we will now see what it breaks :) [13:30:14] jynus: ^ [13:30:22] (he is out for lunch, but I will keep an eye) [13:30:59] (03CR) 10Lucas Werkmeister (WMDE): "2017-12-18 is unfortunately within the holiday “no deploys” period, so this will have to wait a bit longer.
I’ll probably request it for S" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396311 (https://phabricator.wikimedia.org/T180614) (owner: 10Lucas Werkmeister (WMDE)) [13:32:05] (03CR) 10Filippo Giunchedi: [C: 031] Add Prometheus exporter for RabbitMQ [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/398003 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [13:32:15] !log mobrovac@tin Finished deploy [restbase/deploy@187d8ba]: Remove Trending Edits end point and stop storing feed results in Cassandra - T180384 T179412 (duration: 05m 37s) [13:32:21] (03PS7) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add reporter script [puppet] - 10https://gerrit.wikimedia.org/r/394572 (https://phabricator.wikimedia.org/T181647) [13:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:27] T179412: Stop storing feeds in Cassandra - https://phabricator.wikimedia.org/T179412 [13:32:27] T180384: Turn off Trending Service - https://phabricator.wikimedia.org/T180384 [13:33:05] (03PS1) 10Elukey: Replace kafka1018 with kafka1023 in the analytics kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/398255 (https://phabricator.wikimedia.org/T181518) [13:33:21] (03PS1) 10Rush: Revert "cloud: setup for attended upgrade process" [puppet] - 10https://gerrit.wikimedia.org/r/398256 [13:33:33] (03CR) 10Rush: [V: 032 C: 032] Revert "cloud: setup for attended upgrade process" [puppet] - 10https://gerrit.wikimedia.org/r/398256 (owner: 10Rush) [13:37:06] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:38:31] 10Operations, 10ORES, 10Patch-For-Review, 10Performance, 10Scoring-platform-team (Current): Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249#3837415 (10awight) Ran another test: https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=15132561... 
[13:38:48] marostegui: ^ [13:38:49] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: {"message":"Server Error: Evaluation Error: Error while evaluating a Function Call, Could not find class ::role::mariadb::monitor::dba for labservices1001.wikimedia.org at /etc/puppet/modules/profile/manifests/openstack/base/pdns/auth/db.pp:13:5 on node labservices1001.wikimedia.org","issue_kind":"RUNTIME_ERROR","stacktrace":["Warning: The [13:38:49] 'stacktrace' property is deprecated and will be removed in a future version of Puppet. For security reasons, stacktraces are not returned with Puppet HTTP Error responses."]} [13:38:50] Warning: Not using cache on failed catalog [13:38:51] (03PS2) 10Elukey: Replace kafka1018 with kafka1023 in the analytics kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/398255 (https://phabricator.wikimedia.org/T181518) [13:38:52] Error: Could not retrieve catalog; skipping run [13:38:56] oops, sorry that was spammy [13:39:52] I ran puppet on db1089 and went fine [13:40:14] you think it is jaime's patch? 
[13:41:11] !log update facts for puppet compiler to pick up new hosts [13:41:21] marostegui: pretty sure as that role is called within that profile for the databases used for powerdns [13:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:29] !log upgrade grafana to 4.6.3 - T182294 [13:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:39] T182294: Upgrade grafana to 4.6.2 - https://phabricator.wikimedia.org/T182294 [13:42:55] 10Operations, 10monitoring: Upgrade grafana to 4.6.3 - https://phabricator.wikimedia.org/T182294#3837425 (10fgiunchedi) [13:43:19] 10Operations, 10monitoring: Upgrade grafana to 4.6.3 - https://phabricator.wikimedia.org/T182294#3819283 (10fgiunchedi) 05Open>03Resolved Updated grafana to 4.6.3 (released today) [13:43:58] marostegui: should I revert or do you know what the right thing to go in profile/manifests/openstack/base/pdns/auth/db.pp is for the powerdns databases? [13:44:22] I am getting some stacktraces to comment on the ticket [13:44:37] kk [13:44:41] but yeah, let's revert jynus patch [13:45:15] (03PS1) 10Rush: Revert "mariadb: Remove mariadb.pp and move some old roles to profiles" [puppet] - 10https://gerrit.wikimedia.org/r/398257 [13:45:20] (03PS2) 10Rush: Revert "mariadb: Remove mariadb.pp and move some old roles to profiles" [puppet] - 10https://gerrit.wikimedia.org/r/398257 [13:45:22] kk [13:45:51] (03CR) 10Rush: [V: 032 C: 032] Revert "mariadb: Remove mariadb.pp and move some old roles to profiles" [puppet] - 10https://gerrit.wikimedia.org/r/398257 (owner: 10Rush) [13:46:35] marostegui: it's a small thing to correct but I couldn't parse out what the right thing to do there is based on the changes quickly [13:46:35] (03PS1) 10Jcrespo: Revert "Revert "mariadb: Remove mariadb.pp and move some old roles to profiles"" [puppet] - 10https://gerrit.wikimedia.org/r/398258 [13:47:05] 10Operations, 10Puppet, 10Epic, 10Need-volunteer, 10Patch-For-Review: align
puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#3837445 (10Marostegui) [13:47:21] !log I'm reverting https://gerrit.wikimedia.org/r/#/c/394541/ as it broke the database puppet for labservices (used for powerdns backend) [13:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:17] marostegui: jynus reverting fixed puppet on labservices1001 fyi [13:48:40] chasemp: Thanks - I commented on the ticket and added you as a subscriber. Sorry for the hassle [13:49:06] no biggie, that's easy to miss but would benefit greatly from the cleanup I bet [13:52:04] (03PS2) 10Jcrespo: Revert "Revert "mariadb: Remove mariadb.pp and move some old roles to profiles"" [puppet] - 10https://gerrit.wikimedia.org/r/398258 [13:52:06] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:52:12] (03PS1) 10Arturo Borrero Gonzalez: Revert "Revert "cloud: setup for attended upgrade process"" [puppet] - 10https://gerrit.wikimedia.org/r/398259 [13:52:28] It was a single ::role into ::profile [13:53:33] BTW, a profile cannot include other profiles [13:53:36] (03PS2) 10Rush: Revert "Revert "cloud: setup for attended upgrade process"" [puppet] - 10https://gerrit.wikimedia.org/r/398259 (owner: 10Arturo Borrero Gonzalez) [13:53:48] so that is for you to fix later [13:54:09] I think the directive is it should be rare but not never [13:54:09] chasemp: https://gerrit.wikimedia.org/r/#/c/398258/2/modules/profile/manifests/openstack/base/pdns/auth/db.pp [13:54:17] but functionally it's not a problem for puppet [13:54:23] I think it will make the linter complain [13:54:41] I couldn't care less anyway, please +1 the patch above [13:54:42] jynus: I don't think so, but we'll see soon [13:54:50] (03CR) 10Rush: [C: 031] Revert "Revert "mariadb: Remove mariadb.pp and move some old roles to profiles"" [puppet] - 10https://gerrit.wikimedia.org/r/398258 (owner: 10Jcrespo) 
[13:55:03] (03CR) 10Jcrespo: [C: 032] Revert "Revert "mariadb: Remove mariadb.pp and move some old roles to profiles"" [puppet] - 10https://gerrit.wikimedia.org/r/398258 (owner: 10Jcrespo) [13:55:41] sorry about that [13:56:30] it shouldn't have created runtime issues, did it? [13:56:42] nah, just puppet complaining [13:56:44] if it did, it would be a bug [13:56:53] ok, that is good to know, we on purpose do that [13:56:57] thanks jynus [13:57:15] can you check that, or if you tell me the server [13:57:19] I can check [13:57:33] already there jynus, looking now [13:57:59] jynus: all is well, noop [13:58:12] it is hard [13:58:18] so many non-obvious dependencies [13:58:22] I use git grep [13:58:30] but it is hard to get everything right [13:58:33] (03CR) 10Elukey: "Pcc https://puppet-compiler.wmflabs.org/compiler02/9351/ for:" [puppet] - 10https://gerrit.wikimedia.org/r/398255 (https://phabricator.wikimedia.org/T181518) (owner: 10Elukey) [13:58:49] on the other side, doing back-compatible changes sometimes makes things worse [13:59:42] how can it create so many errors if it only affects 1 or 2 servers? [14:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 8 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171214T1400). [14:00:05] stephanebisson: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:11] Hi! [14:00:28] o/ [14:00:36] o/ [14:00:36] 10Operations, 10Discovery-Search (Current work), 10Goal, 10Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3837468 (10fgiunchedi) >>! 
In T181627#3836838, @Gehel wrote: > While updating the [[ https://grafana.wikimedia.org/dashboard/db/ela... [14:00:42] (03PS3) 10Rush: Revert "Revert "cloud: setup for attended upgrade process"" [puppet] - 10https://gerrit.wikimedia.org/r/398259 (owner: 10Arturo Borrero Gonzalez) [14:00:51] (03CR) 10Rush: [C: 031] Revert "Revert "cloud: setup for attended upgrade process"" [puppet] - 10https://gerrit.wikimedia.org/r/398259 (owner: 10Arturo Borrero Gonzalez) [14:00:59] I can SWAT today, unless stephanebisson wants to deploy his own changes? [14:01:19] zeljkof: I can't [14:01:35] bah we should have +2ed them before the swat :D [14:01:52] hashar: want to do the swat? (I can swat, just asking) :) [14:02:25] (03CR) 10Rush: [C: 032] Revert "Revert "cloud: setup for attended upgrade process"" [puppet] - 10https://gerrit.wikimedia.org/r/398259 (owner: 10Arturo Borrero Gonzalez) [14:02:49] one of the change failed bah [14:03:22] 10Operations, 10Discovery-Search (Current work), 10Goal, 10Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3837473 (10Gehel) The sum of all shard states does make some kind of sense: the some of all states should give the total number of... [14:03:23] hashar: you are doing the SWAT? [14:03:59] zeljkof: yeah I can do it [14:04:07] one of the jenkins job fail due to some unrelated oddity [14:04:08] hashar: great! :D [14:04:09] was fixing it [14:05:02] 10Operations, 10Trending-Service, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3837485 (10mobrovac) [14:05:30] stephanebisson: and I guess the issue has been verified to be fixed on beta / locally isn't it ? :) [14:06:16] hashar: Which change failed? 
[14:06:22] 10Operations, 10Trending-Service, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3755911 (10mobrovac) The public end point has been removed from RESTBase, so the service is no longer reachable. The actu... [14:06:25] Pl217: one of them not sure [14:06:40] Pl217: the jenkins job failed due to some .git/config.lock which was still around [14:07:14] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add Prometheus exporter for RabbitMQ [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/398003 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [14:07:26] hashar: one of them (https://gerrit.wikimedia.org/r/#/c/398242/) has been tested in beta and the other one (https://gerrit.wikimedia.org/r/#/c/398239/) prevents a race condition which is quite to trigger but has terrible consequences when it happens, hard to test on beta [14:07:47] * hard to trigger [14:08:12] stephanebisson: sounds good to me :] [14:08:43] hashar: .git/config.lock sounds completely unrelated to the changes [14:08:57] Pl217: it is yes [14:09:04] Pl217: I mean "it is unrelated to the change" [14:09:13] one of the previous has left some lock file behind [14:09:57] hashar: That's how usually locks get stuck behind :) [14:12:35] stephanebisson: ok pulling the patches on mwdebug1001 [14:13:19] which is taking a while bah [14:15:27] stephanebisson: should be ready to test [14:15:52] testing... [14:19:30] hashar: lgtm [14:20:53] stephanebisson: deploying both changes :] [14:21:56] !log hashar@tin Synchronized php-1.31.0-wmf.12/resources/src/mediawiki.rcfilters: Swat for RCFilters https://gerrit.wikimedia.org/r/#/c/398242/ https://gerrit.wikimedia.org/r/#/c/398239/ (duration: 01m 09s) [14:22:04] done [14:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:15] hashar: all good, thanks! 
[14:34:51] (03PS1) 10Elukey: eventlogging_cleaner.py: add backticks to all the mysql fields to purge [puppet] - 10https://gerrit.wikimedia.org/r/398264 [14:35:34] jouncebot: next [14:35:34] In 1 hour(s) and 24 minute(s): Wikimania scholarships app deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171214T1600) [14:39:50] (03CR) 10Mforns: [C: 031] "LGTM!!" [puppet] - 10https://gerrit.wikimedia.org/r/398264 (owner: 10Elukey) [14:40:22] (03PS3) 10Hashar: Add .gitreview [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/397747 [14:40:24] (03PS2) 10Hashar: tests: migrate from nose to pytest [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/398136 [14:56:49] (03PS1) 10Hashar: Tag 'latest' during build instead of at publishing [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/398265 [14:57:08] (03PS2) 10Elukey: eventlogging_cleaner.py: add backticks to all the mysql fields to purge [puppet] - 10https://gerrit.wikimedia.org/r/398264 [14:57:25] (03PS1) 10Rush: openstack: nova::common dependency handled higher up [puppet] - 10https://gerrit.wikimedia.org/r/398266 (https://phabricator.wikimedia.org/T171494) [14:57:40] (03CR) 10Elukey: [C: 032] eventlogging_cleaner.py: add backticks to all the mysql fields to purge [puppet] - 10https://gerrit.wikimedia.org/r/398264 (owner: 10Elukey) [14:59:13] (03PS2) 10Rush: openstack: nova::common dependency handled higher up [puppet] - 10https://gerrit.wikimedia.org/r/398266 (https://phabricator.wikimedia.org/T171494) [15:01:05] (03CR) 10Rush: [C: 032] openstack: nova::common dependency handled higher up [puppet] - 10https://gerrit.wikimedia.org/r/398266 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [15:01:48] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398269 [15:03:38] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398269 
(owner: 10Marostegui) [15:05:10] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398269 (owner: 10Marostegui) [15:06:56] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398269 (owner: 10Marostegui) [15:07:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1089 - T174569 (duration: 01m 08s) [15:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:24] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [15:08:03] 10Operations, 10Commons, 10Multimedia, 10media-storage: Generate a list of files that are supposed to exist but 404s - https://phabricator.wikimedia.org/T182822#3837657 (10fgiunchedi) The images in the category page linked above are from a single user and a single bot, do you have more examples? As to fin... [15:08:20] (03PS5) 10Alexandros Kosiaris: postgresql::user: Allow password to be undefined [puppet] - 10https://gerrit.wikimedia.org/r/392437 [15:08:23] (03CR) 10Alexandros Kosiaris: [C: 032] postgresql::user: Allow password to be undefined [puppet] - 10https://gerrit.wikimedia.org/r/392437 (owner: 10Alexandros Kosiaris) [15:09:28] (03PS6) 10Alexandros Kosiaris: Add postgresql::prometheus class [puppet] - 10https://gerrit.wikimedia.org/r/392438 (https://phabricator.wikimedia.org/T177196) [15:13:45] 10Operations, 10Puppet, 10Epic, 10Need-volunteer, 10Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#3837682 (10jcrespo) [15:16:29] jouncebot: next [15:16:29] In 0 hour(s) and 43 minute(s): Wikimania scholarships app deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171214T1600) [15:21:26] (03CR) 10Alexandros Kosiaris: [C: 032] Add postgresql::prometheus class [puppet] - 10https://gerrit.wikimedia.org/r/392438 
(https://phabricator.wikimedia.org/T177196) (owner: 10Alexandros Kosiaris) [15:25:05] !log stop, upgrade and restart labsdb1011 [15:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:37] (03CR) 10Alexandros Kosiaris: "The other interesting thing we could do is actually populate the docker group with the users we want. Let me have a quick check if this is" [puppet] - 10https://gerrit.wikimedia.org/r/398240 (https://phabricator.wikimedia.org/T182860) (owner: 10Hashar) [15:27:47] dbproxy is me [15:27:50] see last log [15:28:36] PROBLEM - haproxy failover on dbproxy1011 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [15:29:43] (03PS1) 10Muehlenhoff: Add .gitreview file [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/398271 [15:30:34] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add .gitreview file [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/398271 (owner: 10Muehlenhoff) [15:33:53] (03PS1) 10Muehlenhoff: Add Prometheus exporter for Blazegraph [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/398272 (https://phabricator.wikimedia.org/T182857) [15:38:33] (03CR) 10Hashar: "If one manage to add us to a docker group, that would save us from having to use sudo. 
It is probably easier than having to remember 'sudo" [puppet] - 10https://gerrit.wikimedia.org/r/398240 (https://phabricator.wikimedia.org/T182860) (owner: 10Hashar) [15:38:59] (03PS1) 10Alexandros Kosiaris: WIP: Populate the docker group in admin module [puppet] - 10https://gerrit.wikimedia.org/r/398276 (https://phabricator.wikimedia.org/T182860) [15:40:14] (03PS1) 10Muehlenhoff: Add Debianisation for prometheus-blazegraph-exporter [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/398277 [15:42:35] (03PS2) 10Volans: base: fix dependency relationship [puppet] - 10https://gerrit.wikimedia.org/r/398244 (https://phabricator.wikimedia.org/T182702) [15:42:37] (03PS1) 10Volans: wmf-auto-reimage: generate Puppet cert if needed [puppet] - 10https://gerrit.wikimedia.org/r/398279 (https://phabricator.wikimedia.org/T182702) [15:45:14] (03PS1) 10Muehlenhoff: Add Blazegraph exporter to WDQS hosts [puppet] - 10https://gerrit.wikimedia.org/r/398280 [15:48:51] !log Deploy schema change on dbstore1002 (s1) - T174569 [15:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:01] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [16:00:04] Niharika and bd808: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikimania scholarships app deploy . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171214T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:25] o/ [16:02:29] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3837829 (10StevenJ81) As LangCom clerk, I think before this goes much further I should step in and make a couple of things clear. # mo.wi... 
[16:04:38] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 854.36 seconds [16:07:40] !log Scholarships: updated database schema with 20171212-add-scholarship-orgs-field.sql (T181072) [16:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:51] T181072: Updates to scholarship application form for Wikimania 2018 - https://phabricator.wikimedia.org/T181072 [16:08:05] (03CR) 10Volans: [C: 032] base: fix dependency relationship [puppet] - 10https://gerrit.wikimedia.org/r/398244 (https://phabricator.wikimedia.org/T182702) (owner: 10Volans) [16:09:29] 10Operations, 10Puppet: Trusty puppet 4 approach - https://phabricator.wikimedia.org/T182894#3837838 (10herron) p:05Triage>03Normal [16:10:20] (03PS1) 10Volans: Revert "base: fix dependency relationship" [puppet] - 10https://gerrit.wikimedia.org/r/398281 [16:10:45] (03CR) 10Volans: [V: 032 C: 032] "Dependency cycle" [puppet] - 10https://gerrit.wikimedia.org/r/398281 (owner: 10Volans) [16:12:38] PROBLEM - puppet last run on labvirt1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:12:58] PROBLEM - puppet last run on db2028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:13:13] this is me, already fixed [16:13:28] PROBLEM - puppet last run on labnet1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:13:29] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:13:58] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:14:00] volans: is that a weird dep loop for tshark? 
[16:14:04] 10Operations, 10Puppet: Trusty puppet 4 approach - https://phabricator.wikimedia.org/T182894#3837838 (10faidon) Option (1) is not really an option -- as I said in the other task, those packages are horrible, a bigger maintenance burden, and they are also very different from everything else we use. Between (2)... [16:14:08] PROBLEM - puppet last run on db1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:14:17] yeah, already merged the fix, running cumin to make it go away [16:14:18] PROBLEM - puppet last run on labtestservices2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:14:18] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:14:18] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:14:21] (03PS1) 10Elukey: profile::hadoop::prometheus_jmx_exporter: blacklist unwanted Mbeans [puppet] - 10https://gerrit.wikimedia.org/r/398282 (https://phabricator.wikimedia.org/T177458) [16:15:54] puppet compiler was happy :( [16:15:59] RECOVERY - puppet last run on lvs1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:16:58] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:17:32] bd808: I'm here now. 
[16:17:38] RECOVERY - puppet last run on labvirt1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:17:58] RECOVERY - puppet last run on db2028 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:18:28] RECOVERY - puppet last run on labnet1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:18:28] RECOVERY - puppet last run on labvirt1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:18:51] Niharika: the db should be updated, so you can run scap in tin:/srv/deployment/scholarships/scholarships when you are ready [16:18:58] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:19:08] RECOVERY - puppet last run on db1030 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:19:18] RECOVERY - puppet last run on labtestservices2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:19:18] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:19:18] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:19:38] Niharika: we should update https://wikitech.wikimedia.org/wiki/Scholarships.wikimedia.org#How_is_it_deployed.3F for the scap migration too [16:20:31] bd808: It has some local modifications to scap/checks.yaml. Should I stash & apply them after pulling the latest? [16:21:51] Niharika: hmm... looks like the check string was quoted. Maybe we should make a patch for that and merge it? [16:21:51] That should be submitted through gerrit though. [16:21:54] Yeah. [16:21:58] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:22:02] Doing that. 
[16:22:16] cool [16:24:31] Oh looks like it's already changed in master. [16:25:12] !log niharika29@tin Started deploy [scholarships/scholarships@872381d]: Deploy wikimania scholarships app for 2018 T181072 [16:25:14] !log niharika29@tin Finished deploy [scholarships/scholarships@872381d]: Deploy wikimania scholarships app for 2018 T181072 (duration: 00m 02s) [16:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:22] T181072: Updates to scholarship application form for Wikimania 2018 - https://phabricator.wikimedia.org/T181072 [16:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:03] bd808: Done that. How do I open it up for applications? [16:26:27] Oh, the env. [16:26:46] Is that date in Puppet? [16:28:33] bd808: It's in the .env in my local. But there is no explicit .env in prod. [16:29:23] the .env in prod is in /etc/wikimania-scholarships.ini, but there is no key there for the date [16:30:35] 10Operations, 10ops-eqiad, 10DBA: Rack and setup db1113 and db1114 - https://phabricator.wikimedia.org/T182896#3837888 (10Marostegui) p:05Triage>03Normal [16:31:16] Niharika: didn't we move that into the database at some point? [16:31:49] Niharika: yes! it is in https://scholarships.wikimedia.org/admin/settings [16:31:56] 10Operations, 10Mail, 10fundraising-tech-ops: Forward katherine@wikipedia.org and jimmy@wikipedia.org emails to katherine@wikimedia.org and jimmy@wikimedia.org, respectively - https://phabricator.wikimedia.org/T182456#3837912 (10faidon) 05Open>03Resolved a:03faidon Done! [16:32:00] what are the dates? I can set them [16:32:49] bd808: We can open it right away. End date is Jan 22, 2018, 23:59 UTC. 
[16:33:14] (03PS3) 10Elukey: Replace kafka1018 with kafka1023 in the analytics kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/398255 (https://phabricator.wikimedia.org/T181518) [16:34:36] !log Scholarships: Set application start and close dates via web UI (T181072) [16:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:46] T181072: Updates to scholarship application form for Wikimania 2018 - https://phabricator.wikimedia.org/T181072 [16:34:53] Niharika: take a look. I think I got it right [16:35:47] bd808: Looks good. I'll create a mock application. [16:36:40] RECOVERY - haproxy failover on dbproxy1011 is OK: OK check_failover servers up 2 down 0 [16:39:07] 10Operations, 10Mail, 10fundraising-tech-ops: Forward katherine@wikipedia.org and jimmy@wikipedia.org emails to katherine@wikimedia.org and jimmy@wikimedia.org, respectively - https://phabricator.wikimedia.org/T182456#3837926 (10CCogdill_WMF) Thank you! [16:40:49] bd808: Looks good to me. I don't think we have a way of getting rid of the dummy application besides getting rid of it in the DB, do we? [16:41:12] nope. But I can do that [16:41:32] Cool, thanks! [16:42:55] Niharika: all cleaned up. I think we're done [16:44:12] (03PS8) 10Rush: apt: unattended-upgrades: add reporter script [puppet] - 10https://gerrit.wikimedia.org/r/394572 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [16:44:57] (03CR) 10Rush: [C: 031] apt: unattended-upgrades: add reporter script [puppet] - 10https://gerrit.wikimedia.org/r/394572 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [16:45:30] \o/ [16:45:36] I told Ellie. 
[16:46:49] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3837929 (10Reedy) All projects are dumped at https://dumps.wikimedia.org/backup-index.html [16:48:00] (03CR) 10Arturo Borrero Gonzalez: [C: 032] apt: unattended-upgrades: add reporter script [puppet] - 10https://gerrit.wikimedia.org/r/394572 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [16:48:00] PROBLEM - puppet last run on lvs1007 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 6 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[ethtool eth2 -K lro off],Exec[txqueuelen-eth2],Exec[ethtool eth3 -K lro off],Exec[txqueuelen-eth3] [16:56:06] (03CR) 10Herron: [C: 031] "Looks like this will fix the current assumption that on jessie/stretch the puppet service would automatically have started at boot." [puppet] - 10https://gerrit.wikimedia.org/r/398279 (https://phabricator.wikimedia.org/T182702) (owner: 10Volans) [16:57:29] (03CR) 10Marostegui: "> Looks like this will fix the current assumption that on" [puppet] - 10https://gerrit.wikimedia.org/r/398279 (https://phabricator.wikimedia.org/T182702) (owner: 10Volans) [16:58:17] marostegui: not on purpose, but just the other day we had to reimage a trusty into trusty again [16:58:29] oh wow.. [17:00:04] godog, moritzm, and _joe_: Your horoscope predicts another unfortunate Puppet SWAT(Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171214T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. 
[17:03:25] \o/ [17:07:24] yeah in general so long as we have existing trusties, we have to have the capability to reimage them as trusty [17:07:42] not all reimaging scenarios are going to allow for the blockage to re-engineer them as jessie or stretch [17:08:33] yep [17:08:59] (03PS1) 10Mobrovac: Trending Edits: Stop and mask the service [puppet] - 10https://gerrit.wikimedia.org/r/398286 (https://phabricator.wikimedia.org/T180384) [17:11:06] (03PS4) 10Elukey: Replace kafka1018 with kafka1023 in the analytics kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/398255 (https://phabricator.wikimedia.org/T181518) [17:12:31] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3837981 (10StevenJ81) Thanks for that. [17:15:05] that's our situation currently yes, open questions on next distro but problems to solve before that one [17:21:35] (03PS1) 10Andrew Bogott: labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/398290 (https://phabricator.wikimedia.org/T181375) [17:21:40] 10Operations, 10Goal, 10User-fgiunchedi: Port nutcracker statistics to Prometheus - https://phabricator.wikimedia.org/T181995#3838023 (10fgiunchedi) a:03fgiunchedi I tried to package https://github.com/xavierholt/twemproxy_exporter and it would work in buster/sid but not stretch/jessie because these packag... 
[17:22:03] (03CR) 10jerkins-bot: [V: 04-1] labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/398290 (https://phabricator.wikimedia.org/T181375) (owner: 10Andrew Bogott) [17:22:37] (03Abandoned) 10Elukey: Replace kafka1018 with kafka1023 in the analytics kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/398255 (https://phabricator.wikimedia.org/T181518) (owner: 10Elukey) [17:23:32] (03PS2) 10Andrew Bogott: labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/398290 (https://phabricator.wikimedia.org/T181375) [17:23:59] (03CR) 10jerkins-bot: [V: 04-1] labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/398290 (https://phabricator.wikimedia.org/T181375) (owner: 10Andrew Bogott) [17:27:15] (03PS1) 10Elukey: Establish ipsec session between kafka1023 and the cp nodes [puppet] - 10https://gerrit.wikimedia.org/r/398292 (https://phabricator.wikimedia.org/T181518) [17:27:44] (03CR) 10jerkins-bot: [V: 04-1] Establish ipsec session between kafka1023 and the cp nodes [puppet] - 10https://gerrit.wikimedia.org/r/398292 (https://phabricator.wikimedia.org/T181518) (owner: 10Elukey) [17:28:03] (03PS3) 10Andrew Bogott: labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/398290 (https://phabricator.wikimedia.org/T181375) [17:28:35] (03CR) 10jerkins-bot: [V: 04-1] labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/398290 (https://phabricator.wikimedia.org/T181375) (owner: 10Andrew Bogott) [17:29:03] that multi-role style thing is getting annoying :P [17:29:21] (03CR) 10BBlack: [C: 031] Establish ipsec session between kafka1023 and the cp nodes [puppet] - 10https://gerrit.wikimedia.org/r/398292 (https://phabricator.wikimedia.org/T181518) (owner: 10Elukey) [17:29:32] thanks bblack :) [17:29:48] (03CR) 
10jerkins-bot: [V: 04-1] Establish ipsec session between kafka1023 and the cp nodes [puppet] - 10https://gerrit.wikimedia.org/r/398292 (https://phabricator.wikimedia.org/T181518) (owner: 10Elukey) [17:29:51] merging and then going back to the other bit of the change [17:30:16] lol, did it auto-retest when I removed the -1? :) [17:30:40] fascinating [17:30:49] (03CR) 10Elukey: [V: 032 C: 032] Establish ipsec session between kafka1023 and the cp nodes [puppet] - 10https://gerrit.wikimedia.org/r/398292 (https://phabricator.wikimedia.org/T181518) (owner: 10Elukey) [17:30:53] yes wonderful :D [17:32:36] (03PS4) 10Andrew Bogott: labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/398290 (https://phabricator.wikimedia.org/T181375) [17:34:13] (03PS5) 10Andrew Bogott: labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/398290 (https://phabricator.wikimedia.org/T181375) [17:35:37] 10Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Allow contint-admins to interact with docker on CI hosts - https://phabricator.wikimedia.org/T182860#3836956 (10RobH) As this increases sudo permissions for a... 
[17:38:17] (03PS5) 10Ema: varnishxcps.mtail: use prometheus labels [puppet] - 10https://gerrit.wikimedia.org/r/397876 (https://phabricator.wikimedia.org/T177199) [17:42:22] (03PS5) 10Ema: tox: run mtail tests [puppet] - 10https://gerrit.wikimedia.org/r/394552 (https://phabricator.wikimedia.org/T181794) (owner: 10Filippo Giunchedi) [17:42:32] (03CR) 10Ema: [V: 032 C: 032] tox: run mtail tests [puppet] - 10https://gerrit.wikimedia.org/r/394552 (https://phabricator.wikimedia.org/T181794) (owner: 10Filippo Giunchedi) [17:43:18] (03PS1) 10Elukey: Add interface::add_ip6_mapped to kafka1023 [puppet] - 10https://gerrit.wikimedia.org/r/398295 (https://phabricator.wikimedia.org/T181518) [17:43:46] (03CR) 10jerkins-bot: [V: 04-1] Add interface::add_ip6_mapped to kafka1023 [puppet] - 10https://gerrit.wikimedia.org/r/398295 (https://phabricator.wikimedia.org/T181518) (owner: 10Elukey) [17:43:59] (03PS6) 10Ema: varnishxcps.mtail: use prometheus labels [puppet] - 10https://gerrit.wikimedia.org/r/397876 (https://phabricator.wikimedia.org/T177199) [17:44:44] (03CR) 10BBlack: [C: 031] Add interface::add_ip6_mapped to kafka1023 [puppet] - 10https://gerrit.wikimedia.org/r/398295 (https://phabricator.wikimedia.org/T181518) (owner: 10Elukey) [17:44:51] (03CR) 10Elukey: [V: 032 C: 032] Add interface::add_ip6_mapped to kafka1023 [puppet] - 10https://gerrit.wikimedia.org/r/398295 (https://phabricator.wikimedia.org/T181518) (owner: 10Elukey) [17:45:00] (03PS2) 10Elukey: Add interface::add_ip6_mapped to kafka1023 [puppet] - 10https://gerrit.wikimedia.org/r/398295 (https://phabricator.wikimedia.org/T181518) [17:45:15] (03CR) 10Elukey: [V: 032 C: 032] Add interface::add_ip6_mapped to kafka1023 [puppet] - 10https://gerrit.wikimedia.org/r/398295 (https://phabricator.wikimedia.org/T181518) (owner: 10Elukey) [17:46:07] (03CR) 10Filippo Giunchedi: [C: 031] varnishxcps.mtail: use prometheus labels [puppet] - 10https://gerrit.wikimedia.org/r/397876 (https://phabricator.wikimedia.org/T177199) (owner: 
10Ema) [17:46:14] ema: good to merge? [17:46:20] elukey: yes! [17:46:49] (03PS7) 10Ema: varnishxcps.mtail: use prometheus labels [puppet] - 10https://gerrit.wikimedia.org/r/397876 (https://phabricator.wikimedia.org/T177199) [17:46:57] (03CR) 10Ema: [V: 032 C: 032] varnishxcps.mtail: use prometheus labels [puppet] - 10https://gerrit.wikimedia.org/r/397876 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [17:53:03] (03PS2) 10Gehel: Add Prometheus exporter for Blazegraph [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/398272 (https://phabricator.wikimedia.org/T182857) (owner: 10Muehlenhoff) [17:56:42] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3838168 (10EddieGP) a:05EddieGP>03None [17:59:30] (03PS1) 10Elukey: Allow ipsec in iptables/ferm rules for kafka1023 [puppet] - 10https://gerrit.wikimedia.org/r/398299 (https://phabricator.wikimedia.org/T181518) [17:59:40] Jenkins you are not going to like it [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / Parsoid / Citoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171214T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:08] (03CR) 10jerkins-bot: [V: 04-1] Allow ipsec in iptables/ferm rules for kafka1023 [puppet] - 10https://gerrit.wikimedia.org/r/398299 (https://phabricator.wikimedia.org/T181518) (owner: 10Elukey) [18:00:17] ORES is going ahead with a deployment. 
[18:00:20] arlwill deploy 12:30ish [18:00:22] \o/ [18:00:26] *arlolra [18:01:01] (03CR) 10Elukey: [V: 032 C: 032] Allow ipsec in iptables/ferm rules for kafka1023 [puppet] - 10https://gerrit.wikimedia.org/r/398299 (https://phabricator.wikimedia.org/T181518) (owner: 10Elukey) [18:01:02] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:01:02] I <3 whoever maintains jouncebot's humor. [18:01:12] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:01:13] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:01:23] PROBLEM - puppet last run on dysprosium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:02:00] puppetdb issues, checking nitrogen [18:02:23] restarted 4min 32s ago [18:02:38] OOM in the dmesg [18:02:41] sigh [18:02:42] PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:02:52] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:02:54] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Port exim statistics to Prometheus - https://phabricator.wikimedia.org/T179565#3838201 (10fgiunchedi) [18:02:54] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Have jenkins run mtail tests via tox/nose - https://phabricator.wikimedia.org/T181794#3838198 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi All done! nose-based mtail tests are being run by tox now. [18:03:12] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:03:22] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:03:42] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:03:52] PROBLEM - puppet last run on mw1187 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:03:52] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:04:02] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:04:05] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:04:42] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:05:12] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:06:42] RECOVERY - IPsec on cp4031 is OK: Strongswan OK - 44 ESP OK [18:06:42] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 28 ESP OK [18:06:42] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 56 ESP OK [18:06:42] RECOVERY - IPsec on cp3031 is OK: Strongswan OK - 44 ESP OK [18:06:42] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 44 ESP OK [18:06:42] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 54 ESP OK [18:06:42] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK [18:06:43] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 54 ESP OK [18:06:43] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 44 ESP OK [18:06:44] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 44 ESP OK [18:06:44] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 44 ESP OK [18:06:45] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 54 ESP OK [18:07:03] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 54 ESP OK [18:07:12] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 26 ESP OK [18:07:12] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 70 ESP OK [18:07:12] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 70 ESP OK [18:07:13] RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 26 ESP OK [18:07:13] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 54 ESP OK [18:07:13] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 54 ESP OK [18:07:13] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 28 ESP OK [18:07:13] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 28 ESP OK [18:07:13] RECOVERY - IPsec on cp4032 is OK: Strongswan OK - 44 ESP OK [18:07:14] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 70 ESP OK [18:07:14] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 70 ESP OK [18:07:15] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 70 ESP OK [18:07:32] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 54 ESP OK [18:07:32] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 54 ESP OK [18:08:50] \o/ [18:08:52] 
RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK [18:08:53] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 54 ESP OK [18:08:53] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 44 ESP OK [18:10:19] 10Operations, 10ORES, 10Patch-For-Review, 10Performance, 10Scoring-platform-team (Current): Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249#3838216 (10awight) Ran a tricky test, in which I stepped up from 1 to 3 test harnesses, then back down. * tester... [18:11:31] !log awight@tin Started deploy [ores/deploy@b67bba7]: Update ORES service to b67bba77acb [18:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:41] (03PS1) 10Elukey: Replace kafka1018 with kafka1023 in the Analytics Kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/398301 (https://phabricator.wikimedia.org/T181518) [18:16:36] (03CR) 10Ottomata: [C: 031] Replace kafka1018 with kafka1023 in the Analytics Kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/398301 (https://phabricator.wikimedia.org/T181518) (owner: 10Elukey) [18:17:23] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 56 ESP OK [18:17:23] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 54 ESP OK [18:22:32] (03PS2) 10Volans: wmf-auto-reimage: generate Puppet cert if needed [puppet] - 10https://gerrit.wikimedia.org/r/398279 (https://phabricator.wikimedia.org/T182702) [18:22:35] (03PS1) 10Volans: base: fix dependency relationship [puppet] - 10https://gerrit.wikimedia.org/r/398303 (https://phabricator.wikimedia.org/T182702) [18:22:35] take 2 [18:22:36] (03PS6) 10Andrew Bogott: labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/398290 (https://phabricator.wikimedia.org/T181375) [18:24:22] (03CR) 10Elukey: [C: 032] Replace kafka1018 with kafka1023 in the Analytics Kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/398301 (https://phabricator.wikimedia.org/T181518) (owner: 
10Elukey) [18:24:23] (03CR) 10Andrew Bogott: [C: 032] labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/398290 (https://phabricator.wikimedia.org/T181375) (owner: 10Andrew Bogott) [18:24:48] (03PS7) 10Andrew Bogott: labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/398290 (https://phabricator.wikimedia.org/T181375) [18:24:50] !log replace kafka1018 with kafka1023 (Analytics Kafka cluster) [18:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:52] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 223.91 seconds [18:28:08] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [18:28:19] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [18:28:38] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [18:28:49] RECOVERY - puppet last run on mw1187 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [18:28:49] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [18:28:58] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:28:58] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [18:29:38] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:30:08] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:30:58] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures 
[18:31:08] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:31:08] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:31:28] RECOVERY - puppet last run on dysprosium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:32:31] !log arlolra@tin Started deploy [parsoid/deploy@13b5cb5]: (no justification provided) [18:32:38] RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:49] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:35:51] 10Operations, 10ORES, 10Patch-For-Review, 10Performance, 10Scoring-platform-team (Current): Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249#3838258 (10Halfak) Based on this report, I think we should go live with this. Any follow-up stress testing can... [18:40:14] (03PS1) 10Urbanecm: Define new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398307 (https://phabricator.wikimedia.org/T182889) [18:41:14] (03CR) 10jerkins-bot: [V: 04-1] Define new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398307 (https://phabricator.wikimedia.org/T182889) (owner: 10Urbanecm) [18:41:50] !log arlolra@tin Finished deploy [parsoid/deploy@13b5cb5]: (no justification provided) (duration: 09m 19s) [18:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:15] (03CR) 10Thcipriani: [C: 04-1] "I hope the idea of populating the docker group works. If not we'll need to add an additional path here, I think." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398240 (https://phabricator.wikimedia.org/T182860) (owner: 10Hashar) [18:46:55] 10Operations, 10ORES, 10Patch-For-Review, 10Performance, 10Scoring-platform-team (Current): Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249#3838288 (10awight) I'm happy with that. It looks like it's going to be difficult to break through this ceiling,... [18:47:08] (03PS2) 10Urbanecm: Define new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398307 (https://phabricator.wikimedia.org/T182889) [18:47:58] PROBLEM - Disk space on furud is CRITICAL: DISK CRITICAL - free space: /mnt/1a 1308891 MB (3% inode=96%) [18:48:33] !log bsitzmann@tin Started deploy [mobileapps/deploy@bf85a55]: Update mobileapps to ff74bb1 (T182868 T182774) [18:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:45] T182774: Improve section validation - https://phabricator.wikimedia.org/T182774 [18:48:45] T182868: On This Day endpoint returning empty list on frwiki - https://phabricator.wikimedia.org/T182868 [18:48:58] RECOVERY - Disk space on furud is OK: DISK OK [18:51:50] !log Updated Parsoid to ca20680 (T182793, T182774) [18:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:02] T182793: Crasher in logs in section wrapping - https://phabricator.wikimedia.org/T182793 [18:54:34] !log awight@tin Started restart [ores/deploy@b67bba7]: Restart ORES services [18:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:10] (03PS3) 10Zoranzoki21: Define new throttle rule and cleaning expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398307 (https://phabricator.wikimedia.org/T182889) (owner: 10Urbanecm) [18:59:38] (03PS1) 10Ema: vcl: add hash function name to CHACHA20-POLY1305 cipher [puppet] - 10https://gerrit.wikimedia.org/r/398311 [19:00:04] addshore, hashar, anomie, 
no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Morning SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171214T1900). [19:00:05] subbu, Deskana, and framawiki: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:05] (03CR) 10jerkins-bot: [V: 04-1] vcl: add hash function name to CHACHA20-POLY1305 cipher [puppet] - 10https://gerrit.wikimedia.org/r/398311 (owner: 10Ema) [19:00:12] * subbu is around [19:00:59] !log bsitzmann@tin Finished deploy [mobileapps/deploy@bf85a55]: Update mobileapps to ff74bb1 (T182868 T182774) (duration: 12m 26s) [19:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:11] T182774: Improve section validation - https://phabricator.wikimedia.org/T182774 [19:01:12] T182868: On This Day endpoint returning empty list on frwiki - https://phabricator.wikimedia.org/T182868 [19:01:14] o/ [19:01:47] !log awight@tin Started deploy [ores/deploy@b67bba7]: Redeploy ORES to scb1001 [19:01:58] (03PS2) 10Ema: vcl: add hash function name to CHACHA20-POLY1305 cipher [puppet] - 10https://gerrit.wikimedia.org/r/398311 [19:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:44] (03PS1) 10Rush: openstack: nova/compute/server.pp manage nova shell [puppet] - 10https://gerrit.wikimedia.org/r/398312 [19:02:51] !log awight@tin Finished deploy [ores/deploy@b67bba7]: Redeploy ORES to scb1001 (duration: 01m 04s) [19:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:13] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova/compute/server.pp manage nova shell [puppet] - 10https://gerrit.wikimedia.org/r/398312 (owner: 10Rush) [19:04:29] (03PS2) 
10Rush: openstack: nova/compute/server.pp manage nova shell [puppet] - 10https://gerrit.wikimedia.org/r/398312 [19:04:44] (03PS3) 10Rush: openstack: nova/compute/server.pp manage nova shell [puppet] - 10https://gerrit.wikimedia.org/r/398312 [19:05:07] any swatters around? [19:05:07] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova/compute/server.pp manage nova shell [puppet] - 10https://gerrit.wikimedia.org/r/398312 (owner: 10Rush) [19:05:49] I can SWAT [19:06:13] (03PS4) 10Rush: openstack: nova/compute/server.pp manage nova shell [puppet] - 10https://gerrit.wikimedia.org/r/398312 [19:06:34] (03PS5) 10Rush: openstack: nova/compute/server.pp manage nova shell [puppet] - 10https://gerrit.wikimedia.org/r/398312 [19:07:23] subbu: do you need this for both wmf.11 and 12 (https://tools.wmflabs.org/versions/) ? [19:07:38] only this week's train that went out .. don't know what that is. [19:08:03] ok, wmf.12 went out on tuesday and should be everywhere this afternoon, I can backport it to that. [19:08:21] so wmf.12 yes. [19:08:31] (03CR) 10Andrew Bogott: [C: 031] openstack: nova/compute/server.pp manage nova shell [puppet] - 10https://gerrit.wikimedia.org/r/398312 (owner: 10Rush) [19:10:28] Deskana: ping for SWAT if you're around [19:10:37] thcipriani: I am! [19:10:51] (03PS1) 10Ema: vcl: remove X-CP-Full-Cipher [puppet] - 10https://gerrit.wikimedia.org/r/398314 [19:11:27] Deskana: okie doke assuming your -1 has been addressed here: https://gerrit.wikimedia.org/r/#/c/393121/ (safe assumption?) [19:11:45] thcipriani: Yes. 
[19:12:02] (There's a grumpy comment by me on Phab that explains why a patch that I gave a -1 to is being deployed) [19:12:11] (03CR) 10Deskana: Remove single editor tab for plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393121 (https://phabricator.wikimedia.org/T181045) (owner: 10TerraCodes) [19:12:18] (03PS17) 10Thcipriani: Remove single editor tab for plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393121 (https://phabricator.wikimedia.org/T181045) (owner: 10TerraCodes) [19:12:20] thcipriani: I removed the -1 for clarity. [19:12:31] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393121 (https://phabricator.wikimedia.org/T181045) (owner: 10TerraCodes) [19:12:32] thanks :) [19:13:03] * thcipriani crossing I's dotting T's [19:13:27] * subbu hopes not :) [19:13:35] :P [19:13:48] This patch will be deployed too in this swat? https://gerrit.wikimedia.org/r/#/c/398307/ [19:13:55] (03Merged) 10jenkins-bot: Remove single editor tab for plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393121 (https://phabricator.wikimedia.org/T181045) (owner: 10TerraCodes) [19:14:11] Zoranzoki21: yep, on my list :) [19:14:29] thcipriani: Ok. Thank you [19:15:16] Deskana: your change is live on mwdebug1002, check please [19:16:03] 10Operations, 10hardware-requests: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#3838336 (10RobH) [19:16:15] thcipriani: It works. Thanks. [19:16:17] subbu: since it looks like your change requires a full scap I'm going to put it at the end of the window, sorry for the inconvenience [19:16:21] Deskana: ok, going live [19:16:27] thcipriani, np [19:16:42] 10Operations, 10hardware-requests: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#2087114 (10RobH) The last two systems are labsdb100[13]. Current roadmap off T142807 is to have them decommissioned in January 2018. 
[19:17:03] (03CR) 10jenkins-bot: Remove single editor tab for plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393121 (https://phabricator.wikimedia.org/T181045) (owner: 10TerraCodes) [19:20:23] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:393121|Remove single editor tab for plwiki]] T181045 (duration: 01m 09s) [19:20:32] ^ Deskana live everywhere now [19:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:33] T181045: Enable both "edit" and "edit code" tabs in Polish Wikipedia - https://phabricator.wikimedia.org/T181045 [19:20:38] thcipriani: THanks. [19:20:43] yw :) [19:21:11] (03PS3) 10Thcipriani: Set $wgNamespaceRobotPolicies for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393845 (https://phabricator.wikimedia.org/T181525) (owner: 10Framawiki) [19:21:19] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393845 (https://phabricator.wikimedia.org/T181525) (owner: 10Framawiki) [19:22:41] (03Merged) 10jenkins-bot: Set $wgNamespaceRobotPolicies for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393845 (https://phabricator.wikimedia.org/T181525) (owner: 10Framawiki) [19:22:51] (03CR) 10jenkins-bot: Set $wgNamespaceRobotPolicies for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393845 (https://phabricator.wikimedia.org/T181525) (owner: 10Framawiki) [19:23:35] framawiki: your change is live on mwdebug1002, check please [19:23:43] (03PS1) 10Awight: Tune down ORES worker counts on the stress testing cluster [puppet] - 10https://gerrit.wikimedia.org/r/398316 (https://phabricator.wikimedia.org/T182249) [19:26:49] thcipriani: it's good for me, thx [19:27:02] framawiki: cool, thanks for checking, going live [19:27:22] 10Operations, 10Ops-Access-Requests: Requesting access to analytics-privatedata-users group for Jonas Kress - https://phabricator.wikimedia.org/T182908#3838373 (10Jonas) [19:28:00] 
10Operations, 10ops-ulsfo, 10Traffic: decom cp40(09|1[078]) - https://phabricator.wikimedia.org/T178815#3838385 (10RobH) a:05RobH>03None [19:28:07] 10Operations, 10ops-ulsfo, 10Traffic, 10hardware-requests, 10Patch-For-Review: Decom cp4005-8,13-16 (8 nodes) - https://phabricator.wikimedia.org/T176366#3838386 (10RobH) a:05RobH>03None [19:28:12] 10Operations, 10ops-ulsfo, 10hardware-requests: Decommission cp4011, cp4012, cp4019, cp4020 - https://phabricator.wikimedia.org/T167377#3838387 (10RobH) a:05RobH>03None [19:28:17] 10Operations, 10ops-ulsfo, 10hardware-requests: Decommission cp400[1-4] - https://phabricator.wikimedia.org/T169020#3838389 (10RobH) a:05RobH>03None [19:29:07] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:393845|Set $wgNamespaceRobotPolicies for wikidata]] T181525 (duration: 01m 04s) [19:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:17] T181525: Disable robot indexing for user pages on Wikidata - https://phabricator.wikimedia.org/T181525 [19:29:17] ^ framawiki live now [19:29:44] 10Operations, 10ops-eqsin: rack/setup scs-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T181569#3838395 (10RobH) The replacement scs has arrived at the office. I'll go into the office tomorrow (Friday) and program the new device, and then sync with Robert Miller to get it shipped to eqsin (with f... [19:30:02] (03PS3) 10Volans: wmflib: use string for parameter of package, not symbol [puppet] - 10https://gerrit.wikimedia.org/r/395695 (owner: 10Giuseppe Lavagetto) [19:30:03] thanks ! 
[19:33:21] (03PS4) 10Thcipriani: Define new throttle rule and cleaning expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398307 (https://phabricator.wikimedia.org/T182889) (owner: 10Urbanecm) [19:35:42] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398307 (https://phabricator.wikimedia.org/T182889) (owner: 10Urbanecm) [19:37:10] (03Merged) 10jenkins-bot: Define new throttle rule and cleaning expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398307 (https://phabricator.wikimedia.org/T182889) (owner: 10Urbanecm) [19:37:20] (03CR) 10jenkins-bot: Define new throttle rule and cleaning expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398307 (https://phabricator.wikimedia.org/T182889) (owner: 10Urbanecm) [19:39:15] ^ Urbanecm Zoranzoki21 unless there's anything you want to check on mwdebug1002, I will just go ahead and sync this change out [19:40:07] thcipriani: How to check throttle rule?? [19:40:39] thcipriani: you can sync it. Should be ok [19:40:48] right? Not sure. Nothing explodes so I'll go ahead and sync :) [19:41:39] thcipriani: ok.. You can deploy. Sorry for two "??" [19:41:50] no worries at all :) [19:43:21] !log thcipriani@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:398307|Define new throttle rule and cleaning expired rules]] T182889 (duration: 01m 08s) [19:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:32] T182889: Temporary Lift of IP Cap - https://phabricator.wikimedia.org/T182889 [19:43:36] ^ Urbanecm Zoranzoki21 all live [19:43:58] thcipriani: Thank you very much [19:44:04] yw! [19:48:50] !log thcipriani@tin Started scap: SWAT: [[gerrit:398313|Update en/i18n message for multiple-unclosed-formatting-tags]] [19:48:59] ^ subbu here we go! 
[19:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:22] (03PS2) 10Dzahn: Tune down ORES worker counts on the stress testing cluster [puppet] - 10https://gerrit.wikimedia.org/r/398316 (https://phabricator.wikimedia.org/T182249) (owner: 10Awight) [19:56:36] (03CR) 10Dzahn: [C: 032] Tune down ORES worker counts on the stress testing cluster [puppet] - 10https://gerrit.wikimedia.org/r/398316 (https://phabricator.wikimedia.org/T182249) (owner: 10Awight) [19:57:18] mutante: Thanks! [20:00:05] no_justification: Dear deployers, time to do the MediaWiki train deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171214T2000). [20:00:05] No GERRIT patches in the queue for this window AFAICS. [20:00:12] Oh go away, I don't wanna [20:00:23] no_justification: I'm still scap-ing from SWAT [20:00:31] I can ping you when done? [20:00:42] * no_justification goes and makes a sandwich instead [20:00:43] :) [20:01:07] awight: welcome! (i just keep dropping off IRC) [20:02:11] !log lowering cassandra compaction throughput to 5MB/s, restbase101{2,4}-{a,b,c} [20:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:37] (03PS3) 10Ema: vcl: add hash function name to CHACHA20-POLY1305 cipher [puppet] - 10https://gerrit.wikimedia.org/r/398311 [20:10:19] (03PS2) 10Ema: vcl: remove X-CP-Full-Cipher [puppet] - 10https://gerrit.wikimedia.org/r/398314 [20:19:46] !log thcipriani@tin Finished scap: SWAT: [[gerrit:398313|Update en/i18n message for multiple-unclosed-formatting-tags]] (duration: 30m 55s) [20:19:56] ^ subbu all done! [20:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:08] \o/ [20:20:12] no_justification: should be clear for train, sorry for the delay :( [20:20:20] https://www.mediawiki.org/wiki/Special:LintErrors looks fixed. thanks. 
[20:20:24] PROBLEM - Kafka Broker Replica Max Lag on kafka1023 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1 [20:20:36] * subbu wishes he had caught that message change before the train left the station [20:20:43] thcipriani: I'm still sandwiching [20:21:01] fair. I should probably sandwich myself. [20:21:15] !log awight@tin Started restart [ores/deploy@b67bba7]: (non-production) Restart ORES services on ores* [20:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:26] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.10 [keeping static files] (duration: 03m 32s) [20:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:24] (03PS4) 10Dzahn: mwlog/xenon: access should be based on role, not host names [puppet] - 10https://gerrit.wikimedia.org/r/393994 [20:33:39] !log stress testing ores* [20:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:56] (03CR) 10Dzahn: [C: 032] mwlog/xenon: access should be based on role, not host names [puppet] - 10https://gerrit.wikimedia.org/r/393994 (owner: 10Dzahn) [20:37:31] (03CR) 10Dzahn: "no-op on mwlog1001/mwlog2001" [puppet] - 10https://gerrit.wikimedia.org/r/393994 (owner: 10Dzahn) [20:39:10] 10Operations, 10MediaWiki-Platform-Team, 10TechCom-RfC, 10HHVM, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#3838529 (10daniel) Due to holiday season, the Last Call for this proposal will end on January 10th. If no pertinent issues remain unaddressed by that... 
[20:42:21] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3838537 (10thcipriani) [20:42:23] 10Operations, 10Packaging, 10Scap: SCAP: Upload debian package version 3.7.4-1 - https://phabricator.wikimedia.org/T182347#3838535 (10thcipriani) 05Resolved>03Open I'm unclear what happened here, but we're missing an important configuration flag `scap3_mediawiki` from the current release on tin: https:/... [20:42:28] 10Operations, 10Reading List Service, 10Reading-Infrastructure-Team-Backlog, 10Traffic, 10Patch-For-Review: PUT blocked by Varnish - https://phabricator.wikimedia.org/T182825#3838538 (10Tgr) 05Open>03Resolved a:03Tgr [20:45:02] (03PS1) 10Andrew Bogott: wmcs puppet: Support instance agents using the 'puppet' master hostname [puppet] - 10https://gerrit.wikimedia.org/r/398323 [20:45:27] (03CR) 10jerkins-bot: [V: 04-1] wmcs puppet: Support instance agents using the 'puppet' master hostname [puppet] - 10https://gerrit.wikimedia.org/r/398323 (owner: 10Andrew Bogott) [20:47:48] 10Operations, 10Packaging, 10Scap: SCAP: Upload debian package version 3.7.4-1 - https://phabricator.wikimedia.org/T182347#3838567 (10thcipriani) I don't see the `debian/3.7.4-1` tag in the repo, but I do see `3.7.4` tagged. The configuration flag is present at that tag, could an opsen rebuild and upload a `... [20:47:56] 10Operations, 10Ops-Access-Requests: Requesting access to analytics-privatedata-users group for Jonas Kress - https://phabricator.wikimedia.org/T182908#3838569 (10RobH) a:03Jonas Jonas: Typically web logins are not tied to shell user groups. Is this request to be able to login to https://hue.wikimedia.org... 
[20:48:39] (03PS2) 10Andrew Bogott: wmcs puppet: Support instance agents using the 'puppet' master hostname [puppet] - 10https://gerrit.wikimedia.org/r/398323 [20:54:24] 10Operations, 10Ops-Access-Requests: Requesting access to analytics-privatedata-users group for Jonas Kress - https://phabricator.wikimedia.org/T182908#3838581 (10Ottomata) @Jonas, actually, when you emailed me, I also should have asked: what data are you trying to access? You might only need to be in the an... [20:54:32] 10Operations, 10Ops-Access-Requests: Requesting access to analytics-privatedata-users group for Jonas Kress - https://phabricator.wikimedia.org/T182908#3838582 (10RobH) Corrections: (Seems easier to append a new comment than try to edit my above) It seems login to hue is a manual process detailed on https://... [20:58:43] 10Operations, 10Ops-Access-Requests: Requesting access to analytics-privatedata-users group for Jonas Kress - https://phabricator.wikimedia.org/T182908#3838589 (10Dzahn) > get access to maintain services around Wikidata in production ^ This sounds quite different from "access to hue" and "analytics-privateda... [21:07:35] (03PS1) 10Chad: group2 to wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398329 [21:16:54] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3838621 (10awight) [21:16:59] 10Operations, 10ORES, 10Patch-For-Review, 10Performance, 10Scoring-platform-team (Current): Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249#3838619 (10awight) 05Open>03stalled I think we've got our tuning parameters! 45 minutes of overload, and ev... 
[21:17:07] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3456617 (10awight) [21:17:09] 10Operations, 10ORES, 10Patch-For-Review, 10Performance, 10Scoring-platform-team (Current): Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249#3838622 (10awight) [21:23:02] 10Operations, 10ORES, 10Scoring-platform-team: Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851#3838630 (10awight) [21:23:07] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3838628 (10awight) 05Open>03Resolved Ok, done for real now. @Halfak and I decided that the remaining bottlenecks are something n... [21:24:08] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team: Use external dsh group to list pooled ORES nodes - https://phabricator.wikimedia.org/T179501#3838636 (10awight) [21:25:26] 10Operations, 10ORES, 10Scoring-platform-team, 10Release-Engineering-Team (Kanban), and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3838638 (10awight) 05Open>03Resolved I haven't seen this issue in a few weeks, closing. Thank you! [21:33:39] (03PS1) 10Kaldari: Updating ssh key for kaldari [puppet] - 10https://gerrit.wikimedia.org/r/398331 [21:38:44] 10Operations, 10Ops-Access-Requests: Requesting access to analytics-privatedata-users group for Jonas Kress - https://phabricator.wikimedia.org/T182908#3838673 (10Jonas) Sorry for the confusion! @Ottomata One of the things I would like to do is using hive queries to analyze web API endpoint usage. This would... [21:40:35] mutante: I'm finally updating my ssh key (and upgrading from 2048 bit to 4096 bit). 
I updated modules/admin/data/data.yaml here: https://gerrit.wikimedia.org/r/#/c/398331/ and on my user page: https://www.mediawiki.org/wiki/User:Kaldari. What else do I need to do? [21:41:43] 10Operations, 10Ops-Access-Requests: Requesting access to analytics-privatedata-users group for Jonas Kress - https://phabricator.wikimedia.org/T182908#3838676 (10Ottomata) Ok, then you need analytics-privatedata-users. @Jonas, yes, you need a shell login, because access to the data is controlled by verifying... [21:55:25] 10Operations, 10DBA, 10Performance-Team, 10Availability (Multiple-active-datacenters): Perform testing for TLS effect on connection rate - https://phabricator.wikimedia.org/T171071#3838724 (10aaron) I keep coming with times like: ``` Same-DC (db2070.codfw.wmnet): string(56) "0.10926739454269 sec/conn (non... [22:01:57] no_justification, is the train deploy to group2 done / under way? [22:02:07] Yes [22:02:31] (03CR) 10Chad: [C: 032] group2 to wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398329 (owner: 10Chad) [22:02:48] (03PS11) 10Aaron Schulz: [WIP] Add mcrouter module and mcrouter_wancache profile [puppet] - 10https://gerrit.wikimedia.org/r/392221 [22:05:43] (03Merged) 10jenkins-bot: group2 to wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398329 (owner: 10Chad) [22:07:14] (03CR) 10jenkins-bot: group2 to wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398329 (owner: 10Chad) [22:07:17] AaronSchulz: Maybe you know. See my question to mutante above ^ [22:08:33] !log demon@tin rebuilt and synchronized wikiversions files: group2 to wmf.12 [22:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:04] so close, almost worked the first time...darn default.erb params missing ;) [22:14:22] kaldari: about keys? 
[22:14:28] yeah [22:14:56] (03PS1) 10Ayounsi: BirdLG: Add fake python secret session key [labs/private] - 10https://gerrit.wikimedia.org/r/398380 [22:15:09] kaldari: userpage seems a bit weak. Why not put the new pub key in your home dir? Then ping someone from ops that's around. [22:15:27] doesn't have to be a specific person [22:16:13] is anyone from ops around? [22:17:44] !log demon@tin rebuilt and synchronized wikiversions files: rollback, ORES breaking stuff on enwiki [22:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:55] awight|afk, halfak: ^^^^ [22:17:56] :( [22:18:03] ohgod [22:18:12] Notice: Undefined property: stdClass::$oresm_name in /srv/mediawiki/php-1.31.0-wmf.12/extensions/ORES/includes/Hooks/ApiHooksHandler.php on line 415 [22:18:40] Well, notices aren't "breaking" [22:18:50] no_justification: You’re rolling back a train? Is there a task yet? [22:18:51] But they were superrrrrrrr logspammy so I couldn't see any real breakages [22:18:56] I just rolled it back 10 seconds ago [22:19:10] well shucks. [22:19:28] I know it's not an "error" but I don't want something that spammy hiding real errors :( [22:19:47] Actually, there's a *lot* of logspam on wmf.12 [22:19:50] Not just ORES :( [22:20:47] 10Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Allow contint-admins to interact with docker on CI hosts - https://phabricator.wikimedia.org/T182860#3838762 (10hashar) @RobH thanks! I was at first hesitant t... [22:20:48] _joe_: Could you help with my ssh key switch? [22:20:48] I guess there’s nothing we can do at the moment? 
[22:21:20] Ideally fix that undefined notice :) [22:21:24] lol [22:21:34] (03CR) 10Hashar: "From T182860, Alexandros change is https://gerrit.wikimedia.org/r/#/c/398276/" [puppet] - 10https://gerrit.wikimedia.org/r/398240 (https://phabricator.wikimedia.org/T182860) (owner: 10Hashar) [22:25:31] Ah, easy fix [22:25:34] Patch incoming [22:29:03] awight|afk: Idk if this is right, but seems right :) https://gerrit.wikimedia.org/r/398381 [22:31:29] no_justification: Are you trying to hotfix and roll it back out? [22:32:14] I'm not trying to deploy a hotfix without it being merged to master, that's for sure [22:32:24] So ideally that patch can get reviewed & merged or fixed in another way [22:32:28] Otherwise, no trainz [22:32:54] Merged. TY for the footwork [22:33:25] Yay ty for +2 [22:33:40] Now we get to do the Jenkins dance for the next while :p [22:34:25] kaldari: it was all good, you were just missing the last step to add some ops on the patch in gerrit. could you put the new key in your home dir maybe like Aaron suggested [22:35:10] (03PS2) 10Dzahn: admins: Updating ssh key for kaldari [puppet] - 10https://gerrit.wikimedia.org/r/398331 (owner: 10Kaldari) [22:36:47] mutante: it's in my local home dir now. Is that where you mean? [22:37:17] kaldari: on which host though? [22:38:07] kaldari: i mean any wikimedia server you already have access to [22:38:17] ah, OK, I'll put it on terbium... [22:40:00] mutante: OK, now it's in /home/kaldari/.ssh/id_rsa.pub on terbium [22:41:06] ok, cool, this is just for manual verification purposes.
confirmed:) [22:41:13] (03CR) 10Dzahn: [C: 032] admins: Updating ssh key for kaldari [puppet] - 10https://gerrit.wikimedia.org/r/398331 (owner: 10Kaldari) [22:42:14] i'll remove it again and let puppet run the update [22:43:34] kaldari: updated on terbium and bast1001/bast2001, feel free to try it [22:46:52] 10Operations, 10Ops-Access-Requests, 10User-Addshore: Requesting access to analytics-privatedata-users group for Jonas Kress - https://phabricator.wikimedia.org/T182908#3838811 (10Addshore) [22:47:58] !log demon@tin Synchronized php-1.31.0-wmf.12/extensions/ORES/includes/Hooks/ApiHooksHandler.php: fix undefined property (duration: 01m 08s) [22:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:25] (03PS12) 10Aaron Schulz: [WIP] Add mcrouter module and mcrouter_wancache profile [puppet] - 10https://gerrit.wikimedia.org/r/392221 [22:58:11] !log demon@tin rebuilt and synchronized wikiversions files: group2 to wmf.12 (#2) [22:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:38] kaldari: resolved?:) [22:58:45] mutante: seems to work! [22:58:47] Thanks!!! [22:58:54] kaldari: cool:) yw [23:05:21] 10Operations, 10Mail, 10fundraising-tech-ops: Forward katherine@wikipedia.org and jimmy@wikipedia.org emails to katherine@wikimedia.org and jimmy@wikimedia.org, respectively - https://phabricator.wikimedia.org/T182456#3838852 (10bcampbell) Thank you, Faidon! [23:20:05] no_justification: just to confirm, wmf.12 is live everywhere, and it's here to stay? (i'm preparing some backports) [23:22:15] MatmaRex: Depends how broken it is [23:22:28] i'm asking how broken is it [23:22:44] should i backport stuff to wmf.11 and wmf.12, or just wmf.12 is enough? 
[23:24:41] MatmaRex: Here to stay until somebody else breaks it ;-) [23:24:47] aka: how much more logspam I get [23:25:04] wmf.12 seems like a shitty release [23:25:06] :( [23:25:42] It's about right, when it's just before a freeze [23:25:44] *Sounds [23:28:55] 10Operations, 10Commons, 10Multimedia, 10media-storage: Generate a list of files that are supposed to exist but 404s - https://phabricator.wikimedia.org/T182822#3835699 (10Dispenser) I did something similar years ago to pre-generate thumbnails for WikiMiniAtlas with its unusual sizes (48x48). Stored the c... [23:34:32] no_justification: okay, but i'm asking seriously [23:34:41] no_justification: do you anticipate reverting to wmf.11? [23:34:48] the next SWAT has a config change that depends on wmf.12 [23:34:49] No, I don't. But I can't promise that [23:35:26] no_justification: should i backport the thing it depends on to wmf.11 and SWAT that too? [23:35:37] It certainly can't hurt [23:36:01] Safest option is to skip swat entirely :) [23:37:14] you're being super unhelpful right now. eh [23:38:44] If people didn't write broken code, we wouldn't have to revert [23:40:39] MatmaRex: Well when people ship /obviously/ broken code and I spend part of my afternoon rolling the train back and forth....I'm of a mind to cancel swat on said days so people can get things fixed instead. [23:40:54] Maybe then people will start to care about spammy log messages. [23:52:22] (03CR) 10Ayounsi: [V: 032 C: 032] BirdLG: Add fake python secret session key [labs/private] - 10https://gerrit.wikimedia.org/r/398380 (owner: 10Ayounsi) [23:54:51] (03PS1) 10Dzahn: dbstore: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/398390 (https://phabricator.wikimedia.org/T177225)