[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190118T0000). [00:00:04] No GERRIT patches in the queue for this window AFAICS. [00:02:09] (03CR) 10Dzahn: [C: 04-1] "fails now because the "host" variable has either an IP address or an actual host name..." [puppet] - 10https://gerrit.wikimedia.org/r/485106 (owner: 10Dzahn) [00:06:43] (03PS1) 10Bstorm: sonofgridengine: make exec hosts into submit hosts [puppet] - 10https://gerrit.wikimedia.org/r/485129 (https://phabricator.wikimedia.org/T123270) [00:11:34] (03CR) 10Bstorm: [C: 03+2] sonofgridengine: make exec hosts into submit hosts [puppet] - 10https://gerrit.wikimedia.org/r/485129 (https://phabricator.wikimedia.org/T123270) (owner: 10Bstorm) [00:11:49] I have a last minute patch for this (empty) swat window if there's no objections. 
[00:13:53] stephanebisson: you might need to ping the deployers to see if anyone is still around to deploy it [00:14:11] I will deploy it [00:15:06] (03CR) 10Sbisson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485112 (https://phabricator.wikimedia.org/T213356) (owner: 10Sbisson) [00:15:29] I've added it to the deployment schedule for good measure [00:16:07] (03Merged) 10jenkins-bot: Enable the Welcome survey on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485112 (https://phabricator.wikimedia.org/T213356) (owner: 10Sbisson) [00:17:49] (03CR) 10jenkins-bot: Enable the Welcome survey on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485112 (https://phabricator.wikimedia.org/T213356) (owner: 10Sbisson) [00:24:06] (03PS1) 10Sbisson: Revert "Enable the Welcome survey on viwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485134 [00:25:42] (03CR) 10Sbisson: [C: 03+2] "Abort mission" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485134 (owner: 10Sbisson) [00:26:39] (03Merged) 10jenkins-bot: Revert "Enable the Welcome survey on viwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485134 (owner: 10Sbisson) [00:28:26] SWAT finished (nothing deployed, didn't work as expected) [00:30:45] (03CR) 10jenkins-bot: Revert "Enable the Welcome survey on viwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485134 (owner: 10Sbisson) [00:34:42] 10Operations, 10ops-eqiad: cloudstore100{8,9} - Upgrade to 10GbE - https://phabricator.wikimedia.org/T214079 (10CDanis) p:05Triage→03Normal [00:40:08] (03PS1) 10Smalyshev: Add allocator metrics export for Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/485135 (https://phabricator.wikimedia.org/T213372) [00:40:27] (03PS3) 10Smalyshev: wdqs: convert prom exporter script to py3 [puppet] - 10https://gerrit.wikimedia.org/r/484974 (https://phabricator.wikimedia.org/T213305) (owner: 10Mathew.onipe) [00:40:54] (03CR) 10jerkins-bot: [V: 04-1] Add allocator 
metrics export for Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/485135 (https://phabricator.wikimedia.org/T213372) (owner: 10Smalyshev) [00:41:43] (03PS2) 10Smalyshev: Add allocator metrics export for Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/485135 (https://phabricator.wikimedia.org/T213372) [00:42:27] (03CR) 10jerkins-bot: [V: 04-1] Add allocator metrics export for Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/485135 (https://phabricator.wikimedia.org/T213372) (owner: 10Smalyshev) [00:44:24] (03PS3) 10Smalyshev: Add allocator metrics export for Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/485135 (https://phabricator.wikimedia.org/T213372) [00:45:07] (03CR) 10jerkins-bot: [V: 04-1] Add allocator metrics export for Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/485135 (https://phabricator.wikimedia.org/T213372) (owner: 10Smalyshev) [00:48:52] (03PS4) 10Smalyshev: Add allocator metrics export for Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/485135 (https://phabricator.wikimedia.org/T213372) [00:50:06] (03CR) 10jerkins-bot: [V: 04-1] Add allocator metrics export for Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/485135 (https://phabricator.wikimedia.org/T213372) (owner: 10Smalyshev) [00:56:35] (03PS5) 10Smalyshev: Add allocator metrics export for Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/485135 (https://phabricator.wikimedia.org/T213372) [00:57:18] (03CR) 10jerkins-bot: [V: 04-1] Add allocator metrics export for Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/485135 (https://phabricator.wikimedia.org/T213372) (owner: 10Smalyshev) [00:58:12] (03CR) 10Smalyshev: "Hopefully I'm doing this right... I have no idea what flake8 wants from me though, I have no way to find out what indentation it wants." 
[puppet] - 10https://gerrit.wikimedia.org/r/485135 (https://phabricator.wikimedia.org/T213372) (owner: 10Smalyshev) [01:05:32] 10Operations, 10SRE-Access-Requests: Requesting access to production for dsharpe - https://phabricator.wikimedia.org/T214130 (10Dsharpe) [01:05:41] (03PS3) 10Krinkle: Remove unnecessary exception handling from wfGetPrivilegedGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481116 (owner: 10Gergő Tisza) [01:06:00] (03PS4) 10Krinkle: Remove unnecessary exception handling from wmfGetPrivilegedGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481116 (owner: 10Gergő Tisza) [01:09:31] (03CR) 10Krinkle: "None of the called methods have @throws or otherwise throw from what I can see, with the exception of (indirectly) from IDatabase::select" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481116 (owner: 10Gergő Tisza) [01:10:31] 10Operations, 10SRE-Access-Requests: Requesting access to production for dsharpe - https://phabricator.wikimedia.org/T214130 (10Peachey88) [01:12:25] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [01:12:33] RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 0.28 seconds [01:12:47] RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 0.18 seconds [01:12:49] RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [01:12:55] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 0.07 seconds [01:13:13] RECOVERY - MariaDB Slave Lag: s7 on db2095 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [01:13:17] RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 0.26 seconds [01:13:25] RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [01:13:27] RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 0.00 seconds 
[01:19:43] 10Operations, 10SRE-Access-Requests: Requesting access to production for dsharpe - https://phabricator.wikimedia.org/T214130 (10Dsharpe) [01:33:49] (03CR) 10Gergő Tisza: "> getSafeReadDB() throws when CA is read only. that might be a problem?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481116 (owner: 10Gergő Tisza) [01:45:14] 10Operations, 10Performance-Team, 10VisualEditor, 10Software-Licensing: New MongoDB version is not DFSG-compatible, dropped by Debian - https://phabricator.wikimedia.org/T213996 (10Legoktm) The VisualEditor "rebaser" has some code/config related to mongodb: https://gerrit.wikimedia.org/g/VisualEditor/Visua... [01:47:18] (03PS6) 10Smalyshev: Add allocator metrics export for Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/485135 (https://phabricator.wikimedia.org/T213372) [01:47:49] (03CR) 10Mobrovac: [C: 03+1] "LGTM, sorry for the late review." (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/483184 (owner: 10Alexandros Kosiaris) [02:10:46] PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.89 seconds [02:10:48] PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.35 seconds [02:10:56] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.17 seconds [02:10:58] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.03 seconds [02:11:10] PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.88 seconds [02:11:26] PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.25 seconds [02:11:46] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.05 seconds [02:11:50] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.20 seconds [02:21:33] 
10Operations, 10Cassandra, 10Dependency-Tracking, 10Wikibase-Quality, and 7 others: Store WikibaseQualityConstraint check data in persistent storage instead of in the cache - https://phabricator.wikimedia.org/T204024 (10mobrovac) >>! In T204024#4888623, @Addshore wrote: >> will we store data only for the l... [02:35:05] 10Operations, 10SRE-Access-Requests: Requesting access to production for dsharpe - https://phabricator.wikimedia.org/T214130 (10sbassett) Hey @Dsharpe I'd imagine you'd probably need `deployment` and `analytics-privatedata-users` access for now - this is what @Bawolff and I have. [[ https://gerrit.wikimedia.... [02:35:08] RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 49.73 seconds [02:35:18] RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 17.58 seconds [02:35:20] RECOVERY - MariaDB Slave Lag: s2 on db2063 is OK: OK slave_sql_lag Replication lag: 12.60 seconds [02:35:28] RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [02:35:30] RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 0.27 seconds [02:35:44] RECOVERY - MariaDB Slave Lag: s2 on db2041 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [02:35:58] RECOVERY - MariaDB Slave Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 0.49 seconds [02:36:16] RECOVERY - MariaDB Slave Lag: s2 on db2091 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [02:39:06] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.62 seconds [02:39:08] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.59 seconds [02:39:16] PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.79 seconds [02:39:20] PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.57 seconds [02:39:20] 
PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.61 seconds [02:39:32] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.37 seconds [02:40:04] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.60 seconds [02:40:08] PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.43 seconds [02:40:12] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.18 seconds [03:18:20] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 286.62 seconds [03:59:35] (03PS1) 10CRusnov: Upgrade netbox to v2.5.2 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/485142 [04:00:00] (03CR) 10CRusnov: [C: 04-2] "For testing purposes." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/485142 (owner: 10CRusnov) [04:18:43] (03PS2) 10CRusnov: Upgrade netbox to v2.5.2 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/485142 [04:26:41] (03CR) 10CRusnov: "Tested in WMCS test instance (af-netbox.wmflabs.org) and works (incl migrations but maybe not reports). 
I did switch it to using the githu" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/485142 (owner: 10CRusnov) [04:29:12] (03PS3) 10CRusnov: Upgrade netbox to v2.5.2 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/485142 (https://phabricator.wikimedia.org/T212524) [05:22:17] PROBLEM - Apache HTTP on mw1289 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time [05:23:19] RECOVERY - Apache HTTP on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.108 second response time [05:24:47] PROBLEM - HHVM rendering on mw1285 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1309 bytes in 5.643 second response time [05:24:55] PROBLEM - Apache HTTP on mw1285 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time [05:25:45] RECOVERY - HHVM rendering on mw1285 is OK: HTTP OK: HTTP/1.1 200 OK - 79050 bytes in 0.149 second response time [05:25:59] RECOVERY - Apache HTTP on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.029 second response time [05:53:09] PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.24 seconds [05:53:15] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.54 seconds [05:53:33] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.15 seconds [05:53:35] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.40 seconds [05:53:35] PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.06 seconds [05:53:39] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.63 seconds [05:53:43] PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.40 seconds [05:53:53] PROBLEM - MariaDB 
Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.13 seconds [06:22:13] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool hosts for a2 rack maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485145 [06:24:57] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool hosts for a2 rack maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485145 (owner: 10Marostegui) [06:28:02] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool hosts for a2 rack maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485145 (owner: 10Marostegui) [06:28:17] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:29:08] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool DBs on A2 rack T213748 (duration: 00m 47s) [06:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:11] T213748: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 [06:29:39] !log Deploy schema change on db1075 - T85757 [06:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:41] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [06:32:31] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [06:33:46] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool hosts for a2 rack maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485145 (owner: 10Marostegui) [06:34:12] (03PS1) 10Marostegui: db-eqiad.php: Repool db1075,db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485147 [06:36:59] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational [06:37:16] ^ just needed to be restarted as with 1002 yesterday [06:42:31] RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 59.46 seconds [06:42:31] RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 59.40 seconds [06:42:39] RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 57.35 seconds [06:42:42] (03PS2) 10Marostegui: db-eqiad.php: Repool db1075,db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485147 [06:42:43] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 57.62 seconds [06:42:43] RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 56.64 seconds [06:43:03] RECOVERY - MariaDB Slave Lag: s7 on db2095 is OK: OK slave_sql_lag Replication lag: 52.52 seconds [06:43:03] RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 50.32 seconds [06:43:05] RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 51.36 seconds [06:43:09] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 51.71 seconds [06:44:12] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Repool db1075,db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485147 (owner: 10Marostegui) [06:45:21] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1075,db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485147 (owner: 10Marostegui) [06:46:21] !log Deploy 
schema change on dbstore1002:s3 - T85757 [06:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:24] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [06:46:39] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1075,db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485147 (owner: 10Marostegui) [06:46:56] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1075 and db1103 after DC hw maintenance (duration: 00m 44s) [06:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:38] (03CR) 10Muehlenhoff: [C: 04-1] "Indeed, the semantics of $DOMAIN_NETWORKS is that when used on prod is allows full access to prod and full access in labs when run in WMCS" [puppet] - 10https://gerrit.wikimedia.org/r/485092 (owner: 10Gehel) [06:52:00] (03Abandoned) 10Muehlenhoff: Align thumbor.profile and mediawiki-converters.profile [puppet] - 10https://gerrit.wikimedia.org/r/481139 (owner: 10Muehlenhoff) [06:54:39] !log Drop table tag_summary from s7 - T212255 [06:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:42] T212255: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 [06:55:53] (03CR) 10Muehlenhoff: [C: 03+1] apt: repository: trust also the source repo [puppet] - 10https://gerrit.wikimedia.org/r/483140 (owner: 10Arturo Borrero Gonzalez) [06:58:39] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:07:01] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:07:49] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:16:09] !log installing OpenSSL security updates [07:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:22] (03PS1) 10Marostegui: db-eqiad.php: Depool db1113:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485148 (https://phabricator.wikimedia.org/T210478) [07:22:02] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1113:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485148 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [07:23:09] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1113:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485148 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [07:23:39] RECOVERY - MariaDB Slave Lag: s2 on db2091 is OK: OK slave_sql_lag Replication lag: 15.68 seconds [07:23:41] RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 3.69 seconds [07:23:43] RECOVERY - MariaDB Slave Lag: s2 on db2063 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [07:23:45] RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [07:23:49] RECOVERY - MariaDB Slave Lag: s2 on db2041 is OK: OK slave_sql_lag Replication lag: 0.45 seconds [07:23:59] RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [07:24:25] RECOVERY - MariaDB Slave Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [07:24:33] RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [07:24:35] (03PS1) 10Marostegui: dbstore1003: Replace s6 with s5 [puppet] - 10https://gerrit.wikimedia.org/r/485149 (https://phabricator.wikimedia.org/T210478) [07:24:40] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1113:3315 - T210478 (duration: 00m 46s) [07:24:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:42] T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 [07:25:09] \o/ [07:25:33] (03PS1) 10Marostegui: db-eqiad.php: Depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485150 [07:25:35] (03CR) 10Marostegui: [C: 03+2] dbstore1003: Replace s6 with s5 [puppet] - 10https://gerrit.wikimedia.org/r/485149 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [07:25:55] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1113:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485148 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [07:27:27] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485150 (owner: 10Marostegui) [07:28:30] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485150 (owner: 10Marostegui) [07:29:29] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1113:3316 for mysql upgrade (duration: 00m 45s) [07:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:17] !log Stop MySQL on db1113:3315 and db1113:3316 to clone dbstore1003 and for mysql and kernel upgrade [07:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:19] !log rolling restart of AQS to pick up OpenSSL security updates for nodejs [07:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:02] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485150 (owner: 10Marostegui) [08:09:12] (03PS2) 10Elukey: eventlogging: Remove all mentions of MongoDB [puppet] - 10https://gerrit.wikimedia.org/r/485127 (owner: 10MaxSem) [08:10:03] (03CR) 10Elukey: [C: 03+2] eventlogging: Remove all mentions of MongoDB [puppet] - 
10https://gerrit.wikimedia.org/r/485127 (owner: 10MaxSem) [08:12:57] !log depool and take snapshots of prometheus data on prometheus2003 to test v2 conversion - T187987 [08:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:00] T187987: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 [08:16:42] (03PS1) 10Amire80: Define ImportSources for nywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485151 [08:17:34] PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:18:08] (03PS2) 10Amire80: Define ImportSources for nywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485151 [08:18:36] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 78991 bytes in 0.111 second response time [08:28:35] (03PS2) 10Gehel: elasticsearch: allow cumin to connect to relforge [puppet] - 10https://gerrit.wikimedia.org/r/485092 [08:30:11] (03CR) 10Gehel: "I'm pretty sure I did the check from cumin and it failed for search.svc.eqiad.wmnet. Probably a typo somewhere." [puppet] - 10https://gerrit.wikimedia.org/r/485092 (owner: 10Gehel) [08:35:18] (03CR) 10Muehlenhoff: "The patch looks fine, but could you clarify in the commit message why the change is made? 
The topic mentions Icinga shard checks, but this" [puppet] - 10https://gerrit.wikimedia.org/r/485092 (owner: 10Gehel) [08:37:46] (03PS3) 10Gehel: elasticsearch: allow cumin to connect to relforge [puppet] - 10https://gerrit.wikimedia.org/r/485092 [08:38:13] (03CR) 10Muehlenhoff: [C: 03+1] elasticsearch: allow cumin to connect to relforge [puppet] - 10https://gerrit.wikimedia.org/r/485092 (owner: 10Gehel) [08:39:40] (03CR) 10Gehel: [C: 03+2] elasticsearch: allow cumin to connect to relforge [puppet] - 10https://gerrit.wikimedia.org/r/485092 (owner: 10Gehel) [08:51:34] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485157 [08:57:13] (03PS1) 10Marostegui: db-codfw.php: Add new wikis to s5 codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485158 (https://phabricator.wikimedia.org/T184805) [08:58:58] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485157 (owner: 10Marostegui) [09:00:01] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485157 (owner: 10Marostegui) [09:01:10] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1113:3316 after mysql upgrade (duration: 00m 46s) [09:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:50] (03Abandoned) 10Jcrespo: Add s8 section to the list of databases [switchdc] - 10https://gerrit.wikimedia.org/r/454583 (https://phabricator.wikimedia.org/T199079) (owner: 10Jcrespo) [09:09:25] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485157 (owner: 10Marostegui) [09:19:04] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:19:22] PROBLEM - Nginx local proxy to apache on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:19:44] PROBLEM - Apache HTTP on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:20:02] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:20:04] PROBLEM - HHVM rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:21:18] 10Operations, 10Performance-Team, 10Release-Engineering-Team: Backport pygerrit2 to Debian Stretch - https://phabricator.wikimedia.org/T214149 (10Joe) a:05Gilles→03Joe [09:21:22] (03CR) 10Jcrespo: [C: 03+1] db-codfw.php: Add new wikis to s5 codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485158 (https://phabricator.wikimedia.org/T184805) (owner: 10Marostegui) [09:21:30] (03CR) 10Marostegui: [C: 03+2] db-codfw.php: Add new wikis to s5 codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485158 (https://phabricator.wikimedia.org/T184805) (owner: 10Marostegui) [09:22:16] there was a load spike on mw1313 [09:22:32] (03Merged) 10jenkins-bot: db-codfw.php: Add new wikis to s5 codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485158 (https://phabricator.wikimedia.org/T184805) (owner: 10Marostegui) [09:22:48] (03CR) 10jenkins-bot: db-codfw.php: Add new wikis to s5 codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485158 (https://phabricator.wikimedia.org/T184805) (owner: 10Marostegui) [09:23:49] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Add migrated wikis from s3 to s5 to codfw config T184805 (duration: 00m 45s) [09:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:52] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 [09:24:03] 
10Operations, 10ops-eqiad, 10RESTBase, 10RESTBase-Cassandra, and 3 others: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 (10fgiunchedi) >>! In T212418#4890534, @mobrovac wrote: > So now we should be able to get `restbase1016` back into the cluster. Since we need to re-bootstrap t... [09:26:57] 10Operations, 10ops-eqiad, 10RESTBase, 10RESTBase-Cassandra, and 3 others: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts: ` restbase1016.eqiad.wmnet ` The log can be found in... [09:27:13] (03CR) 10Elukey: "Ran https://puppet-compiler.wmflabs.org/compiler1002/14373/ against most of the hosts (if not all?) in the catalog, the only failures that" [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey) [09:28:44] (03PS1) 10Mathew.onipe: kartotherian: install nodejs-legacy module [puppet] - 10https://gerrit.wikimedia.org/r/485164 (https://phabricator.wikimedia.org/T198622) [09:28:46] 10Operations, 10ops-eqiad, 10RESTBase, 10RESTBase-Cassandra, and 3 others: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase1016.eqiad.wmnet'] ` Of which those **FAILED**: ` ['restbase1016.eqiad.wmnet'] ` [09:28:59] (03PS1) 10Filippo Giunchedi: install_server: reimage restbase1016 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/485165 (https://phabricator.wikimedia.org/T212418) [09:29:14] <_joe_> !log uploading python{,3}-pygerrit2 to stretch-wikimedia, T214149 [09:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:17] T214149: Backport pygerrit2 to Debian Stretch - https://phabricator.wikimedia.org/T214149 [09:29:33] (03PS2) 10Filippo Giunchedi: install_server: reimage restbase1016 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/485165 
(https://phabricator.wikimedia.org/T212418) [09:31:08] 10Operations, 10Performance-Team, 10Release-Engineering-Team: Backport pygerrit2 to Debian Stretch - https://phabricator.wikimedia.org/T214149 (10Joe) 05Open→03Resolved [09:31:38] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: reimage restbase1016 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/485165 (https://phabricator.wikimedia.org/T212418) (owner: 10Filippo Giunchedi) [09:32:58] 10Operations, 10Pybal, 10Traffic, 10monitoring: prometheus metrics apparently are missing some ipvs entries - https://phabricator.wikimedia.org/T214072 (10Vgutierrez) It appears that prometheus is not listing any IPVS service without backends, and right now (IPVS wise), dns_rec6 doesn't have any backend se... [09:35:25] !log Deploy schema change on db2067 - T210713 [09:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:28] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [09:36:09] 10Operations, 10ops-eqiad, 10RESTBase, 10RESTBase-Cassandra, and 3 others: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts: ` restbase1016.eqiad.wmnet ` The log can be found in... 
[09:36:13] 10Operations, 10ops-eqiad, 10RESTBase, 10RESTBase-Cassandra, and 3 others: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase1016.eqiad.wmnet'] ` Of which those **FAILED**: ` ['restbase1016.eqiad.wmnet'] ` [09:36:20] (03PS1) 10Elukey: Assign the Hadoop coordinator role to analytics1030 [puppet] - 10https://gerrit.wikimedia.org/r/485167 (https://phabricator.wikimedia.org/T212256) [09:36:31] (03CR) 10Gehel: [C: 04-1] "see comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/485164 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [09:40:25] (03PS2) 10Elukey: Assign the Hadoop coordinator role to analytics1030 [puppet] - 10https://gerrit.wikimedia.org/r/485167 (https://phabricator.wikimedia.org/T212256) [09:44:29] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14374/analytics1030.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/485167 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [09:45:46] PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.55 seconds [09:45:52] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.06 seconds [09:45:54] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.67 seconds [09:46:00] PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.40 seconds [09:46:12] PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.03 seconds [09:46:18] PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.47 seconds [09:50:09] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 329.60 seconds [09:50:54] (03CR) 10Gehel: wdqs: convert prom exporter script to 
py3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484974 (https://phabricator.wikimedia.org/T213305) (owner: 10Mathew.onipe) [09:52:59] !log Deploy schema change on db2089 - T210713 [09:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:02] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [09:53:30] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1113:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485168 [09:55:51] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1113:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485168 (owner: 10Marostegui) [09:56:55] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1113:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485168 (owner: 10Marostegui) [09:57:31] !log Add dbstore1003:3315 to tendril - T210478 [09:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:34] T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 [09:58:04] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1113:3315 - T210478 (duration: 00m 45s) [09:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:53] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1113:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485168 (owner: 10Marostegui) [10:02:13] !log Add dbstore1003:3315 to zarcillo - T210478 [10:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:34] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 468.52 seconds [10:04:34] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 468.52 seconds [10:10:27] (03PS1) 10Vgutierrez: add missing IPv6 records for dns200[12].wikimedia.org [dns] - 
10https://gerrit.wikimedia.org/r/485169 (https://phabricator.wikimedia.org/T214072) [10:12:28] PROBLEM - Hive Metastore on analytics1030 is CRITICAL: NRPE: Command check_hive-metasore not defined [10:13:38] this is a testing host --^ [10:13:45] I've set notifications disabled though [10:13:56] maybe only for the host [10:14:16] PROBLEM - Hive Server on analytics1030 is CRITICAL: NRPE: Command check_hive-server2 not defined [10:14:52] (03CR) 10Vgutierrez: [C: 03+2] add missing IPv6 records for dns200[12].wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/485169 (https://phabricator.wikimedia.org/T214072) (owner: 10Vgutierrez) [10:16:49] 10Operations, 10Discovery-Search, 10Maps: Fix node vs nodejs dependency issue - https://phabricator.wikimedia.org/T214153 (10Mathew.onipe) [10:18:58] 10Operations, 10Discovery-Search, 10Maps: Fix node vs nodejs dependency issue - https://phabricator.wikimedia.org/T214153 (10MoritzMuehlenhoff) if you install the nodejs-legacy package, it will provide a symlink from node to nodejs. [10:19:23] (03PS2) 10Mathew.onipe: kartotherian: install nodejs-legacy module [puppet] - 10https://gerrit.wikimedia.org/r/485164 (https://phabricator.wikimedia.org/T198622) [10:19:30] 10Operations, 10Discourse, 10Wikimedia-Mailing-lists: Create discourse-test mailing list - https://phabricator.wikimedia.org/T214077 (10AdHuikeshoven) That temporary list still exists. I granted Quim admin rights. 
[10:23:38] !log restarting pybal in lvs2005 - T214072 [10:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:41] !log Deploy schema change on db2087:3316 - T210713 [10:23:42] T214072: prometheus metrics apparently are missing some ipvs entries - https://phabricator.wikimedia.org/T214072 [10:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:45] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [10:24:11] 10Operations, 10Discovery-Search, 10Maps: Fix node vs nodejs dependency issue - https://phabricator.wikimedia.org/T214153 (10Gehel) >>! In T214153#4891838, @MoritzMuehlenhoff wrote: > if you install the nodejs-legacy package, it will provide a symlink from node to nodejs. I was going to say the same :) Wha... [10:25:41] RECOVERY - PyBal IPVS diff check on lvs2005 is OK: OK: no difference between hosts in IPVS/PyBal [10:25:48] _joe_: ^^ :) [10:26:56] <_joe_> lol [10:28:29] that appeared yesterday, after committing an update in that icinga check [10:28:45] so at least now we will detect that kind of issue [10:29:48] !log restarting pybal in lvs2002 - T214072 [10:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:51] T214072: prometheus metrics apparently are missing some ipvs entries - https://phabricator.wikimedia.org/T214072 [10:30:37] RECOVERY - PyBal IPVS diff check on lvs2002 is OK: OK: no difference between hosts in IPVS/PyBal [10:31:22] 10Operations, 10Pybal, 10Traffic, 10monitoring, 10Patch-For-Review: prometheus metrics apparently are missing some ipvs entries - https://phabricator.wikimedia.org/T214072 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez as @Joe properly pointed out in our IRC discussions, the main issue here is tha... 
[10:31:43] 10Operations, 10ops-eqiad, 10RESTBase, 10RESTBase-Cassandra, and 3 others: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 (10fgiunchedi) @mobrovac @Eevans the host has been reimaged and ready for cassandra bootstraps. I'll start with the bootstraps later today. [10:31:47] 10Operations, 10Pybal, 10Traffic, 10monitoring, 10Patch-For-Review: dns200[12] lack IPv6 records - https://phabricator.wikimedia.org/T214072 (10Vgutierrez) [10:33:11] RECOVERY - HHVM rendering on mw1313 is OK: HTTP OK: HTTP/1.1 200 OK - 78998 bytes in 0.208 second response time [10:33:21] RECOVERY - Nginx local proxy to apache on mw1313 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.045 second response time [10:33:51] RECOVERY - Apache HTTP on mw1313 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 1.551 second response time [10:34:03] 10Operations, 10Discovery-Search, 10Maps: Fix node vs nodejs dependency issue - https://phabricator.wikimedia.org/T214153 (10Mathew.onipe) But it could also be one of 'millions' of nodes modules in node_modules calling `/usr/bin/node` or `/usr/bin/nodejs` [10:36:00] 10Operations, 10Discovery-Search, 10Maps: Fix node vs nodejs dependency issue - https://phabricator.wikimedia.org/T214153 (10MoritzMuehlenhoff) True that, also note that in the nodejs 10 packages (from component/node10), the nodejs-legacy package is gone. Debian dropped it, we could patch it back in, but it... 
[10:39:09] (03PS1) 10Dr0ptp4kt: Add a Google Translate specific redirect-to-mobile [puppet] - 10https://gerrit.wikimedia.org/r/485171 (https://phabricator.wikimedia.org/T212197) [10:39:52] (03PS2) 10Dr0ptp4kt: WIP: Add a Google Translate specific redirect-to-mobile [puppet] - 10https://gerrit.wikimedia.org/r/485171 (https://phabricator.wikimedia.org/T212197) [10:41:35] !log Deploy schema change on db2076 - T210713 [10:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:38] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [10:42:21] !log killing and removing data from db1118 [10:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:16] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Patch-For-Review, and 3 others: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (10jijiki) [10:47:29] 10Operations, 10DBA, 10Patch-For-Review: correctable memory errors db1068 (commons primary master database) - https://phabricator.wikimedia.org/T213664 (10jcrespo) 05Open→03Resolved a:03jcrespo The errors are mostly gone, resolving and keeping an eye on it in case it happens again. [10:53:48] (03PS3) 10Arturo Borrero Gonzalez: apt: repository: trust also the source repo [puppet] - 10https://gerrit.wikimedia.org/r/483140 [10:53:52] 10Operations, 10Discovery-Search, 10Maps: Fix node vs nodejs dependency issue - https://phabricator.wikimedia.org/T214153 (10Gehel) >>! In T214153#4891867, @Mathew.onipe wrote: > But it could also be one of 'millions' of nodes modules in node_modules calling `/usr/bin/node` or `/usr/bin/nodejs` While this i... 
[10:54:08] (03PS3) 10Gehel: kartotherian: install nodejs-legacy module [puppet] - 10https://gerrit.wikimedia.org/r/485164 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [10:54:22] !log Deploy schema change on dbstore2001:3316 - T210713 [10:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:25] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [10:55:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] apt: repository: trust also the source repo [puppet] - 10https://gerrit.wikimedia.org/r/483140 (owner: 10Arturo Borrero Gonzalez) [10:55:40] (03PS4) 10Gehel: kartotherian: install nodejs-legacy module [puppet] - 10https://gerrit.wikimedia.org/r/485164 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [10:57:36] (03CR) 10Gehel: [C: 03+2] kartotherian: install nodejs-legacy module [puppet] - 10https://gerrit.wikimedia.org/r/485164 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [10:57:43] (03PS1) 10Effie Mouzeli: Apply -R 200 to memcached on mc1025 [puppet] - 10https://gerrit.wikimedia.org/r/485172 (https://phabricator.wikimedia.org/T208844) [11:08:36] PROBLEM - Apache HTTP on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time [11:08:38] PROBLEM - Nginx local proxy to apache on mw1343 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.010 second response time [11:08:48] PROBLEM - HHVM rendering on mw1343 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [11:09:00] PROBLEM - HHVM rendering on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time [11:09:50] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.047 second response time [11:09:52] RECOVERY - Nginx local proxy to apache on mw1343 is 
OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.043 second response time [11:09:52] !log Deploy schema change on db2039 (s6 codfw master) - T210713 [11:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:55] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [11:10:02] RECOVERY - HHVM rendering on mw1343 is OK: HTTP OK: HTTP/1.1 200 OK - 78998 bytes in 0.151 second response time [11:10:12] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 78998 bytes in 0.157 second response time [11:12:35] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Good overall, I'm amending the name of the cli switch and adding some more tests before merging" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/480533 (https://phabricator.wikimedia.org/T210438) (owner: 10Hashar) [11:13:53] (03PS2) 10Giuseppe Lavagetto: Option to use Docker cache [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/480533 (https://phabricator.wikimedia.org/T210438) (owner: 10Hashar) [11:15:17] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Option to use Docker cache [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/480533 (https://phabricator.wikimedia.org/T210438) (owner: 10Hashar) [11:16:22] (03Merged) 10jenkins-bot: Option to use Docker cache [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/480533 (https://phabricator.wikimedia.org/T210438) (owner: 10Hashar) [11:16:48] (03CR) 10jenkins-bot: Option to use Docker cache [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/480533 (https://phabricator.wikimedia.org/T210438) (owner: 10Hashar) [11:36:05] (03PS3) 10Dr0ptp4kt: WIP: Add a Google Translate specific redirect-to-mobile [puppet] - 10https://gerrit.wikimedia.org/r/485171 (https://phabricator.wikimedia.org/T212197) [11:37:26] (03PS4) 10Dr0ptp4kt: WIP: Add a Google Translate specific redirect-to-mobile [puppet] - 10https://gerrit.wikimedia.org/r/485171 
(https://phabricator.wikimedia.org/T212197) [11:39:21] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Patch-For-Review, and 3 others: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (10jijiki) [11:42:44] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Again +1, will just add a few tests and a version change." [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475843 (https://phabricator.wikimedia.org/T200720) (owner: 10Hashar) [11:46:06] !log copied prometheus-rsyslog-exporter from stretch-wikimedia to buster-wikimedia [11:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:05] (03PS5) 10Dr0ptp4kt: WIP: Add a Google Translate specific redirect-to-mobile [puppet] - 10https://gerrit.wikimedia.org/r/485171 (https://phabricator.wikimedia.org/T212197) [11:49:00] (03PS6) 10Dr0ptp4kt: WIP: Add a Google Translate specific redirect-to-mobile [puppet] - 10https://gerrit.wikimedia.org/r/485171 (https://phabricator.wikimedia.org/T212197) [11:51:24] (03PS4) 10Giuseppe Lavagetto: Attempt to pull images before building [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475843 (https://phabricator.wikimedia.org/T200720) (owner: 10Hashar) [11:52:38] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-Elukey, 10User-jijiki: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) [11:54:14] moritzm: are you conducting busters tests somewhere? 
[11:54:53] (03PS7) 10Dr0ptp4kt: WIP: Add a Google Translate specific redirect-to-mobile [puppet] - 10https://gerrit.wikimedia.org/r/485171 (https://phabricator.wikimedia.org/T212197) [11:55:02] I may be interesting in jumping by ssh and taking a look [11:55:05] interested* [11:55:20] I would like to know that, too- I don't have any plan to do any production change soon, but I would like to do some early testing [11:56:28] arturo: yeah, for https://phabricator.wikimedia.org/T213527. currently nothing installed via d-i yet, but initially an upgraded stretch image [12:02:48] RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 46.77 seconds [12:02:50] RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 43.31 seconds [12:02:52] RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 41.91 seconds [12:03:16] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 12.47 seconds [12:03:24] RECOVERY - MariaDB Slave Lag: s7 on db2095 is OK: OK slave_sql_lag Replication lag: 3.88 seconds [12:03:26] RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 3.17 seconds [12:03:36] RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [12:03:46] RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 0.45 seconds [12:03:50] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 0.35 seconds [12:05:20] (03PS8) 10Dr0ptp4kt: WIP: Add a Google Translate-specific redirect-to-mobile [puppet] - 10https://gerrit.wikimedia.org/r/485171 (https://phabricator.wikimedia.org/T212197) [12:08:33] (03PS9) 10Dr0ptp4kt: WIP: Add a Google Translate-specific redirect-to-mobile [puppet] - 10https://gerrit.wikimedia.org/r/485171 (https://phabricator.wikimedia.org/T212197) [12:09:54] (03PS10) 10Dr0ptp4kt: WIP: Add a Google Translate-specific redirect-to-mobile 
[puppet] - 10https://gerrit.wikimedia.org/r/485171 (https://phabricator.wikimedia.org/T212197) [12:14:19] 10Operations, 10ExternalGuidance, 10Traffic, 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10dr0ptp4kt) @BBlack patch posted for your review ^. Would you please review and let me know on patch for any additions? [12:17:05] 10Operations, 10ExternalGuidance, 10Traffic, 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10dr0ptp4kt) @BBlack hold that thought, one more condition to add. [12:27:05] (03PS11) 10Dr0ptp4kt: WIP: Add a Google Translate-specific redirect-to-mobile [puppet] - 10https://gerrit.wikimedia.org/r/485171 (https://phabricator.wikimedia.org/T212197) [12:27:59] ACKNOWLEDGEMENT - Check systemd state on registry1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Fsero this is a new registry being set up, actually this alert shouldn't have been created yet. systemd is failing due to swit container not available in eqiad [12:28:06] (03PS12) 10Dr0ptp4kt: WIP: Add a Google Translate-specific redirect-to-mobile [puppet] - 10https://gerrit.wikimedia.org/r/485171 (https://phabricator.wikimedia.org/T212197) [12:31:01] 10Operations, 10ExternalGuidance, 10Traffic, 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10dr0ptp4kt) Okay, @BBlack, //now// it's ready for review. [12:32:43] (03PS1) 10BBlack: esams/eqsin: flip unified to globalsign-2018 [puppet] - 10https://gerrit.wikimedia.org/r/485179 (https://phabricator.wikimedia.org/T209515) [12:33:33] 10Operations, 10ExternalGuidance, 10Traffic, 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10dr0ptp4kt) Heads up @phuedx . 
@BBlack and I spoke yesterday and we'll go with a simpler patch instead of the fuller refactor, given the pla... [12:38:46] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 230.36 seconds [12:43:42] (03CR) 10BBlack: [C: 03+2] "Compiler output looks right https://puppet-compiler.wmflabs.org/compiler1002/14375/" [puppet] - 10https://gerrit.wikimedia.org/r/485179 (https://phabricator.wikimedia.org/T209515) (owner: 10BBlack) [12:50:14] !log uploaded ferm 2.4-1+wmf1 to buster-wikimedia (T213527) [12:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:18] T213527: Prepare our base system layer for Debian buster - https://phabricator.wikimedia.org/T213527 [13:00:22] RECOVERY - haproxy failover on dbproxy1004 is OK: OK check_failover servers up 2 down 0 [13:04:02] PROBLEM - haproxy failover on dbproxy1004 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [13:06:34] !log mbsantos@deploy1001 Started deploy [kartotherian/deploy@0d11a2b] (stretch): Updating stretch instance with latest code, maps1003 have wrong dependencies installed [13:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:20] !log mbsantos@deploy1001 Finished deploy [kartotherian/deploy@0d11a2b] (stretch): Updating stretch instance with latest code, maps1003 have wrong dependencies installed (duration: 00m 45s) [13:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:25] gehel: onimisionipe: ^ [13:07:49] cool! 
[13:10:05] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: debdeploy: show help message if invoked with no arguments - https://phabricator.wikimedia.org/T207845 (10jbond) 05Open→03Resolved The new package has now been deployed please re-open if there are further issues Thanks John [13:10:44] 10Operations, 10Puppet, 10Packaging: Prepare puppet for Debian buster - https://phabricator.wikimedia.org/T213546 (10jbond) a:03jbond [13:18:42] !log start cassandra-a on restbase1016 - T212418 [13:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:44] T212418: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 [13:19:57] akosiaris: draft from my notes from yesterday: https://wikitech.wikimedia.org/w/index.php?title=User:Mvolz/Deploying_Zotero [13:20:40] (03CR) 10Hashar: "Thanks for the test addition! :)" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/480533 (https://phabricator.wikimedia.org/T210438) (owner: 10Hashar) [13:25:38] (03CR) 10Filippo Giunchedi: "I don't have a strong opinion on this, however intuitively having both base_dir and log_dir and the latter overrides the former when speci" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/483691 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [13:29:30] 10Operations, 10ops-eqiad, 10RESTBase, 10RESTBase-Cassandra, and 3 others: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 (10fgiunchedi) Bootstrap failed ATM, I'll try again with `replace_address` ` ERROR [main] 2019-01-18 13:18:28,813 CassandraDaemon.java:708 - Exception encount... 
[13:30:12] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Attempt to pull images before building [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475843 (https://phabricator.wikimedia.org/T200720) (owner: 10Hashar) [13:31:25] (03Merged) 10jenkins-bot: Attempt to pull images before building [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475843 (https://phabricator.wikimedia.org/T200720) (owner: 10Hashar) [13:31:51] (03CR) 10jenkins-bot: Attempt to pull images before building [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475843 (https://phabricator.wikimedia.org/T200720) (owner: 10Hashar) [13:36:11] 10Operations, 10Pybal, 10Traffic: inconsistencies between pybal configuration and IPVS status - https://phabricator.wikimedia.org/T214041 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [13:41:16] elukey: dbproxy1004 came back from downtime I guess and it wasn't reloaded since db1107's maintenance yesterday [13:41:20] I will reload it, it is not used anyways [13:41:27] !log reload haproxy on dbproxy1004 [13:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:00] RECOVERY - haproxy failover on dbproxy1004 is OK: OK check_failover servers up 2 down 0 [13:49:31] (03PS1) 10Arturo Borrero Gonzalez: labtestneutron2001: reimage to stretch and rename to cloudnet2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/485185 (https://phabricator.wikimedia.org/T212302) [13:50:04] (03CR) 10jerkins-bot: [V: 04-1] labtestneutron2001: reimage to stretch and rename to cloudnet2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/485185 (https://phabricator.wikimedia.org/T212302) (owner: 10Arturo Borrero Gonzalez) [13:51:28] marostegui: thanks! 
[13:51:31] (03PS2) 10Arturo Borrero Gonzalez: labtestneutron2001: reimage to stretch and rename to cloudnet2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/485185 (https://phabricator.wikimedia.org/T214167) [13:51:33] (03CR) 10Muehlenhoff: [C: 04-1] labtestneutron2001: reimage to stretch and rename to cloudnet2001-dev (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/485185 (https://phabricator.wikimedia.org/T214167) (owner: 10Arturo Borrero Gonzalez) [13:52:03] (03CR) 10jerkins-bot: [V: 04-1] labtestneutron2001: reimage to stretch and rename to cloudnet2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/485185 (https://phabricator.wikimedia.org/T214167) (owner: 10Arturo Borrero Gonzalez) [13:55:04] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Jan-Mar 2019), 10Documentation: Add Srishti to analytics-privatedata-users - https://phabricator.wikimedia.org/T213780 (10CDanis) [13:57:48] PROBLEM - cassandra-b SSL 10.64.0.33:7001 on restbase1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [13:58:02] PROBLEM - cassandra-b service on restbase1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive [13:58:04] PROBLEM - cassandra-a CQL 10.64.0.32:9042 on restbase1016 is CRITICAL: connect to address 10.64.0.32 and port 9042: Connection refused [13:58:30] PROBLEM - Restbase root url on restbase1016 is CRITICAL: connect to address 10.64.0.31 and port 7231: Connection refused [13:58:32] PROBLEM - cassandra-c SSL 10.64.0.34:7001 on restbase1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [13:58:36] PROBLEM - cassandra-c CQL 10.64.0.34:9042 on restbase1016 is CRITICAL: connect to address 10.64.0.34 and port 9042: Connection refused [13:58:36] PROBLEM - cassandra-c service on restbase1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive [13:58:38] PROBLEM - cassandra-b CQL 10.64.0.33:9042 on restbase1016 is CRITICAL: connect to address 
10.64.0.33 and port 9042: Connection refused [13:58:41] reimage? [13:58:48] yeah [13:58:54] gah, I'll silence [14:01:14] (03PS1) 10CDanis: srishakatux: shell access and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/485186 (https://phabricator.wikimedia.org/T213780) [14:01:25] (03PS1) 10Arturo Borrero Gonzalez: labtestneutron2001: rename to cloudnet2001-dev [dns] - 10https://gerrit.wikimedia.org/r/485187 (https://phabricator.wikimedia.org/T214167) [14:01:36] (03CR) 10jerkins-bot: [V: 04-1] labtestneutron2001: rename to cloudnet2001-dev [dns] - 10https://gerrit.wikimedia.org/r/485187 (https://phabricator.wikimedia.org/T214167) (owner: 10Arturo Borrero Gonzalez) [14:02:55] (03CR) 10CDanis: [C: 03+2] srishakatux: shell access and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/485186 (https://phabricator.wikimedia.org/T213780) (owner: 10CDanis) [14:09:17] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Jan-Mar 2019), 10Documentation, 10Patch-For-Review: Add Srishti to analytics-privatedata-users - https://phabricator.wikimedia.org/T213780 (10CDanis) 05Open→03Resolved You should be all set. 
[14:09:30] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Jan-Mar 2019), 10Documentation, 10Patch-For-Review: Add Srishti to analytics-privatedata-users - https://phabricator.wikimedia.org/T213780 (10CDanis) [14:22:35] 10Operations, 10SRE-Access-Requests: Requesting access to production for dsharpe - https://phabricator.wikimedia.org/T214130 (10CDanis) p:05Triage→03Normal a:03Dsharpe [14:24:04] (03PS1) 10Elukey: hive::metastore::sql: simply mysql commands [puppet/cdh] - 10https://gerrit.wikimedia.org/r/485189 [14:24:30] 10Operations, 10SRE-Access-Requests: Requesting access to production for dsharpe - https://phabricator.wikimedia.org/T214130 (10CDanis) Hi David, Just a couple things for you: - WMF policy requires your manager comment on this ticket giving approval for access - Can you confirm that this is a fresh SSH k... [14:34:51] 10Operations, 10Analytics, 10Research, 10Article-Recommendation, 10User-Marostegui: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) @Marostegui thanks, your last suggestion is captured a {T211980}. 
[14:35:43] 10Operations, 10Analytics, 10Research, 10Article-Recommendation, 10User-Marostegui: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) [14:44:44] (03PS1) 10Elukey: Assign role::analytics_test_cluster::hadoop::ui to analytics1039 [puppet] - 10https://gerrit.wikimedia.org/r/485190 (https://phabricator.wikimedia.org/T212256) [14:47:17] (03CR) 10Elukey: [C: 03+2] Assign role::analytics_test_cluster::hadoop::ui to analytics1039 [puppet] - 10https://gerrit.wikimedia.org/r/485190 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [15:01:09] (03PS1) 10Muehlenhoff: package_builder: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/485191 [15:01:43] (03PS3) 10GTirloni: labstore::device_backup - Expose systemd OnCalendar syntax [puppet] - 10https://gerrit.wikimedia.org/r/485079 (https://phabricator.wikimedia.org/T209527) [15:02:12] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring - https://phabricator.wikimedia.org/T148541 (10faidon) @fgiunchendi so could you describe in a bit more detail what is needed here and what were the challenges you faced with... 
[15:04:26] (03CR) 10GTirloni: [C: 03+2] labstore::device_backup - Expose systemd OnCalendar syntax [puppet] - 10https://gerrit.wikimedia.org/r/485079 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [15:05:40] (03PS1) 10Mathew.onipe: maps: increase wal_size for postgres 9.6 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/485192 (https://phabricator.wikimedia.org/T198622) [15:08:48] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:10:40] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:11:24] (03PS2) 10Mathew.onipe: maps: increase wal_size for postgres 9.6 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/485192 (https://phabricator.wikimedia.org/T198622) [15:11:44] (03PS1) 10Filippo Giunchedi: prometheus: update prometheus-labs-targets to use keystone/nova clients [puppet] - 10https://gerrit.wikimedia.org/r/485193 (https://phabricator.wikimedia.org/T214058) [15:13:06] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:13:09] (03PS3) 10Mathew.onipe: maps: increase wal_size for postgres 9.6 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/485192 (https://phabricator.wikimedia.org/T198622) [15:13:19] 10Operations, 10SRE-Access-Requests: Requesting access to production for dsharpe - https://phabricator.wikimedia.org/T214130 (10JBennett) Approved [15:13:42] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:16:51] 10Operations, 10SRE-Access-Requests: Requesting access to production for dsharpe - https://phabricator.wikimedia.org/T214130 (10Dsharpe) Yes, the ssh key pair is entirely new, and not used any where else at all. Thank you! 
[15:20:17] (03CR) 10Gehel: [C: 04-1] maps: increase wal_size for postgres 9.6 on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/485192 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [15:20:38] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 17 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[hue] [15:22:50] PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:23:25] (03PS4) 10Mathew.onipe: maps: increase wal_size for postgres 9.6 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/485192 (https://phabricator.wikimedia.org/T198622) [15:23:54] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:24:59] (03CR) 10Mathew.onipe: maps: increase wal_size for postgres 9.6 on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/485192 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [15:25:54] PROBLEM - Hue Server on analytics1039 is CRITICAL: NRPE: Command check_hue not defined [15:26:42] test hosts --^ [15:27:16] PROBLEM - DPKG on analytics1039 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:27:18] !log rebooting elnath to pick up SSBD-enabled qemu [15:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:30] PROBLEM - Check systemd state on analytics1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:33:25] (03CR) 10Mathew.onipe: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/14377/" [puppet] - 10https://gerrit.wikimedia.org/r/485192 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [15:33:42] gehel: ^ [15:36:26] RECOVERY - DPKG on analytics1039 is OK: All packages OK [15:36:28] RECOVERY - Check systemd state on analytics1039 is OK: OK - running: The system is fully operational [15:36:37] !log rebooting mwdebug servers in codfw to pick up SSBD-enabled qemu [15:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:20] RECOVERY - Hue Server on analytics1039 is OK: PROCS OK: 1 process with command name python2.7, args /usr/lib/hue/build/env/bin/hue [15:37:57] 10Operations, 10SRE-Access-Requests: Requesting access to production for dsharpe - https://phabricator.wikimedia.org/T214130 (10CDanis) [15:39:54] (03PS3) 10Arturo Borrero Gonzalez: labtestneutron2001: reimage to stretch and rename to cloudnet2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/485185 (https://phabricator.wikimedia.org/T214167) [15:41:10] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:47:12] (03PS7) 10Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 [15:48:04] (03CR) 10jerkins-bot: [V: 04-1] Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 (owner: 10Daimona Eaytoy) [15:53:59] (03PS5) 10Gehel: maps: increase wal_size for postgres 9.6 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/485192 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [15:55:00] (03CR) 10Gehel: [C: 03+2] maps: increase wal_size for postgres 9.6 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/485192 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [15:55:51] (03PS4) 10Arturo Borrero Gonzalez: 
labtestneutron2001: reimage to stretch and rename to cloudnet2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/485185 (https://phabricator.wikimedia.org/T214167) [15:56:05] (03CR) 10Muehlenhoff: [C: 03+1] labtestneutron2001: reimage to stretch and rename to cloudnet2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/485185 (https://phabricator.wikimedia.org/T214167) (owner: 10Arturo Borrero Gonzalez) [15:59:08] (03CR) 10Arturo Borrero Gonzalez: "Make sure you include 'profile::openstack::eqiad1::observerenv' somewhere to make the credentials available in the system." [puppet] - 10https://gerrit.wikimedia.org/r/485193 (https://phabricator.wikimedia.org/T214058) (owner: 10Filippo Giunchedi) [15:59:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestneutron2001: reimage to stretch and rename to cloudnet2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/485185 (https://phabricator.wikimedia.org/T214167) (owner: 10Arturo Borrero Gonzalez) [16:01:20] (03PS5) 10Daimona Eaytoy: Move all AbuseFilter config to abusefilter.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477063 (https://phabricator.wikimedia.org/T145931) [16:03:11] 10Operations, 10ops-eqiad, 10Analytics, 10DBA, 10Patch-For-Review: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10faidon) >>! In T213748#4890195, @RobH wrote: > Synced up with Chris via IRC: > > All systems were able to come back up within a2 without incident. The s... [16:05:08] 10Operations, 10ops-eqiad, 10Analytics: Rack A2's hosts alarm for PSU broken - https://phabricator.wikimedia.org/T212861 (10RobH) [16:05:18] 10Operations, 10ops-eqiad, 10Analytics, 10DBA, 10Patch-For-Review: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10RobH) 05Resolved→03Open a:05RobH→03Cmjohnson >>! In T213748#4892549, @faidon wrote: >>>! In T213748#4890195, @RobH wrote: >> Synced up with Chris... 
[16:06:39] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "Ignoring jenkins -1 because the duplicated records were added on purpose." [dns] - 10https://gerrit.wikimedia.org/r/485187 (https://phabricator.wikimedia.org/T214167) (owner: 10Arturo Borrero Gonzalez) [16:06:43] is gerrit down? [16:06:46] (03PS1) 10GTirloni: labstore - Allow multiple bdsync jobs per host [puppet] - 10https://gerrit.wikimedia.org/r/485200 (https://phabricator.wikimedia.org/T209527) [16:06:55] nop for me elukey [16:06:57] works for me elukey [16:07:03] ah no weird works in incognito [16:07:16] (03CR) 10jerkins-bot: [V: 04-1] labstore - Allow multiple bdsync jobs per host [puppet] - 10https://gerrit.wikimedia.org/r/485200 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [16:07:26] it returns me 400 [16:07:27] mmm [16:08:26] nevermind, my browser had weird settings.. thanks fsero,cdanis :) [16:09:33] 10Operations, 10SRE-Access-Requests: Requesting access to production for dsharpe - https://phabricator.wikimedia.org/T214130 (10CDanis) @faidon @mark Could one of you approve this request? We won't have an SRE meeting next week (holiday), and I'm told that we usually don't have one the week of allhands or the... 
[16:09:57] (03PS2) 10GTirloni: labstore - Allow multiple bdsync jobs per host [puppet] - 10https://gerrit.wikimedia.org/r/485200 (https://phabricator.wikimedia.org/T209527) [16:11:32] 10Operations, 10Discovery-Search, 10Maps: Fix node vs nodejs dependency issue - https://phabricator.wikimedia.org/T214153 (10CDanis) p:05Triage→03Normal [16:12:15] 10Operations, 10ops-eqiad, 10Analytics, 10DBA, 10Patch-For-Review: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10Cmjohnson) Correct, it was only the fuse [16:12:35] (03PS1) 10Muehlenhoff: Remove mw.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/485201 (https://phabricator.wikimedia.org/T156955) [16:12:54] (03CR) 10Ottomata: [C: 03+1] "Seems fine if it works...I think there may have been some reason for this that my current self wishes my past self had left a comment abou" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/485189 (owner: 10Elukey) [16:14:50] !log T214167 reimage+rename labtestneutron2001.codfw.wmnet (jessie) to cloudnet2001-dev.codfw.wmnet (stretch) [16:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:53] T214167: labtestneutron2001: reimage to stretch & rename to cloudnet2001-dev - https://phabricator.wikimedia.org/T214167 [16:15:09] (03PS2) 10Muehlenhoff: Remove mw.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/485201 (https://phabricator.wikimedia.org/T156955) [16:15:13] 10Operations, 10Analytics, 10Research, 10Article-Recommendation, 10User-Marostegui: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Ottomata) > I don't think writing from Hadoop directly to M2 master is a good idea. But it is not really my call.... 
[16:17:17] (03PS1) 10Alexandros Kosiaris: Update to 5.0.34 [software/otrs] - 10https://gerrit.wikimedia.org/r/485204 [16:18:53] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Update to 5.0.34 [software/otrs] - 10https://gerrit.wikimedia.org/r/485204 (owner: 10Alexandros Kosiaris) [16:19:45] (03PS5) 10Elukey: systemd::syslog: allow to modify the $local_logdir convention [puppet] - 10https://gerrit.wikimedia.org/r/483691 (https://phabricator.wikimedia.org/T172532) [16:19:46] 10Operations, 10Analytics, 10Research, 10Article-Recommendation, 10User-Marostegui: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Nuria) >Finding a separate place to run the import scripts is probably the path of least resistance. +1 [16:22:26] 10Operations, 10OTRS, 10Security: Upgrade to OTRS version 5.0.34 - https://phabricator.wikimedia.org/T214177 (10akosiaris) [16:22:35] 10Operations, 10OTRS, 10Security: Upgrade to OTRS version 5.0.34 - https://phabricator.wikimedia.org/T214177 (10akosiaris) 05Open→03Resolved a:03akosiaris Upgrade done. [16:24:41] (03CR) 10Elukey: "Marko/Giuseppe any comment?" 
[puppet] - 10https://gerrit.wikimedia.org/r/483691 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [16:37:22] (03CR) 10Elukey: "It turns out that changing convention is easier than merging this change, so I'll abandon it :)" [puppet] - 10https://gerrit.wikimedia.org/r/483691 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [16:37:26] (03Abandoned) 10Elukey: systemd::syslog: allow to modify the $local_logdir convention [puppet] - 10https://gerrit.wikimedia.org/r/483691 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [16:37:56] (03CR) 10Dzahn: [C: 03+1] Remove mw.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/485201 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [16:38:49] RECOVERY - cassandra-a CQL 10.64.0.32:9042 on restbase1016 is OK: TCP OK - 0.000 second response time on 10.64.0.32 port 9042 [16:41:06] (03PS1) 10GTirloni: labstore: Add id_cloudstore [labs/private] - 10https://gerrit.wikimedia.org/r/485207 (https://phabricator.wikimedia.org/T209527) [16:44:12] (03Abandoned) 10Elukey: profile::reportupdater::jobs::hadoop: move jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/483715 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [16:44:16] (03PS1) 10Elukey: profile::reportupdater::jobs::hadoop: move jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/485210 (https://phabricator.wikimedia.org/T172532) [16:44:20] (03Abandoned) 10Elukey: systemd::syslog|timer: add proper handling of ensure [puppet] - 10https://gerrit.wikimedia.org/r/483698 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [16:44:43] (03CR) 10jerkins-bot: [V: 04-1] profile::reportupdater::jobs::hadoop: move jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/485210 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [16:44:47] RECOVERY - Memory correctable errors -EDAC- on db1068 is OK: (C)4 ge (W)2 ge 1 
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1068&var-datasource=eqiad+prometheus/ops [16:45:45] (03CR) 10GTirloni: [V: 03+2 C: 03+2] labstore: Add id_cloudstore [labs/private] - 10https://gerrit.wikimedia.org/r/485207 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [16:46:02] (03PS1) 10BryanDavis: toolforge: allow relay for hosts in 172.16.0.0/21 [puppet] - 10https://gerrit.wikimedia.org/r/485211 (https://phabricator.wikimedia.org/T214131) [16:51:04] (03PS2) 10Elukey: profile::reportupdater::jobs::hadoop: move jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/485210 (https://phabricator.wikimedia.org/T172532) [16:53:08] (03CR) 10GTirloni: [C: 03+1] "Need to be changed here as well for the new Toolforge: modules/profile/templates/toolforge/mail-relay.exim4.conf.erb" [puppet] - 10https://gerrit.wikimedia.org/r/485211 (https://phabricator.wikimedia.org/T214131) (owner: 10BryanDavis) [16:56:13] (03PS3) 10Elukey: profile::reportupdater::jobs::hadoop: move jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/485210 (https://phabricator.wikimedia.org/T172532) [16:56:20] (03PS2) 10BryanDavis: toolforge: allow relay for hosts in 172.16.0.0/21 [puppet] - 10https://gerrit.wikimedia.org/r/485211 (https://phabricator.wikimedia.org/T214131) [16:57:51] (03PS1) 10Sbisson: Revert "Revert "Enable the Welcome survey on viwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485212 [16:58:45] (03PS2) 10Sbisson: Revert "Revert "Enable the Welcome survey on viwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485212 [16:59:31] (03CR) 10GTirloni: [C: 03+1] toolforge: allow relay for hosts in 172.16.0.0/21 [puppet] - 10https://gerrit.wikimedia.org/r/485211 (https://phabricator.wikimedia.org/T214131) (owner: 10BryanDavis) [16:59:34] (03CR) 10Elukey: "Dan: you can see the final layout in https://puppet-compiler.wmflabs.org/compiler1002/14381/stat1007.eqiad.wmnet/" [puppet] - 
10https://gerrit.wikimedia.org/r/485210 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [16:59:40] (03PS3) 10Sbisson: labs: Remove $wgGEHelpPanelSearchDevMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485097 (https://phabricator.wikimedia.org/T214083) (owner: 10Catrope) [17:00:11] (03CR) 10Sbisson: labs: Remove $wgGEHelpPanelSearchDevMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485097 (https://phabricator.wikimedia.org/T214083) (owner: 10Catrope) [17:00:19] (03CR) 10Sbisson: [C: 03+2] labs: Remove $wgGEHelpPanelSearchDevMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485097 (https://phabricator.wikimedia.org/T214083) (owner: 10Catrope) [17:02:36] (03Merged) 10jenkins-bot: labs: Remove $wgGEHelpPanelSearchDevMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485097 (https://phabricator.wikimedia.org/T214083) (owner: 10Catrope) [17:02:42] (03PS5) 10Dzahn: webperf: add data types, split statsd host/port params [puppet] - 10https://gerrit.wikimedia.org/r/485106 [17:06:05] elukey: I will reload dbproxy1009 too, as had the same issue as dbproxy1004 [17:06:18] !log Reload haproxy on dbproxy1009 after rack a2 maintenance [17:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:42] ack! 
[17:07:06] (03PS6) 10Dzahn: webperf: add data types, split statsd host/port params [puppet] - 10https://gerrit.wikimedia.org/r/485106 [17:07:21] RECOVERY - haproxy failover on dbproxy1009 is OK: OK check_failover servers up 2 down 0 [17:08:35] 10Operations, 10ops-codfw, 10DC-Ops: codfw: tename/relabel labtestneutron2001 to cloudnet2001-dev - https://phabricator.wikimedia.org/T214181 (10aborrero) p:05Triage→03Normal [17:09:36] (03CR) 10Dzahn: [C: 03+1] "works now https://puppet-compiler.wmflabs.org/compiler1002/14384/" [puppet] - 10https://gerrit.wikimedia.org/r/485106 (owner: 10Dzahn) [17:15:23] (03CR) 10jenkins-bot: labs: Remove $wgGEHelpPanelSearchDevMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485097 (https://phabricator.wikimedia.org/T214083) (owner: 10Catrope) [17:16:40] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring - https://phabricator.wikimedia.org/T148541 (10fgiunchedi) >>! In T148541#4892455, @faidon wrote: > @fgiunchedi so could you describe in a bit more detail what is needed here... [17:19:16] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring - https://phabricator.wikimedia.org/T148541 (10fgiunchedi) [17:19:58] PROBLEM - Disk space on cloudnet2001-dev is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.20.4: Connection reset by peer [17:23:08] RECOVERY - cassandra-b service on restbase1016 is OK: OK - cassandra-b is active [17:23:40] 10Operations: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (10crusnov) I am looking at this a bit, and I find it curious that logsocket plugin specifically is where the crash is happening since that is what is logging to logstash presumably (and not... 
[17:24:16] !log bootstrap cassandra-b on restbase1016 - T212418 [17:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:19] T212418: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 [17:27:20] PROBLEM - MD RAID on cloudnet2001-dev is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.20.4: Connection reset by peer [17:28:22] RECOVERY - Disk space on cloudnet2001-dev is OK: DISK OK [17:28:26] RECOVERY - MD RAID on cloudnet2001-dev is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [17:30:08] (03PS4) 10Giuseppe Lavagetto: Log docker build output [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475779 (owner: 10Hashar) [17:30:10] (03PS1) 10Giuseppe Lavagetto: Fix the logic of the FSM to account for the fact we allow pulling [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/485219 [17:30:38] PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.01 seconds [17:30:52] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.21 seconds [17:31:02] PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.37 seconds [17:31:04] PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.47 seconds [17:31:32] PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.85 seconds [17:31:36] (03CR) 10jerkins-bot: [V: 04-1] Log docker build output [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475779 (owner: 10Hashar) [17:31:42] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.61 seconds [17:31:44] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.28 seconds [17:32:30] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.21 
seconds [17:35:03] 10Operations, 10ops-codfw, 10DC-Ops: codfw: rename/relabel labtestneutron2001 to cloudnet2001-dev - https://phabricator.wikimedia.org/T214181 (10aborrero) [17:38:12] RECOVERY - cassandra-b SSL 10.64.0.33:7001 on restbase1016 is OK: SSL OK - Certificate restbase1016-b valid until 2020-06-24 13:01:15 +0000 (expires in 522 days) [17:38:30] (03PS1) 10Bstorm: toolforge: upgrade the stretch grid to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/485221 (https://phabricator.wikimedia.org/T213666) [17:41:00] (03PS1) 10MSantos: Restore cpu ratio for maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/485222 [17:41:01] does anyone have a recommendation for a Firefox extension that simply takes a text file with URLs and then downloads them all.. in exactly the same way that would happen if i manually clicked "save page" [17:41:21] all the mass downloader extensions i tried failed in one way or another or were overly complex [17:41:42] (03PS2) 10Bstorm: toolforge: upgrade the stretch grid to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/485221 (https://phabricator.wikimedia.org/T213666) [17:41:44] no but you could stick them in a text file and stick them through curl? [17:41:59] like wget seems like the obvious choice but that ofc doesn't do it like Firefox save page [17:42:10] yeah [17:42:57] i already tried curl and wget ..including setting cookie data, sending POST data to get past a login etc [17:43:08] mutante, might be some way with headless chromium or something [17:43:14] oh jeeze complex scenario [17:43:14] and that still doesn't mean i get the nice HTML with the images and fixed CSS path etc [17:43:34] if i simply login and click save though.. 
i get what i want [17:43:56] and if i use the REST api i have to get ticket and ticket history in separate calls and combine it..etc [17:44:17] maybe if the pages are alike you can save page to get the basic template, then fill it out via script [17:44:19] this points to a major tooling gap [17:44:41] bit of effort but it may produce a decent result depending on what you're handling [17:44:43] the easiest would be to automate my mouse clicks :p [17:44:49] heh true [17:44:54] there are tools for that iirc [17:45:15] selenium :) [17:45:24] what i want to do is the equivalent of static-Bugzilla [17:45:34] selenium! true [17:45:54] my goal is to kill RT :p [17:45:55] idk what kind of click automation you get but that's like essentially the scriptable headless chrome isn't it? [17:46:02] a laudable goal [17:46:15] i'll try that, thanks [17:46:19] (03PS2) 10MSantos: Restore cpu ratio for maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/485222 [17:47:23] mutante, could lose the RT software and generate your own output from the DB? :p [17:47:51] haha that was exactly what i was thinking [17:48:16] yea, i thought about that option too. but it would be like using the REST api.. you have to get page content.. then history ..comments and build the full page from it. 
certainly possible but more work [17:48:54] and you don't have the style [17:49:20] I'd be very careful running random Firefox extensions with this sort of data [17:51:03] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 4 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10Smalyshev) 05Open→03Resolved a:03Smalyshev [17:51:24] still, you might look into something like https://addons.mozilla.org/en-US/firefox/addon/simple-mass-downloader/ - I haven't looked into the source of this and can't vouch for it or anything [17:57:20] it may just get the HTML and not the full Firefox save functionality [17:57:40] that's one i tried, it could not get past the login [18:12:44] 10Operations, 10monitoring, 10Goal, 10cloud-services-team (Kanban): Toolforge: Port sge.py stats to Prometheus - https://phabricator.wikimedia.org/T211684 (10GTirloni) a:05GTirloni→03None [18:13:57] 10Operations, 10monitoring, 10Goal, 10cloud-services-team (Kanban): Toolforge: Port sge.py stats to Prometheus - https://phabricator.wikimedia.org/T211684 (10GTirloni) https://github.com/prometheus/node_exporter#textfile-collector [18:16:37] (03PS1) 10BryanDavis: toolforge: move SGE diamond collector to cronrunner on Trusty grid [puppet] - 10https://gerrit.wikimedia.org/r/485229 (https://phabricator.wikimedia.org/T214182) [18:26:13] (03CR) 10GTirloni: [C: 03+1] "Seems okay to me. Diamond is running on tools-sgecron so it shouldn't require any other changes." 
[puppet] - 10https://gerrit.wikimedia.org/r/485229 (https://phabricator.wikimedia.org/T214182) (owner: 10BryanDavis) [18:26:44] (03CR) 10Bstorm: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/485229 (https://phabricator.wikimedia.org/T214182) (owner: 10BryanDavis) [18:27:02] (03CR) 10Bstorm: [C: 03+2] toolforge: move SGE diamond collector to cronrunner on Trusty grid [puppet] - 10https://gerrit.wikimedia.org/r/485229 (https://phabricator.wikimedia.org/T214182) (owner: 10BryanDavis) [18:29:28] (03PS3) 10ArielGlenn: do multistream dumps in parallel and recombine for big wikis [dumps] - 10https://gerrit.wikimedia.org/r/484754 (https://phabricator.wikimedia.org/T213912) [18:29:51] (03CR) 10jerkins-bot: [V: 04-1] do multistream dumps in parallel and recombine for big wikis [dumps] - 10https://gerrit.wikimedia.org/r/484754 (https://phabricator.wikimedia.org/T213912) (owner: 10ArielGlenn) [18:32:05] (03PS3) 10Bstorm: toolforge: upgrade the stretch grid to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/485221 (https://phabricator.wikimedia.org/T213666) [18:32:52] (03CR) 10jerkins-bot: [V: 04-1] toolforge: upgrade the stretch grid to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/485221 (https://phabricator.wikimedia.org/T213666) (owner: 10Bstorm) [18:34:05] (03PS4) 10Bstorm: toolforge: upgrade the stretch grid to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/485221 (https://phabricator.wikimedia.org/T213666) [18:38:15] 10Operations, 10DBA, 10Jade, 10TechCom-RFC, and 3 others: Introduce a new namespace for collaborative judgements about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) >>! In T200297#4887041, @Krinkle wrote: > 3. The feature proposes to store arbitrary text (specifically, wikitext) insi... [18:39:52] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[18:40:28] bstorm_: ^ [18:40:38] Is there? [18:40:53] I hadn't typed "yes" [18:40:57] :) [18:40:58] oops [18:41:00] :) [18:41:04] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [18:50:54] (03CR) 10Bstorm: [C: 03+2] toolforge: upgrade the stretch grid to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/485221 (https://phabricator.wikimedia.org/T213666) (owner: 10Bstorm) [18:50:55] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Jan-Mar 2019), 10Documentation, 10Patch-For-Review: Add Srishti to analytics-privatedata-users - https://phabricator.wikimedia.org/T213780 (10srishakatux) Thanks a lot @CDanis! [18:51:04] (03PS5) 10Bstorm: toolforge: upgrade the stretch grid to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/485221 (https://phabricator.wikimedia.org/T213666) [19:04:37] 10Operations, 10Discovery-Search, 10Maps, 10Reading-Infrastructure-Team-Backlog: Fix node vs nodejs dependency issue - https://phabricator.wikimedia.org/T214153 (10MSantos) [19:10:34] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Broken elasticsearch-prometheus-exporter service on logstash nodes after reboot - https://phabricator.wikimedia.org/T210597 (10debt) 05Open→03Resolved a:03debt [19:20:01] 10Operations, 10Elasticsearch, 10Maps, 10Discovery-Search (Current work): Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812 (10debt) [19:20:05] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Fix prometheus elasticsearch exporter to show all the metrics - https://phabricator.wikimedia.org/T210592 (10debt) 05Open→03Resolved [19:39:00] (03PS4) 10ArielGlenn: do multistream dumps in parallel and recombine for big wikis [dumps] - 10https://gerrit.wikimedia.org/r/484754 (https://phabricator.wikimedia.org/T213912) [19:40:53] (03PS1) 10ArielGlenn: fix up iohandlers to write separate streams for header and footer again [dumps/mwbzutils] - 
10https://gerrit.wikimedia.org/r/485240 [19:57:49] 10Operations, 10MediaWiki-General-or-Unknown, 10media-storage: Lost file Juan_Guaidó.jpg - https://phabricator.wikimedia.org/T213655 (10User100100) Recover in the normal way would be nice, because otherwise people are wondering where the original file disappeared. [20:00:38] 10Operations: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (10crusnov) Side note, I have not been able to reproduce this in our test instance of netmon, even though it happens regularly in production. Also the position in the library is the same in... [20:04:42] (03CR) 10Ayounsi: [C: 03+1] "I don't see anything wrong, but someone else should review." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/485142 (https://phabricator.wikimedia.org/T212524) (owner: 10CRusnov) [20:15:48] RECOVERY - cassandra-b CQL 10.64.0.33:9042 on restbase1016 is OK: TCP OK - 0.000 second response time on 10.64.0.33 port 9042 [20:17:07] Can anyone provide me with a link to the ticket about removing usage of Silverpop's mkt4477.com/pages04.net domains? 
[20:19:26] I did find https://phabricator.wikimedia.org/T127401 but it's not just that
[20:39:36] PROBLEM - HHVM rendering on mw1264 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:40:42] RECOVERY - HHVM rendering on mw1264 is OK: HTTP OK: HTTP/1.1 200 OK - 78960 bytes in 0.138 second response time
[20:43:58] RECOVERY - MariaDB Slave Lag: s2 on db2091 is OK: OK slave_sql_lag Replication lag: 57.55 seconds
[20:44:18] RECOVERY - MariaDB Slave Lag: s2 on db2063 is OK: OK slave_sql_lag Replication lag: 53.93 seconds
[20:44:20] RECOVERY - MariaDB Slave Lag: s2 on db2041 is OK: OK slave_sql_lag Replication lag: 54.36 seconds
[20:44:26] RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 53.07 seconds
[20:44:32] RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 51.26 seconds
[20:44:34] RECOVERY - MariaDB Slave Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 51.58 seconds
[20:44:34] RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 51.18 seconds
[20:44:36] RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 51.29 seconds
[20:47:22] RECOVERY - cassandra-c SSL 10.64.0.34:7001 on restbase1016 is OK: SSL OK - Certificate restbase1016-c valid until 2020-06-24 13:01:16 +0000 (expires in 522 days)
[20:47:24] RECOVERY - cassandra-c service on restbase1016 is OK: OK - cassandra-c is active
[20:47:26] !log restbase/cassandra bootstrap restbase1016-c
[20:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:34] !log restbase/cassandra bootstrap restbase1016-c - T212418
[20:47:35] (CR) Bstorm: [C: +2] toolforge: allow relay for hosts in 172.16.0.0/21 [puppet] - https://gerrit.wikimedia.org/r/485211 (https://phabricator.wikimedia.org/T214131) (owner: BryanDavis)
[20:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:37] T212418: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418
[20:47:45] (PS3) Bstorm: toolforge: allow relay for hosts in 172.16.0.0/21 [puppet] - https://gerrit.wikimedia.org/r/485211 (https://phabricator.wikimedia.org/T214131) (owner: BryanDavis)
[21:12:48] (CR) Dzahn: [C: -2] "too many other things looking up "statsd" in Hiera common.yaml. can't change that" [puppet] - https://gerrit.wikimedia.org/r/485106 (owner: Dzahn)
[21:19:14] (PS7) Dzahn: webperf: add data types, split statsd host/port params [puppet] - https://gerrit.wikimedia.org/r/485106
[21:20:22] (CR) jerkins-bot: [V: -1] webperf: add data types, split statsd host/port params [puppet] - https://gerrit.wikimedia.org/r/485106 (owner: Dzahn)
[21:22:00] (PS8) Dzahn: webperf: add data types, split statsd host/port params [puppet] - https://gerrit.wikimedia.org/r/485106
[21:23:36] (PS9) Dzahn: webperf: add data types, split statsd host/port params [puppet] - https://gerrit.wikimedia.org/r/485106
[21:27:51] (CR) Dzahn: [C: -1] "this version would not touch common.yaml and be limited to the webperf module itself but if we keep host and port in one string then port " [puppet] - https://gerrit.wikimedia.org/r/485106 (owner: Dzahn)
[21:30:38] (PS10) Dzahn: webperf: add data types, split statsd host/port params [puppet] - https://gerrit.wikimedia.org/r/485106
[21:33:23] (CR) Dzahn: "using "$statsd_port = 0 + $statsd_parts[1]" to turn string into integer, looking better: https://puppet-compiler.wmflabs.org/compiler100" [puppet] - https://gerrit.wikimedia.org/r/485106 (owner: Dzahn)
[21:34:34] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:37:55] (PS2) Dzahn: releases: add data types to parameters, move vars to params [puppet] - https://gerrit.wikimedia.org/r/485100
[21:45:25] (CR) Dzahn: [C: -1] "more of this, strings that should not be strings. parameter 'http_port' expects a Stdlib::Port = Integer[0, 65535] value, got String" [puppet] - https://gerrit.wikimedia.org/r/485100 (owner: Dzahn)
[21:47:57] (CR) CRusnov: [C: +1] "Looks good!" [software/spicerack] - https://gerrit.wikimedia.org/r/485066 (owner: Volans)
[21:49:16] (CR) CRusnov: "> Patch Set 3: Code-Review+1" [software/netbox-deploy] - https://gerrit.wikimedia.org/r/485142 (https://phabricator.wikimedia.org/T212524) (owner: CRusnov)
[21:56:23] (PS3) Dzahn: releases: add data types to parameters, move vars to params [puppet] - https://gerrit.wikimedia.org/r/485100
[21:59:51] The noc@ address is not handled in OTRS, is it, mutante?
[22:00:28] Krenair: no, it's an exim alias directly on mx
[22:00:44] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[22:00:48] puppetized in private repo
[22:01:47] it basically equals all people with root
[22:05:46] (PS4) Dzahn: releases: add data types to parameters, move vars to params [puppet] - https://gerrit.wikimedia.org/r/485100
[22:05:54] (CR) Dzahn: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1002/14389/" [puppet] - https://gerrit.wikimedia.org/r/485100 (owner: Dzahn)
[22:06:31] (CR) Dzahn: [C: +2] releases: add data types to parameters, move vars to params [puppet] - https://gerrit.wikimedia.org/r/485100 (owner: Dzahn)
[22:08:58] mutante, yeah I thought so... yet OTRS thinks it's its own address
[22:09:13] will ask an admin to fix it
[22:09:17] (CR) Dzahn: [C: +2] "noop on releases1001/2001, also remove outdated lint-ignore" [puppet] - https://gerrit.wikimedia.org/r/485100 (owner: Dzahn)
[22:09:20] this prevented me from forwarding an email from info-en to noc
[22:12:44] Krenair: ack, it should be fixed then. thanks! it's possible it was a separate one in the very distant past (10 years?)
[22:13:53] mutante, yes
[22:16:36] (PS2) Dzahn: package_builder: add data types to parameters [puppet] - https://gerrit.wikimedia.org/r/485099
[22:22:15] (CR) Dzahn: [C: +2] Remove mw.cfg partman recipe [puppet] - https://gerrit.wikimedia.org/r/485201 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[22:22:27] (PS3) Dzahn: Remove mw.cfg partman recipe [puppet] - https://gerrit.wikimedia.org/r/485201 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[22:26:30] PROBLEM - HHVM jobrunner on mw1334 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[22:27:44] RECOVERY - HHVM jobrunner on mw1334 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.007 second response time
[22:29:46] (PS2) Dzahn: jenkins: add data types to parameters [puppet] - https://gerrit.wikimedia.org/r/485094
[22:30:20] (CR) jerkins-bot: [V: -1] jenkins: add data types to parameters [puppet] - https://gerrit.wikimedia.org/r/485094 (owner: Dzahn)
[22:36:49] (CR) Dzahn: "parameter 'service_ensure' expects a match for Stdlib::Ensure::Service = Enum['running', 'stopped'], got 'unmanaged'" [puppet] - https://gerrit.wikimedia.org/r/485094 (owner: Dzahn)
[22:38:36] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: / 2510 MB (5% inode=62%)
[23:00:40] RECOVERY - Disk space on contint1001 is OK: DISK OK
[23:04:09] (PS1) Bstorm: toolforge: remove phpunit from the stretch grid [puppet] - https://gerrit.wikimedia.org/r/485338 (https://phabricator.wikimedia.org/T213666)
[23:05:20] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team (Backlog): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (Dzahn) This happened today. Was about to make a ticket for it and found this. 17:38 < icinga-wm> PROBLEM -...
[23:09:47] (CR) Bstorm: [C: +2] toolforge: remove phpunit from the stretch grid [puppet] - https://gerrit.wikimedia.org/r/485338 (https://phabricator.wikimedia.org/T213666) (owner: Bstorm)
[23:17:20] (PS3) Dzahn: jenkins: add data types to parameters [puppet] - https://gerrit.wikimedia.org/r/485094
[23:35:30] RECOVERY - cassandra-c CQL 10.64.0.34:9042 on restbase1016 is OK: TCP OK - 0.000 second response time on 10.64.0.34 port 9042
[23:42:49] (CR) Dzahn: "parameter 'http_port' expects a Stdlib::Port = Integer[0, 65535] value, got String" [puppet] - https://gerrit.wikimedia.org/r/485094 (owner: Dzahn)
[23:44:32] (PS4) Dzahn: jenkins: add data types to parameters [puppet] - https://gerrit.wikimedia.org/r/485094
[23:53:14] !log mobrovac@deploy1001 Started deploy [restbase/deploy@f24d681]: Deploy latest version to restbase1016 (was out of rotation) - T212418
[23:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:53:18] T212418: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418
[23:53:48] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@f24d681]: Deploy latest version to restbase1016 (was out of rotation) - T212418 (duration: 00m 34s)
[23:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:54:39] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/14393/contint1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/485094 (owner: Dzahn)
[23:54:47] !log mobrovac@deploy1001 Started deploy [restbase/deploy@f24d681]: Deploy latest version to restbase1016 (was out of rotation), take #2
[23:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:05] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@f24d681]: Deploy latest version to restbase1016 (was out of rotation), take #2 (duration: 00m 18s)
[23:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:22] RECOVERY - Restbase root url on restbase1016 is OK: HTTP OK: HTTP/1.1 200 - 16234 bytes in 0.035 second response time
[23:56:56] !log mobrovac@deploy1001 Started deploy [restbase/deploy@f24d681]: Deploy latest version to restbase1016 (was out of rotation), take #3
[23:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:56] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@f24d681]: Deploy latest version to restbase1016 (was out of rotation), take #3 (duration: 01m 01s)
[23:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:58:38] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/14394/contint1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/485096 (owner: Dzahn)
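
Note on the review comments for changes 485106, 485100 and 485094: the recurring failure is a port that arrives as a string (for example the second half of a combined "host:port" value) where the parameter type expects an integer in the range 0 to 65535. The sketch below is a Python analogue of that split-and-convert step, not the Puppet code from the patches. The function name `split_host_port` and the sample hostname are hypothetical and used for illustration only.

```python
from typing import Tuple


def split_host_port(endpoint: str) -> Tuple[str, int]:
    """Split a 'host:port' string into (host, port), with port as an int.

    Hypothetical helper: mirrors the idea of splitting a combined
    statsd setting and converting the port from string to integer.
    """
    # rpartition splits on the LAST colon, so the port is always the tail.
    host, sep, port_text = endpoint.rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected 'host:port', got {endpoint!r}")

    # int() raises ValueError if the port part is not numeric.
    port = int(port_text)

    # Same bounds as the Integer[0, 65535] range quoted in the log.
    if not 0 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return host, port


if __name__ == "__main__":
    # 'statsd.example.org' is a placeholder, not a host from the log.
    host, port = split_host_port("statsd.example.org:8125")
    print(host, port, type(port).__name__)
```

The point of doing the conversion once, at the boundary, is that every later consumer receives a typed integer and a range violation fails early with a clear message instead of surfacing as a type mismatch at compile time.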