[00:23:27] PROBLEM - MariaDB Slave SQL: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1030, Errmsg: Error Got error 175 File too short: Expected more data in file from storage engine Aria on query. Default database: wikidatawiki. [Query snipped]
[00:23:49] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Incorrect key file for table linter: try to repair it on query. Default database: commonswiki. [Query snipped]
[00:23:59] PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Incorrect key file for table linter: try to repair it on query. Default database: frwiki. [Query snipped]
[00:24:11] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Incorrect key file for table linter: try to repair it on query. Default database: mediawikiwiki. [Query snipped]
[00:24:33] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Incorrect key file for table linter: try to repair it on query. Default database: eswiki. [Query snipped]
[00:24:33] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Incorrect key file for table ores_classification: try to repair it on query. Default database: enwiki. [Query snipped]
[00:24:33] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Incorrect key file for table linter: try to repair it on query. Default database: zhwiki. [Query snipped]
[00:28:12] Hopefully this ^ paged someone..
[00:28:25] onimisionipe: Well, it pinged me fwiw :P
[00:30:35] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Incorrect key file for table linter: try to repair it on query. Default database: dewiki. [Query snipped]
[00:32:59] it may have paged the dba
[00:34:49] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1002.14 seconds
[00:36:59] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 965.01 seconds
[00:37:03] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 968.29 seconds
[00:37:09] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 974.22 seconds
[00:37:15] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 961.09 seconds
[00:37:25] Any ops around for ^^?
[00:37:29] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 993.77 seconds
[00:37:37] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 994.82 seconds
[00:43:39] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 986.95 seconds
[01:07:53] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[01:10:13] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[01:10:51] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[01:13:19] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[01:13:19] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[01:14:05] (PS3) AndyRussG: Give protect right to centralnoticeadmin on Meta [mediawiki-config] - https://gerrit.wikimedia.org/r/483044 (https://phabricator.wikimedia.org/T209873)
[01:15:11] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[01:16:17] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[01:16:53] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[01:31:11] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[01:33:35] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[01:35:40] (CR) Kosta Harlan: [C: +1] [WIP] Change links of wgGEHelpPanelLinks for kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/483996
(https://phabricator.wikimedia.org/T209467) (owner: Revi)
[06:29:25] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/smartmontools/run.d/20logger]
[06:29:31] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf]
[06:32:55] (PS1) Marostegui: Offboarding Balazs [puppet] - https://gerrit.wikimedia.org/r/484155
[06:36:51] (CR) Giuseppe Lavagetto: [C: -1] "AFAIK we need first to define "ensure: absent" on the user." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/484155 (owner: Marostegui)
[06:39:30] (PS2) Marostegui: Offboarding Balazs [puppet] - https://gerrit.wikimedia.org/r/484155
[06:40:05] (CR) jerkins-bot: [V: -1] Offboarding Balazs [puppet] - https://gerrit.wikimedia.org/r/484155 (owner: Marostegui)
[06:41:38] gah
[06:42:37] (PS3) Marostegui: Offboarding Balazs [puppet] - https://gerrit.wikimedia.org/r/484155
[06:43:08] (CR) Giuseppe Lavagetto: [C: +1] Offboarding Balazs [puppet] - https://gerrit.wikimedia.org/r/484155 (owner: Marostegui)
[06:44:14] (CR) Marostegui: [C: +2] Offboarding Balazs [puppet] - https://gerrit.wikimedia.org/r/484155 (owner: Marostegui)
[06:46:44] ACKNOWLEDGEMENT - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 23228.36 seconds Marostegui T213670 - The acknowledgement expires at: 2019-01-16 06:46:25.
[06:46:44] ACKNOWLEDGEMENT - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 23074.78 seconds Marostegui T213670 - The acknowledgement expires at: 2019-01-16 06:46:25.
[06:46:44] ACKNOWLEDGEMENT - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 23041.36 seconds Marostegui T213670 - The acknowledgement expires at: 2019-01-16 06:46:25.
[06:46:44] ACKNOWLEDGEMENT - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 23146.81 seconds Marostegui T213670 - The acknowledgement expires at: 2019-01-16 06:46:25.
[06:46:44] ACKNOWLEDGEMENT - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 22706.60 seconds Marostegui T213670 - The acknowledgement expires at: 2019-01-16 06:46:25.
[06:46:44] ACKNOWLEDGEMENT - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 23050.46 seconds Marostegui T213670 - The acknowledgement expires at: 2019-01-16 06:46:25.
[06:46:44] ACKNOWLEDGEMENT - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 23056.95 seconds Marostegui T213670 - The acknowledgement expires at: 2019-01-16 06:46:25.
[06:46:45] ACKNOWLEDGEMENT - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 23120.86 seconds Marostegui T213670 - The acknowledgement expires at: 2019-01-16 06:46:25.
[06:46:45] ACKNOWLEDGEMENT - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Incorrect key file for table ores_classification: try to repair it on query. Default database: enwiki. [Query snipped] Marostegui T213670 - The acknowledgement expires at: 2019-01-16 06:46:25.
[06:46:46] ACKNOWLEDGEMENT - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Incorrect key file for table linter: try to repair it on query. Default database: zhwiki. [Query snipped] Marostegui T213670 - The acknowledgement expires at: 2019-01-16 06:46:25.
[06:46:46] ACKNOWLEDGEMENT - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Incorrect key file for table linter: try to repair it on query. Default database: mediawikiwiki. [Query snipped] Marostegui T213670 - The acknowledgement expires at: 2019-01-16 06:46:25.
[06:46:47] ACKNOWLEDGEMENT - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Incorrect key file for table linter: try to repair it on query. Default database: commonswiki. [Query snipped] Marostegui T213670 - The acknowledgement expires at: 2019-01-16 06:46:25.
[06:46:47] ACKNOWLEDGEMENT - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Incorrect key file for table linter: try to repair it on query. Default database: dewiki. [Query snipped] Marostegui T213670 - The acknowledgement expires at: 2019-01-16 06:46:25.
[06:46:48] ACKNOWLEDGEMENT - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Incorrect key file for table linter: try to repair it on query. Default database: frwiki. [Query snipped] Marostegui T213670 - The acknowledgement expires at: 2019-01-16 06:46:25.
[06:46:48] ACKNOWLEDGEMENT - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Incorrect key file for table linter: try to repair it on query. Default database: eswiki. [Query snipped] Marostegui T213670 - The acknowledgement expires at: 2019-01-16 06:46:25.
[06:46:49] ACKNOWLEDGEMENT - MariaDB Slave SQL: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1030, Errmsg: Error Got error 175 File too short: Expected more data in file from storage engine Aria on query. Default database: wikidatawiki.
[Query snipped] Marostegui T213670 - The acknowledgement expires at: 2019-01-16 06:46:25.
[06:52:02] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[06:52:56] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[06:54:10] (PS1) Marostegui: data.yaml: Remove ssh and email from Balazs [puppet] - https://gerrit.wikimedia.org/r/484156
[06:55:02] PROBLEM - puppet last run on dbstore1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): User[banyek]
[06:55:49] ^ this gets fixed with a second puppet run
[06:55:58] PROBLEM - puppet last run on db1078 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): User[banyek]
[06:56:54] PROBLEM - puppet last run on db1091 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): User[banyek]
[07:00:06] RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:00:08] RECOVERY - puppet last run on dbstore1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:00:12] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:01:26] PROBLEM - puppet last run on db1094 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): User[banyek]
[07:06:14] RECOVERY - puppet last run on db1078 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:06:34] RECOVERY - puppet last run on db1094 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:06:36] (CR) Marostegui: [C: +2] install_server: Install pc1007 [puppet] - https://gerrit.wikimedia.org/r/483854 (https://phabricator.wikimedia.org/T207258) (owner: Marostegui)
[07:06:41] (PS2) Marostegui: install_server: Install pc1007 [puppet] - https://gerrit.wikimedia.org/r/483854 (https://phabricator.wikimedia.org/T207258)
[07:06:49] (CR) Vgutierrez: [C: +1] data.yaml: Remove ssh and email from Balazs [puppet] - https://gerrit.wikimedia.org/r/484156 (owner: Marostegui)
[07:07:08] RECOVERY - puppet last run on db1091 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:07:41] (PS2) Marostegui: data.yaml: Remove ssh and email from Balazs [puppet] - https://gerrit.wikimedia.org/r/484156
[07:09:16] (CR) Marostegui: [C: +2] data.yaml: Remove ssh and email from Balazs [puppet] - https://gerrit.wikimedia.org/r/484156 (owner: Marostegui)
[07:10:54] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:12:54] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[07:13:08] PROBLEM - puppet last run on db1122 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures.
Failed resources (up to 3 shown): User[banyek]
[07:13:14] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[07:13:30] the Equinix OOB is the interface that went down
[07:13:34] (on mr1-eqiad)
[07:14:30] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:15:00] (CR) Giuseppe Lavagetto: [C: -1] "Overall lgtm, see the two comments. The absence of mod_passenger from the frontend is the real showstopper in this case." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/451821 (owner: Dzahn)
[07:16:53] PROBLEM - puppet last run on db1109 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): User[banyek]
[07:17:59] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 3.36 ms
[07:18:09] RECOVERY - puppet last run on db1122 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:18:21] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.50 ms
[07:21:53] RECOVERY - puppet last run on db1109 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:22:57] (CR) Peachey88: "Is there a off-boarding task this can be linked to?" [puppet] - https://gerrit.wikimedia.org/r/484156 (owner: Marostegui)
[07:29:32] Operations, ops-eqiad, DBA, Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['pc1007.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-a...
[07:38:22] (PS1) ArielGlenn: add new xml/sql dumps mirror, freemirror.org [puppet] - https://gerrit.wikimedia.org/r/484160
[07:38:50] (CR) jerkins-bot: [V: -1] add new xml/sql dumps mirror, freemirror.org [puppet] - https://gerrit.wikimedia.org/r/484160 (owner: ArielGlenn)
[07:40:49] (PS2) ArielGlenn: add new xml/sql dumps mirror, freemirror.org [puppet] - https://gerrit.wikimedia.org/r/484160
[07:45:58] (CR) ArielGlenn: [C: +2] add new xml/sql dumps mirror, freemirror.org [puppet] - https://gerrit.wikimedia.org/r/484160 (owner: ArielGlenn)
[07:48:36] !log executed bmc-device --debug --cold-reset on dbstore1002 - "No more sessions available" for mgmt
[07:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:28] (PS2) Muehlenhoff: hhvm: Remove support for trusty/jessie [puppet] - https://gerrit.wikimedia.org/r/483381
[08:05:04] (CR) Muehlenhoff: [C: +2] hhvm: Remove support for trusty/jessie [puppet] - https://gerrit.wikimedia.org/r/483381 (owner: Muehlenhoff)
[08:15:52] (Abandoned) Elukey: [TEST] Remove user elukey [puppet] - https://gerrit.wikimedia.org/r/483791 (owner: Elukey)
[08:20:04] (PS34) Elukey: admin: allow users to be deployed without ssh keys configured [puppet] - https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949)
[08:20:06] (PS1) Elukey: role::analytics_cluster::hadoop::master: add groups without ssh access [puppet] - https://gerrit.wikimedia.org/r/484165 (https://phabricator.wikimedia.org/T212949)
[08:24:33] (CR) Elukey: "I've removed the role::analytics_cluster::hadoop::master example and added in a separate review, so this one can be checked via puppet com" [puppet] - https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) (owner: Elukey)
[08:29:55] Operations, ops-eqiad, DBA, Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258
(ops-monitoring-bot) Completed auto-reimage of hosts: ` ['pc1007.eqiad.wmnet'] ` and were **ALL** successful.
[08:31:26] (PS2) Elukey: role::analytics_cluster::hadoop: add groups without ssh access [puppet] - https://gerrit.wikimedia.org/r/484165 (https://phabricator.wikimedia.org/T212949)
[08:33:29] Operations, ops-eqiad, DBA, Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (Marostegui) pc1007 got installed and looks good: ` root@pc1007:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name...
[08:33:48] Operations, ops-eqiad, DBA, Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (Marostegui)
[08:33:57] Operations, ops-eqiad, DBA, Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (Marostegui) Open→Resolved
[08:38:07] !log Stop MySQL on pc2010 to clone pc1007 - T208383
[08:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:10] T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383
[08:42:15] Operations, Puppet: puppet.git rake fails with ruby 2.5 - https://phabricator.wikimedia.org/T208566 (fgiunchedi) @hashar I initially added this task to ci-infra because it'll be relevant with buster docker/jenkins jobs, is there a task already for that I could piggyback?
[08:44:53] !log Stop mysql on dbstore1002 - T213670
[08:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:56] T213670: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670
[08:46:49] (PS3) Zoranzoki21: Add new throttle rule for Berklee College of Music library [mediawiki-config] - https://gerrit.wikimedia.org/r/483409 (https://phabricator.wikimedia.org/T213311)
[08:50:11] (PS4) Zoranzoki21: Create Portal namespace on shn.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/482508 (https://phabricator.wikimedia.org/T212992)
[08:50:51] Operations, monitoring: Report problems found by mcelog - https://phabricator.wikimedia.org/T197086 (fgiunchedi) >>! In T197086#4865980, @CDanis wrote: > I think this work has mostly already happened? We have some mtail rules for mce events. > https://phabricator.wikimedia.org/source/operations-puppet/b...
[08:53:26] Operations, Thumbor, serviceops, Patch-For-Review, and 2 others: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (jijiki)
[08:53:37] Operations, Thumbor, serviceops, Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (jijiki)
[08:53:39] Operations, Thumbor, serviceops, Patch-For-Review, and 2 others: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (jijiki) Resolved→Open
[08:55:20] (PS2) GTirloni: wmcs::nfs::misc - Fix typo and nsswitch.conf file [puppet] - https://gerrit.wikimedia.org/r/484149 (https://phabricator.wikimedia.org/T209527)
[08:56:21] Operations, Thumbor, serviceops, Patch-For-Review, and 2 others: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (jijiki)
[08:56:25] (CR) GTirloni: [C: +2] wmcs::nfs::misc - Fix typo and nsswitch.conf file [puppet] - https://gerrit.wikimedia.org/r/484149 (https://phabricator.wikimedia.org/T209527)
(owner: GTirloni)
[09:04:35] (PS3) Muehlenhoff: Add support for buster-wikimedia to our internal repository [puppet] - https://gerrit.wikimedia.org/r/483694 (https://phabricator.wikimedia.org/T213527)
[09:05:18] (PS5) Zoranzoki21: Update groupOverrides for srwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/482609 (https://phabricator.wikimedia.org/T213055)
[09:06:00] Operations, ops-eqiad, DBA: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues - https://phabricator.wikimedia.org/T196726 (Marostegui) @Cmjohnson can we request a new DIMM to Dell?
[09:09:59] (CR) Muehlenhoff: [C: +2] Add support for buster-wikimedia to our internal repository [puppet] - https://gerrit.wikimedia.org/r/483694 (https://phabricator.wikimedia.org/T213527) (owner: Muehlenhoff)
[09:20:47] (PS1) Vgutierrez: aptrepo: add component/kernel-proposed-updates to stretch-wikimedia [puppet] - https://gerrit.wikimedia.org/r/484181 (https://phabricator.wikimedia.org/T203194)
[09:30:46] (PS1) Volans: CHANGELOG: add changelogs for release v0.0.13 [software/spicerack] - https://gerrit.wikimedia.org/r/484184
[09:37:25] !log Running aria_chk for all linter tables on dbstore1002 - T213670
[09:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:29] T213670: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670
[09:38:06] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v0.0.13 [software/spicerack] - https://gerrit.wikimedia.org/r/484184 (owner: Volans)
[09:40:10] RECOVERY - puppet last run on cloudstore1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:43:30] Operations, Elasticsearch, Discovery-Search (Current work), Patch-For-Review: Fix prometheus elasticsearch exporter to show all the metrics - https://phabricator.wikimedia.org/T210592 (Gehel) Validation was done by @Mathew.onipe.
.deb is now uploaded to our apt repo
[09:45:09] (PS1) Zoranzoki21: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - https://gerrit.wikimedia.org/r/484186
[09:45:11] (PS1) Zoranzoki21: Update groupOverrides for srwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/484187 (https://phabricator.wikimedia.org/T213679)
[09:45:54] (Abandoned) Zoranzoki21: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - https://gerrit.wikimedia.org/r/484186 (owner: Zoranzoki21)
[09:47:39] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v0.0.13 [software/spicerack] - https://gerrit.wikimedia.org/r/484184 (owner: Volans)
[09:48:05] (CR) DCausse: [C: -1] Elasticsearch failed shard allocation check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: Mathew.onipe)
[09:48:45] (CR) Muehlenhoff: [C: +1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/484181 (https://phabricator.wikimedia.org/T203194) (owner: Vgutierrez)
[09:49:09] (CR) Vgutierrez: [C: +2] aptrepo: add component/kernel-proposed-updates to stretch-wikimedia [puppet] - https://gerrit.wikimedia.org/r/484181 (https://phabricator.wikimedia.org/T203194) (owner: Vgutierrez)
[09:49:17] (CR) jenkins-bot: CHANGELOG: add changelogs for release v0.0.13 [software/spicerack] - https://gerrit.wikimedia.org/r/484184 (owner: Volans)
[09:50:43] (PS1) Volans: Upstream release v0.0.13 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/484188
[09:51:47] !log Running aria_chk for all myisam tables on dbstore1002 T213670
[09:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:50] T213670: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670
[09:57:40] (CR) Volans: [C: +2] Upstream release v0.0.13
[software/spicerack] (debian) - https://gerrit.wikimedia.org/r/484188 (owner: Volans)
[10:01:52] (CR) Volans: [C: +1] "LGTM, although I didn't run a puppetcompiler to verify it." [puppet] - https://gerrit.wikimedia.org/r/483695 (owner: Muehlenhoff)
[10:03:09] (Merged) jenkins-bot: Upstream release v0.0.13 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/484188 (owner: Volans)
[10:07:45] !log install tmpreaper security updates on remaining hosts
[10:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:48] !log uploaded spicerack_0.0.13-1_amd64.deb to apt.wikimedia.org stretch-wikimedia T205884
[10:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:52] T205884: Spicerack: split wmf-auto-reimage-lib into Spicerack modules - https://phabricator.wikimedia.org/T205884
[10:13:23] !log installed spicerack 0.0.13 on cumin2001 for final testing - T205884
[10:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:06] (PS1) Hashar: doc: force users umask for wikidev group [puppet] - https://gerrit.wikimedia.org/r/484194 (https://phabricator.wikimedia.org/T137890)
[10:15:54] PROBLEM - DPKG on relforge1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[10:15:55] (CR) jerkins-bot: [V: -1] doc: force users umask for wikidev group [puppet] - https://gerrit.wikimedia.org/r/484194 (https://phabricator.wikimedia.org/T137890) (owner: Hashar)
[10:16:52] PROBLEM - Check systemd state on relforge1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:17:47] <_joe_> uhm what's up with relforge?
[10:18:16] (PS2) Hashar: doc: force users umask for wikidev group [puppet] - https://gerrit.wikimedia.org/r/484194 (https://phabricator.wikimedia.org/T137890)
[10:18:54] Operations, DBA: correctable memory errors db1068 (commons primary master database) - https://phabricator.wikimedia.org/T213664 (jcrespo) I created to track it, it has gone up to 21 since yesterday. We have to consider the possibility of it crashing due to uncorrectable errors and be prepared for a failo...
[10:19:21] _joe_: that's me!
[10:19:43] <_joe_> onimisionipe: oh ok
[10:19:49] It should be Ok now
[10:19:51] <_joe_> it looks like the prometheus exporter fails
[10:19:52] * fsero wonders if icinga can really downtime things
[10:19:57] :P
[10:20:06] yea.. It does.
[10:20:54] onimisionipe: happens to everyone :)
[10:21:20] Operations, monitoring, Patch-For-Review, Performance-Team (Radar): Provision >= 50% of statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870 (fgiunchedi)
[10:26:02] PROBLEM - puppet last run on elastic1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:26:32] (PS1) Zoranzoki21: Update groupOverrides for srwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/484195 (https://phabricator.wikimedia.org/T213684)
[10:29:38] PROBLEM - puppet last run on relforge1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus-elasticsearch-exporter]
[10:30:04] jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190114T1030).
[10:34:06] (CR) Hashar: "The permissions are somehow wrong from time to time :/" [puppet] - https://gerrit.wikimedia.org/r/484194 (https://phabricator.wikimedia.org/T137890) (owner: Hashar)
[10:39:22] !log start installing systemd security updates for stretch
[10:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:52] RECOVERY - DPKG on relforge1002 is OK: All packages OK
[10:40:02] RECOVERY - puppet last run on relforge1002 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[10:41:24] (PS2) Giuseppe Lavagetto: profile::services_proxy: simple local proxying for remote services [puppet] - https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717)
[10:41:26] (PS3) Giuseppe Lavagetto: mediawiki::common: add proxy for services [puppet] - https://gerrit.wikimedia.org/r/483789 (https://phabricator.wikimedia.org/T210717)
[10:42:09] (CR) jerkins-bot: [V: -1] profile::services_proxy: simple local proxying for remote services [puppet] - https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717) (owner: Giuseppe Lavagetto)
[10:44:20] (CR) Gehel: "minor comment inline. PCC looks good: https://puppet-compiler.wmflabs.org/compiler1002/14318/" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/483798 (https://phabricator.wikimedia.org/T198622) (owner: Mathew.onipe)
[10:47:05] Puppet, ORES, Scoring-platform-team (Current): orespoolcounter1002.eqiad.wmnet reporting compile errors - https://phabricator.wikimedia.org/T213586 (akosiaris) Open→Invalid All are warnings, that is not errors and are safe to ignore. They are about a feature (exported resources[1]) that is no...
[10:49:08] (PS35) Elukey: admin: allow users to be deployed without ssh keys configured [puppet] - https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949)
[10:49:11] (PS3) Elukey: role::analytics_cluster::hadoop: add groups without ssh access [puppet] - https://gerrit.wikimedia.org/r/484165 (https://phabricator.wikimedia.org/T212949)
[10:51:42] RECOVERY - puppet last run on elastic1041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:57:41] (CR) Alexandros Kosiaris: [C: +2] "I 'll merge this in the interest of unblocking the current issue. PCC says it's noop for production anyways, we can always revert with min" [puppet] - https://gerrit.wikimedia.org/r/475714 (https://phabricator.wikimedia.org/T212327) (owner: Alexandros Kosiaris)
[10:57:53] (PS7) Alexandros Kosiaris: Introduce $aggregate_networks, deprecate $all_networks [puppet] - https://gerrit.wikimedia.org/r/475714 (https://phabricator.wikimedia.org/T212327)
[11:00:11] (PS1) Muehlenhoff: Update canary host for Hadoop workers [puppet] - https://gerrit.wikimedia.org/r/484196
[11:02:23] (PS2) Muehlenhoff: Update canary host for Hadoop workers [puppet] - https://gerrit.wikimedia.org/r/484196
[11:02:42] RECOVERY - Check systemd state on relforge1002 is OK: OK - running: The system is fully operational
[11:02:43] (PS2) Alexandros Kosiaris: ferm: Remove unused all_networks erb variable [puppet] - https://gerrit.wikimedia.org/r/483429
[11:03:30] (CR) Alexandros Kosiaris: [C: +2] ferm: Remove unused all_networks erb variable [puppet] - https://gerrit.wikimedia.org/r/483429 (owner: Alexandros Kosiaris)
[11:03:32] (CR) Muehlenhoff: [C: +2] Update canary host for Hadoop workers [puppet] - https://gerrit.wikimedia.org/r/484196 (owner: Muehlenhoff)
[11:03:47] Operations, Citoid, serviceops: Create a readiness probe for zotero - https://phabricator.wikimedia.org/T213689 (fselles)
[11:05:42]
(03PS3) 10Muehlenhoff: Update canary host for Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/484196 [11:09:42] (03PS1) 10Vgutierrez: cache: Add kernel-proposed-updates component for cp1075-99 [puppet] - 10https://gerrit.wikimedia.org/r/484199 (https://phabricator.wikimedia.org/T203194) [11:12:08] (03CR) 10Fsero: [C: 03+1] profile::services_proxy: simple local proxying for remote services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717) (owner: 10Giuseppe Lavagetto) [11:13:15] (03CR) 10Muehlenhoff: [C: 03+1] "One nit, but LGTM." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/484199 (https://phabricator.wikimedia.org/T203194) (owner: 10Vgutierrez) [11:13:48] PROBLEM - DPKG on relforge1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:14:07] common [11:20:47] !log installed spicerack 0.0.13 on cumin1001 - T205884 [11:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:54] T205884: Spicerack: split wmf-auto-reimage-lib into Spicerack modules - https://phabricator.wikimedia.org/T205884 [11:24:35] (03CR) 10Volans: [C: 03+2] API: convert to new Spicerack API [cookbooks] - 10https://gerrit.wikimedia.org/r/479463 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [11:26:21] (03Merged) 10jenkins-bot: API: convert to new Spicerack API [cookbooks] - 10https://gerrit.wikimedia.org/r/479463 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [11:51:08] (03PS3) 10Zfilipin: Add 'suppressredirect' user right to patroller user group at zh.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480768 (https://phabricator.wikimedia.org/T212272) (owner: 10Tulsi Bhagat) [11:57:06] (03PS7) 10Mathew.onipe: Elasticsearch failed shard allocation check [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) [11:57:31] (03CR) 10Mathew.onipe: Elasticsearch failed shard allocation check (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [11:57:33] (03PS2) 10Vgutierrez: cache: Add kernel-proposed-updates component for cp1075-99 [puppet] - 10https://gerrit.wikimedia.org/r/484199 (https://phabricator.wikimedia.org/T203194) [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a European Mid-day SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190114T1200). [12:00:04] Tulsi, TBhagat, Urbanecm, and Zoranzoki21: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:16] I can swat today [12:00:24] Here [12:00:31] Can you process my patches first [12:00:55] (03PS2) 10Mathew.onipe: maps: migrate maps1003 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/483798 (https://phabricator.wikimedia.org/T198622) [12:01:11] Tulsi, TBhagat, Urbanecm: are any patches urgent? no complaints on Zoranzoki21 being the first? [12:01:18] No [12:01:20] and hi zeljkof [12:01:26] hi Urbanecm! [12:01:27] I have no problem with it. 
[12:02:00] Thanks TBhagat and Urbanecm [12:02:06] Yw Zoranzoki21 [12:02:06] ok, deploying the first Zoranzoki21's patch, please stand b< [12:02:07] by [12:02:47] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483409 (https://phabricator.wikimedia.org/T213311) (owner: 10Zoranzoki21) [12:03:02] zeljkof: 483409 no needs testing, it is throttle rule [12:03:14] ok [12:03:56] (03Merged) 10jenkins-bot: Add new throttle rule for Berklee College of Music library [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483409 (https://phabricator.wikimedia.org/T213311) (owner: 10Zoranzoki21) [12:05:19] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:483409|Add new throttle rule for Berklee College of Music library (T213311)]] (duration: 00m 52s) [12:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:22] T213311: Request for temporary lift of IP cap on 2019-01-15 - https://phabricator.wikimedia.org/T213311 [12:05:27] Zoranzoki21: 483409 deployed [12:06:03] Second patch for Portal namespace needs namespaceDupes.php [12:06:40] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482508 (https://phabricator.wikimedia.org/T212992) (owner: 10Zoranzoki21) [12:06:46] Zoranzoki21: ok [12:06:53] thanks for the reminder [12:07:43] (03Merged) 10jenkins-bot: Create Portal namespace on shn.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482508 (https://phabricator.wikimedia.org/T212992) (owner: 10Zoranzoki21) [12:07:56] zeljkof: Oh, zuul is so fast today :) [12:08:30] Zoranzoki21: it is :) 482508 is at mwdebug1002 for testing [12:08:45] * Zoranzoki21 testing [12:09:00] (03CR) 10Mathew.onipe: maps: migrate maps1003 to stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483798 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [12:09:36] zeljkof: looks good, LGTM [12:09:44] Zoranzoki21: ok, deploying 
[12:10:36] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:482508|Create Portal namespace on shn.wikipedia (T212992)]] (duration: 00m 46s) [12:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:39] T212992: Create Portal namespace on shn.wikipedia - https://phabricator.wikimedia.org/T212992 [12:11:29] 10Operations, 10Citoid, 10serviceops, 10Wikimedia-Incident: Zotero service crashes and pages multiple times. - https://phabricator.wikimedia.org/T213693 (10fselles) [12:11:36] zeljkof: ;) [12:12:01] 10Operations, 10Citoid, 10serviceops, 10Kubernetes, 10Wikimedia-Incident: Zotero service crashes and pages multiple times. - https://phabricator.wikimedia.org/T213693 (10fselles) [12:12:07] Zoranzoki21: deployed, script did not find anything T212992#4876873 [12:12:19] Zoranzoki21: you are free to go, thanks for deploying with #releng :) [12:12:31] (and please test the last patch before going) :) [12:13:03] zeljkof: Ok is all, thanks! [12:13:17] Tulsi, TBhagat: you have two nicks? :) [12:13:24] (03PS3) 10Tulsi Bhagat: Configure $wgAddGroups, $wgRemoveGroups and $wgImportSources for ur.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481579 (https://phabricator.wikimedia.org/T212612) [12:13:47] anyway, please stand by, you're next, I'll let you know when the first patch is at mwdebug1002 ready for testing [12:13:57] Hi zeljkof! Yes :) [12:14:07] Sure [12:14:25] Tulsi, TBhagat: which one do you prefer for pings? [12:14:36] so I don't ping both all the time :) [12:14:50] Go for TBhagat. [12:14:51] and let me know if you need help on how to test at mwdebug1002 [12:14:57] ok [12:15:12] (03PS4) 10Zfilipin: Add 'suppressredirect' user right to patroller user group at zh.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480768 (https://phabricator.wikimedia.org/T212272) (owner: 10Tulsi Bhagat) [12:15:29] No [12:15:36] Let's start! 
:) [12:16:10] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480768 (https://phabricator.wikimedia.org/T212272) (owner: 10Tulsi Bhagat) [12:16:46] Urbanecm: there are a lot of patches today, I'll do my best but there's a chance one or both of your commit will not make it [12:17:10] Ok, that's fine [12:17:15] (03Merged) 10jenkins-bot: Add 'suppressredirect' user right to patroller user group at zh.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480768 (https://phabricator.wikimedia.org/T212272) (owner: 10Tulsi Bhagat) [12:18:02] zeljkof, LGTM. Please deploy. [12:18:05] TBhagat: 480768 is at mwdebug1002, please test and let me know if I can deploy [12:18:12] oh, that was fast :) [12:18:17] deploying [12:18:20] hehe [12:18:33] TBhagat: let me know if any patches need scripts to run after deployment [12:18:44] Ok [12:19:19] the best way is to leave a comment in gerrit (which scripts need to run for which patches) [12:19:36] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:480768|Add suppressredirect user right to patroller user group at zh.wikivoyage (T212272)]] (duration: 00m 46s) [12:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:39] T212272: Assign "suppressredirect" to patroller on Chinese Wikivoyage - https://phabricator.wikimedia.org/T212272 [12:19:45] I have already left a comment on 481578 [12:19:53] TBhagat: 480768 deployed, please test [12:19:56] TBhagat: thanks [12:20:14] (03PS4) 10Zfilipin: Configure $wgAddGroups, $wgRemoveGroups and $wgImportSources for ur.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481579 (https://phabricator.wikimedia.org/T212612) (owner: 10Tulsi Bhagat) [12:20:23] 480768 Working fine. 
[12:21:53] (03CR) 10jenkins-bot: Add new throttle rule for Berklee College of Music library [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483409 (https://phabricator.wikimedia.org/T213311) (owner: 10Zoranzoki21) [12:21:55] (03CR) 10jenkins-bot: Create Portal namespace on shn.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482508 (https://phabricator.wikimedia.org/T212992) (owner: 10Zoranzoki21) [12:21:57] (03CR) 10jenkins-bot: Add 'suppressredirect' user right to patroller user group at zh.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480768 (https://phabricator.wikimedia.org/T212272) (owner: 10Tulsi Bhagat) [12:23:20] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481579 (https://phabricator.wikimedia.org/T212612) (owner: 10Tulsi Bhagat) [12:24:48] (03Merged) 10jenkins-bot: Configure $wgAddGroups, $wgRemoveGroups and $wgImportSources for ur.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481579 (https://phabricator.wikimedia.org/T212612) (owner: 10Tulsi Bhagat) [12:25:21] TBhagat: 481579 is at mwdebug1002, please test [12:25:53] * TBhagat testing [12:26:19] (03PS1) 10Jcrespo: mariadb: Depool db1081 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484213 (https://phabricator.wikimedia.org/T213664) [12:26:47] 481579 LGTM, Please deploy. [12:26:54] ok [12:27:54] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:481579|Configure $wgAddGroups, $wgRemoveGroups and $wgImportSources for ur.wiki (T212612)]] (duration: 00m 46s) [12:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:57] T212612: Add transwiki import on Urdu Wikipedia - https://phabricator.wikimedia.org/T212612 [12:28:20] TBhagat: it's deployed, please test [12:28:59] zeljkof, 481579 Working fine. [12:30:25] ok, moving on [12:30:33] Sure [12:31:21] Should i rebase 481578? 
[12:31:39] TBhagat: I'll rebase as needed [12:31:48] ok [12:32:04] (03PS4) 10Zfilipin: Configure $wgNamespaceAliases for yue.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481578 (https://phabricator.wikimedia.org/T212678) (owner: 10Tulsi Bhagat) [12:33:02] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481578 (https://phabricator.wikimedia.org/T212678) (owner: 10Tulsi Bhagat) [12:33:30] zeljkof, Reminder: change 481578 - Requires `namespaceDupes.php --wiki=yuewiktionary --fix` to be run after deployment. [12:33:43] TBhagat: thanks, will do [12:34:06] (03Merged) 10jenkins-bot: Configure $wgNamespaceAliases for yue.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481578 (https://phabricator.wikimedia.org/T212678) (owner: 10Tulsi Bhagat) [12:34:55] TBhagat: 481578 is at mwdebug1002 [12:35:25] testing [12:35:56] (03CR) 10jenkins-bot: Configure $wgAddGroups, $wgRemoveGroups and $wgImportSources for ur.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481579 (https://phabricator.wikimedia.org/T212612) (owner: 10Tulsi Bhagat) [12:35:58] (03CR) 10jenkins-bot: Configure $wgNamespaceAliases for yue.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481578 (https://phabricator.wikimedia.org/T212678) (owner: 10Tulsi Bhagat) [12:36:16] 481578 LGTM [12:36:24] ok, deploying [12:36:24] please deploy [12:37:37] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:481578|Configure $wgNamespaceAliases for yue.wiktionary (T212678)]] (duration: 00m 45s) [12:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:41] T212678: Add namespace aliases for yuewiktionary - https://phabricator.wikimedia.org/T212678 [12:38:42] TBhagat: deployed, script ran, did not find anything to fix https://phabricator.wikimedia.org/T212678#4876955 [12:39:09] Gr8. Let's move on. 
[12:39:51] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483737 (https://phabricator.wikimedia.org/T213023) (owner: 10Tulsi Bhagat) [12:42:28] (03CR) 10Zfilipin: Configure $wgImportSources for ne.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483737 (https://phabricator.wikimedia.org/T213023) (owner: 10Tulsi Bhagat) [12:42:33] (03PS2) 10Zfilipin: Configure $wgImportSources for ne.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483737 (https://phabricator.wikimedia.org/T213023) (owner: 10Tulsi Bhagat) [12:42:42] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483737 (https://phabricator.wikimedia.org/T213023) (owner: 10Tulsi Bhagat) [12:43:48] (03Merged) 10jenkins-bot: Configure $wgImportSources for ne.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483737 (https://phabricator.wikimedia.org/T213023) (owner: 10Tulsi Bhagat) [12:44:55] !log zfilipin@deploy1001 sync-file aborted: SWAT: [[gerrit:481578|Configure $wgNamespaceAliases for yue.wiktionary (T212678)]] (duration: 00m 01s) [12:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:58] T212678: Add namespace aliases for yuewiktionary - https://phabricator.wikimedia.org/T212678 [12:45:29] 10Operations, 10monitoring, 10Patch-For-Review, 10User-CDanis, 10User-fgiunchedi: Better organization for SRE grafana dashboards - https://phabricator.wikimedia.org/T178690 (10jcrespo) I've just seen a dashboard I use is scheduled for deletion. I don't see the replacement as particularly better and lacki... [12:45:31] oops, this ^ is my mistake, wrong link from bash history :( aborted after a second [12:45:57] TBhagat: 483737 is at mwdebug [12:46:49] 483737 LGTM, Please deploy. 
[12:47:41] ok [12:48:05] Urbanecm: please stand by, you're next :) [12:48:07] ack [12:48:38] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:483737|Configure $wgImportSources for ne.wiktionary (T213023)]] (duration: 00m 45s) [12:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:40] T213023: Enable import feature in Nepali Wiktionary - https://phabricator.wikimedia.org/T213023 [12:48:56] TBhagat: it's deployed, please test and thanks for deploying with #releng :) [12:49:12] (03CR) 10jenkins-bot: Configure $wgImportSources for ne.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483737 (https://phabricator.wikimedia.org/T213023) (owner: 10Tulsi Bhagat) [12:49:20] (03PS2) 10Zfilipin: Localisation of Babel categories on nap.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481244 (https://phabricator.wikimedia.org/T123188) (owner: 10Urbanecm) [12:50:24] zeljkof, Thank you so much! Have a good time ahead! ;) [12:50:46] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481244 (https://phabricator.wikimedia.org/T123188) (owner: 10Urbanecm) [12:52:20] (03Merged) 10jenkins-bot: Localisation of Babel categories on nap.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481244 (https://phabricator.wikimedia.org/T123188) (owner: 10Urbanecm) [12:53:15] Urbanecm: 481244 is at mwdebug, please test [12:53:32] will do [12:55:44] looks to be working, please deploy zeljkof [12:55:51] ok [12:56:28] 10Operations, 10monitoring, 10Patch-For-Review, 10User-CDanis, 10User-fgiunchedi: Better organization for SRE grafana dashboards - https://phabricator.wikimedia.org/T178690 (10CDanis) Jaime, going to have to guess here; are you referring to [[ https://grafana.wikimedia.org/d/000000274/prometheus-machine-... 
[12:56:50] PROBLEM - DPKG on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [12:56:53] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:481244|Localisation of Babel categories on nap.wikipedia.org (T123188)]] (duration: 00m 44s) [12:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:56] T123188: Localisation of user categories for nap.wikipedia - https://phabricator.wikimedia.org/T123188 [12:57:06] PROBLEM - proton endpoints health on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [12:57:14] PROBLEM - dhclient process on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [12:57:22] PROBLEM - Disk space on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [12:57:24] PROBLEM - Check whether ferm is active by checking the default input chain on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [12:57:24] PROBLEM - configured eth on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [12:57:36] PROBLEM - Check systemd state on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [12:57:38] PROBLEM - Check size of conntrack table on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [12:57:39] Urbanecm: deployed, please test [12:57:43] thx [12:58:16] (03PS3) 10Zfilipin: Add http://mbc.cyfrowemazowsze.pl to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481108 (https://phabricator.wikimedia.org/T212469) (owner: 10Urbanecm) [12:58:43] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481108 (https://phabricator.wikimedia.org/T212469) (owner: 10Urbanecm) [12:58:48] PROBLEM - puppet last run on proton1002 is CRITICAL: connect to address 10.64.32.61 port 
5666: Connection refused [12:59:49] (03Merged) 10jenkins-bot: Add http://mbc.cyfrowemazowsze.pl to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481108 (https://phabricator.wikimedia.org/T212469) (owner: 10Urbanecm) [13:00:52] Urbanecm: 481108 is at mwdebug [13:00:56] thanks [13:01:48] Urbanecm: can I deploy it? [13:01:51] is anyone aware why nrpe died on proton1002 / [13:01:53] ? [13:01:57] zeljkof, yes [13:01:58] should I restart it ? [13:02:06] Urbanecm: ok, deploying [13:02:16] (03CR) 10jenkins-bot: Localisation of Babel categories on nap.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481244 (https://phabricator.wikimedia.org/T123188) (owner: 10Urbanecm) [13:02:18] (03CR) 10jenkins-bot: Add http://mbc.cyfrowemazowsze.pl to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481108 (https://phabricator.wikimedia.org/T212469) (owner: 10Urbanecm) [13:03:03] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:481108|Add http://mbc.cyfrowemazowsze.pl to $wgCopyUploadsDomains (T212469)]] (duration: 00m 46s) [13:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:08] T212469: Add http://mbc.cyfrowemazowsze.pl to $wgCopyUploadsDomains - https://phabricator.wikimedia.org/T212469 [13:03:15] Urbanecm: all deployed, please test and thanks for deploying with #releng ;) [13:03:20] thanks zeljkof [13:03:22] !log eu swat finished [13:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:48] (03PS6) 10CDanis: Reference grafana dashboards by UID for alerting. [puppet] - 10https://gerrit.wikimedia.org/r/483820 (owner: 10Ppchelko) [13:07:16] 10Operations, 10Puppet, 10Continuous-Integration-Config: puppet.git rake fails with ruby 2.5 - https://phabricator.wikimedia.org/T208566 (10hashar) The `Gemfile` uses `puppet ~> 4.8.2` which is the version provided by `jessie-backports` and `stretch`. 
The CI job installs it from rubygems hence we lack monke... [13:07:38] (03CR) 10CDanis: [C: 03+2] Reference grafana dashboards by UID for alerting. [puppet] - 10https://gerrit.wikimedia.org/r/483820 (owner: 10Ppchelko) [13:08:32] RECOVERY - Check systemd state on proton1002 is OK: OK - running: The system is fully operational [13:08:34] RECOVERY - Check size of conntrack table on proton1002 is OK: OK: nf_conntrack is 0 % full [13:08:52] (03CR) 10CDanis: [C: 03+2] "Merged and puppet-merged. Thanks again Petr!" [puppet] - 10https://gerrit.wikimedia.org/r/483820 (owner: 10Ppchelko) [13:09:00] RECOVERY - DPKG on proton1002 is OK: All packages OK [13:09:14] RECOVERY - puppet last run on proton1002 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [13:09:18] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [13:09:22] RECOVERY - dhclient process on proton1002 is OK: PROCS OK: 0 processes with command name dhclient [13:09:27] I restarted nrpe on proton1002 for now [13:09:30] RECOVERY - Disk space on proton1002 is OK: DISK OK [13:09:32] RECOVERY - Check whether ferm is active by checking the default input chain on proton1002 is OK: OK ferm input default policy is set [13:09:32] RECOVERY - configured eth on proton1002 is OK: OK - interfaces up [13:09:57] I am not investigating any further, if nrpe dies again, we could dig deeper [13:10:31] !log Restarted nrpe on proton1002 [13:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:47] 10Operations, 10monitoring, 10Patch-For-Review, 10User-CDanis, 10User-fgiunchedi: Better organization for SRE grafana dashboards - https://phabricator.wikimedia.org/T178690 (10jcrespo) >>! In T178690#4876994, @CDanis wrote: > Jaime, going to have to guess here; are you referring to [[ https://grafana.wik... 
[13:16:29] 10Operations, 10MediaWiki Language Extension Bundle, 10MediaWiki-Cache, 10Language-Team (Language-2019-January-March), and 5 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10Nikerabbit) The new patch sh... [13:19:20] (03PS1) 10Filippo Giunchedi: hieradata: increase default kafka partitions for logging cluster [puppet] - 10https://gerrit.wikimedia.org/r/484226 (https://phabricator.wikimedia.org/T213081) [13:23:47] jijiki: in such cases it's often the oomkiller which kills unrelated processes [13:24:33] moritzm: yep [13:24:44] but proton spawns chromium instances [13:24:52] so maybe one got out of hand [13:25:30] I will keep an eye [13:27:22] https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=proton1002&var-datasource=eqiad%20prometheus%2Fops&var-cluster=proton [13:34:22] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: letsencrypt puppetization: upgrade for scalability - https://phabricator.wikimedia.org/T134447 (10Krenair) 05Open→03Resolved I think at this point the route forward is certcentral and there's not much point keeping this particular ticket open. Feel... [13:34:25] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548 (10Krenair) [13:34:35] (03CR) 10Filippo Giunchedi: "Luca/Andrew LMK what you think! 
Straightforward for new topics I'd say, for existing topics should be fine too (see task)" [puppet] - 10https://gerrit.wikimedia.org/r/484226 (https://phabricator.wikimedia.org/T213081) (owner: 10Filippo Giunchedi) [13:34:41] 10Operations, 10Traffic, 10HTTPS: letsencrypt puppetization: add parallel rsa+ecdsa cert support - https://phabricator.wikimedia.org/T141266 (10Krenair) 05Open→03Resolved I think at this point the route forward is certcentral and there's not much point keeping this particular ticket open. Feel free to re... [13:34:45] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: letsencrypt puppetization: upgrade for scalability - https://phabricator.wikimedia.org/T134447 (10Krenair) [13:37:13] !log akosiaris@deploy1001 scap-helm zotero upgrade -f zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: codfw] [13:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:21] !log akosiaris@deploy1001 scap-helm zotero upgrade production -f zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: codfw] [13:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:16] (03PS19) 10DCausse: [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) [13:38:18] (03PS19) 10DCausse: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) [13:38:20] (03PS21) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [13:40:31] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - zoterov2_1968: Servers kubernetes2002.codfw.wmnet, kubernetes2001.codfw.wmnet are marked down but pooled: zotero_1969: Servers kubernetes2002.codfw.wmnet, kubernetes2001.codfw.wmnet are marked down 
but pooled [13:41:39] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - zoterov2_1968: Servers kubernetes2002.codfw.wmnet, kubernetes2004.codfw.wmnet are marked down but pooled: zotero_1969: Servers kubernetes2004.codfw.wmnet, kubernetes2001.codfw.wmnet are marked down but pooled [13:41:54] !log rollback zotero codfw deployment [13:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:26] !log akosiaris@deploy1001 scap-helm zotero upgrade production --dry-run --debug -f zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: codfw] [13:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:27] !log akosiaris@deploy1001 scap-helm zotero cluster codfw completed [13:42:27] !log akosiaris@deploy1001 scap-helm zotero finished [13:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:51] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [13:42:55] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [13:46:19] (03PS3) 10Jgreen: Add SHA256 selector record for fundraising mail contractor (IBM/Silverpop). [dns] - 10https://gerrit.wikimedia.org/r/483294 (https://phabricator.wikimedia.org/T210445) [13:47:05] (03CR) 10Jgreen: [C: 03+2] Add SHA256 selector record for fundraising mail contractor (IBM/Silverpop). 
[dns] - 10https://gerrit.wikimedia.org/r/483294 (https://phabricator.wikimedia.org/T210445) (owner: 10Jgreen) [13:48:43] !log creating testcommonswiki index in the omega search-elastic cluster (eqiad & codfw) [13:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:05] !log authdns update for T210445 [13:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:08] T210445: Stronger DKIM key for fundraising emails? - https://phabricator.wikimedia.org/T210445 [13:51:08] 10Operations, 10Fundraising-Backlog, 10Mail, 10fundraising-tech-ops, 10Patch-For-Review: Stronger DKIM key for fundraising emails? - https://phabricator.wikimedia.org/T210445 (10Jgreen) >>! In T210445#4867737, @Jgreen wrote: >>>! In T210445#4867613, @bsisolak wrote: >> The key is correct, and IBM will va... [13:51:10] (03PS1) 10Jbond: Remove user imarlier as part of the off boarding process [puppet] - 10https://gerrit.wikimedia.org/r/484231 [13:51:41] !log akosiaris@deploy1001 scap-helm zotero upgrade production -f zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: codfw] [13:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:42] !log akosiaris@deploy1001 scap-helm zotero cluster codfw completed [13:51:42] !log akosiaris@deploy1001 scap-helm zotero finished [13:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:12] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/484231 (owner: 10Jbond) [13:58:09] (03CR) 10Jbond: [C: 03+2] Remove user imarlier as part of the off boarding process [puppet] - 10https://gerrit.wikimedia.org/r/484231 (owner: 10Jbond) [14:04:13] !log Add pc1007 to tendril and zarcillo - T208383 [14:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:16] T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 [14:07:32] (03CR) 10Elukey: "I'll wait for Andrew to comment since he knows best, but I'd avoid to cross the 3 partitions unless there is a special need for high traff" [puppet] - 10https://gerrit.wikimedia.org/r/484226 (https://phabricator.wikimedia.org/T213081) (owner: 10Filippo Giunchedi) [14:09:08] (03CR) 10Marostegui: [C: 03+1] mariadb: Depool db1081 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484213 (https://phabricator.wikimedia.org/T213664) (owner: 10Jcrespo) [14:10:34] !log akosiaris@deploy1001 scap-helm zotero upgrade production -f zotero-values-eqiad.yaml stable/zotero [namespace: zotero, clusters: eqiad] [14:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:42] !log akosiaris@deploy1001 scap-helm zotero upgrade production --debug -f zotero-values-eqiad.yaml stable/zotero [namespace: zotero, clusters: eqiad] [14:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:26] 10Operations: Offboard Balazs - https://phabricator.wikimedia.org/T213703 (10MoritzMuehlenhoff) [14:12:33] (03PS1) 10Arturo Borrero Gonzalez: toolforge: aptly: create stretch/jessie repos [puppet] - 10https://gerrit.wikimedia.org/r/484233 (https://phabricator.wikimedia.org/T213421) [14:12:42] 10Operations: Offboard Balazs - https://phabricator.wikimedia.org/T213703 (10MoritzMuehlenhoff) p:05Triage→03Normal a:03jbond [14:13:19] !log akosiaris@deploy1001 scap-helm zotero 
upgrade production -f zotero-values-eqiad.yaml stable/zotero [namespace: zotero, clusters: eqiad] [14:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:26] !log akosiaris@deploy1001 scap-helm zotero cluster eqiad completed [14:13:26] !log akosiaris@deploy1001 scap-helm zotero finished [14:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:19] (03PS2) 10Arturo Borrero Gonzalez: toolforge: aptly: create stretch/jessie repos [puppet] - 10https://gerrit.wikimedia.org/r/484233 (https://phabricator.wikimedia.org/T213421) [14:16:49] RECOVERY - DPKG on relforge1001 is OK: All packages OK [14:18:03] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: aptly: create stretch/jessie repos [puppet] - 10https://gerrit.wikimedia.org/r/484233 (https://phabricator.wikimedia.org/T213421) (owner: 10Arturo Borrero Gonzalez) [14:18:46] !log elasticsearch (search cluster): pre-populating omega & psi clusters in eqiad & codfw (from mwmaint1002 and mwmaint2001 respectively) (T210381) [14:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:49] T210381: Update mw-config to use the psi&omega elastic clusters - https://phabricator.wikimedia.org/T210381 [14:18:53] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [14:19:01] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [14:19:09] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [14:19:13] RECOVERY - MariaDB Slave SQL: s6 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [14:20:28] (03PS1) 10CDanis: Fixes to check_grafana_alert [puppet] - 10https://gerrit.wikimedia.org/r/484234 (https://phabricator.wikimedia.org/T213506) [14:21:07] (03CR) 10CDanis: [C: 03+2] Fixes to check_grafana_alert 
[puppet] - 10https://gerrit.wikimedia.org/r/484234 (https://phabricator.wikimedia.org/T213506) (owner: 10CDanis) [14:21:18] (03PS2) 10CDanis: Fixes to check_grafana_alert [puppet] - 10https://gerrit.wikimedia.org/r/484234 (https://phabricator.wikimedia.org/T213506) [14:23:37] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): User[imarlier] [14:25:38] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [14:27:14] RECOVERY - MariaDB Slave SQL: s8 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [14:27:18] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [14:27:32] (03CR) 10Volans: "recheck" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/483131 (owner: 10Volans) [14:28:13] \o/ [14:29:06] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [14:29:56] (03PS2) 10Arturo Borrero Gonzalez: toolforge: refactor docker registry profile [puppet] - 10https://gerrit.wikimedia.org/r/483765 (https://phabricator.wikimedia.org/T213418) [14:30:01] (03CR) 10Andrew Bogott: [C: 04-1] "The python/yaml changes look good to me except for one question inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483606 (owner: 10BryanDavis) [14:31:03] (03CR) 10Andrew Bogott: [C: 03+1] Enable base::service_auto_restart for uwsgi-striker [puppet] - 10https://gerrit.wikimedia.org/r/483114 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:31:59] (03CR) 10Volans: "recheck" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443366 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [14:32:04] (03CR) 10Volans: "recheck" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443367 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [14:32:10] (03CR) 10Volans: "recheck" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443368 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [14:33:21] 10Operations, 10TCB-Team, 10WMF-JobQueue, 10monitoring, and 3 others: Grafana alerting broken after upgrade to 5.0.0 - https://phabricator.wikimedia.org/T213506 (10CDanis) 05Open→03Resolved [14:34:04] (03CR) 10Hashar: "We can fix the permissions ourselves once we are granded sudo as the doc-publisher user T213169" [puppet] - 10https://gerrit.wikimedia.org/r/484194 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [14:35:51] 10Operations, 10MediaWiki Language Extension Bundle, 10MediaWiki-Cache, 10Language-Team (Language-2019-January-March), and 5 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) @Nikerabbit looking... [14:36:06] !log uploaded python{,3}-phabricator 0.7.0-2~wmf1 to apt.w.o T205884 (upstream removes egg files) [14:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:08] T205884: Spicerack: split wmf-auto-reimage-lib into Spicerack modules - https://phabricator.wikimedia.org/T205884 [14:37:18] moritzm: have a moment to update the topic to put me on ops duty? 
[14:37:36] (03CR) 10Andrew Bogott: [C: 03+1] "LGTM! Has this been tested in toolforge already via cherry-pick? If not I'd like to do that before merging." [puppet] - 10https://gerrit.wikimedia.org/r/482237 (https://phabricator.wikimedia.org/T87001) (owner: 10BryanDavis) [14:37:38] 10Operations, 10Citoid, 10serviceops, 10Kubernetes, 10Wikimedia-Incident: Zotero service crashes and pages multiple times. - https://phabricator.wikimedia.org/T213693 (10CDanis) p:05Triage→03Normal [14:38:02] 10Operations, 10Citoid, 10serviceops, 10Patch-For-Review: Create a readiness probe for zotero - https://phabricator.wikimedia.org/T213689 (10CDanis) p:05Triage→03Normal [14:39:21] !log updated python3-phabricator on cumin[12]001 T205884 [14:39:22] 10Operations, 10DBA, 10Patch-For-Review: correctable memory errors db1068 (commons primary master database) - https://phabricator.wikimedia.org/T213664 (10CDanis) p:05Triage→03High [14:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:24] cdanis: sure, on it [14:41:12] cdanis: done [14:41:14] !log anomie@mwmaint1002 Running migrateActors.php on remaining section 3 wikis for T188327. This may cause lag in codfw. [14:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:17] T188327: Deploy refactored actor storage - https://phabricator.wikimedia.org/T188327 [14:41:47] 10Operations, 10Puppet, 10Continuous-Integration-Config: puppet.git rake fails with ruby 2.5 - https://phabricator.wikimedia.org/T208566 (10CDanis) p:05Triage→03Normal [14:41:52] 10Operations, 10Certcentral, 10Traffic, 10Goal: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 (10Vgutierrez) p:05Triage→03Normal [14:41:53] !log anomie@mwmaint1002 Running migrateActors.php on section 1 wikis for T188327. This may cause lag in codfw. 
[14:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:01] !log anomie@mwmaint1002 Running migrateActors.php on section 2 wikis for T188327. This may cause lag in codfw. [14:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:07] !log anomie@mwmaint1002 Running migrateActors.php on section 4 wikis for T188327. This may cause lag in codfw. [14:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:10] !log anomie@mwmaint1002 Running migrateActors.php on section 5 wikis for T188327. This may cause lag in codfw. [14:42:12] !log anomie@mwmaint1002 Running migrateActors.php on section 6 wikis for T188327. This may cause lag in codfw. [14:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:14] !log anomie@mwmaint1002 Running migrateActors.php on section 7 wikis for T188327. This may cause lag in codfw. [14:42:16] !log anomie@mwmaint1002 Running migrateActors.php on section 8 wikis for T188327. This may cause lag in codfw. [14:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:18] !log anomie@mwmaint1002 Running migrateActors.php on wikitech for T188327. This may cause lag in codfw. 
[14:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:20] 10Operations, 10Certcentral, 10Traffic, 10Goal: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 (10Vgutierrez) [14:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:27] (03CR) 10Vgutierrez: [C: 03+2] cache: Add kernel-proposed-updates component for cp1075-99 [puppet] - 10https://gerrit.wikimedia.org/r/484199 (https://phabricator.wikimedia.org/T203194) (owner: 10Vgutierrez) [14:49:35] (03PS3) 10Vgutierrez: cache: Add kernel-proposed-updates component for cp1075-99 [puppet] - 10https://gerrit.wikimedia.org/r/484199 (https://phabricator.wikimedia.org/T203194) [14:51:10] !log akosiaris@deploy1001 scap-helm zotero upgrade production -f zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: codfw] [14:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:11] !log akosiaris@deploy1001 scap-helm zotero cluster codfw completed [14:51:12] !log akosiaris@deploy1001 scap-helm zotero finished [14:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:25] !log upgrade zotero pods to 2019-01-14-115905-candidate in codfw T213693 [14:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:27] T213693: Zotero service crashes and pages multiple times. - https://phabricator.wikimedia.org/T213693 [14:52:49] 10Operations, 10DBA, 10Patch-For-Review: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui) pc1007 is now up and replicating. It is catching up. Tomorrow I will replace pc1010 with pc1007 for consistency with codfw... 
[14:56:17] (03PS3) 10Revi: Change links of wgGEHelpPanelLinks for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483996 (https://phabricator.wikimedia.org/T209467) [14:57:35] 10Operations, 10monitoring, 10Goal: Upgrade production prometheus-node-exporter to >= 0.16 - https://phabricator.wikimedia.org/T213708 (10fgiunchedi) p:05Triage→03Normal [14:57:49] !log Drop table tag_summary from s6 - T212255 [14:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:52] T212255: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 [15:00:38] !log ran systemctl reset-failed on relforge1001 [15:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:15] (03PS1) 10Volans: spicerack: fix version [software/spicerack] - 10https://gerrit.wikimedia.org/r/484239 (https://phabricator.wikimedia.org/T205884) [15:02:18] !log upgrading kernel in cp1075 to 4.9.144-1 - T203194 [15:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:21] T203194: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 [15:02:26] !log imported debdeploy 0.0.99.6-1+deb10u1 for buster-wikimedia (T213527) [15:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:31] T213527: Prepare our base system layer for Debian buster - https://phabricator.wikimedia.org/T213527 [15:04:33] !log akosiaris@deploy1001 scap-helm zotero upgrade production -f zotero-values-eqiad.yaml stable/zotero [namespace: zotero, clusters: eqiad] [15:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:34] !log akosiaris@deploy1001 scap-helm zotero cluster eqiad completed [15:04:34] !log akosiaris@deploy1001 scap-helm zotero finished [15:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:51] !log 
upgrade zotero pods to 2019-01-14-115905-candidate in eqiad T213693 [15:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:53] T213693: Zotero service crashes and pages multiple times. - https://phabricator.wikimedia.org/T213693 [15:07:06] 10Operations, 10Citoid, 10serviceops, 10Patch-For-Review, 10Wikimedia-Incident: allow zotero container nodejs server to define the amount of heap used instead of the fixed limit of 1.7Gi - https://phabricator.wikimedia.org/T213414 (10akosiaris) p:05Triage→03Normal An image that allows overriding the... [15:08:49] !log testing switchdc cookbooks in DRY-RUN mode w/ latest spicerack T205884 (no real changes expected) [15:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:52] T205884: Spicerack: split wmf-auto-reimage-lib into Spicerack modules - https://phabricator.wikimedia.org/T205884 [15:09:12] 10Operations, 10Citoid, 10serviceops, 10Kubernetes, 10Wikimedia-Incident: Zotero service crashes and pages multiple times. - https://phabricator.wikimedia.org/T213693 (10akosiaris) p:05Normal→03Low We have already identified a specific url that was able to send zotero in what appear like a busy loop.... 
[15:11:31] (03PS1) 10Marostegui: db-eqiad.php: Depool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484241 (https://phabricator.wikimedia.org/T85757) [15:13:52] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484241 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [15:14:21] 10Operations, 10monitoring: Upgrade to Prometheus 2.x - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) [15:15:20] 10Operations, 10monitoring: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) [15:15:25] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484241 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [15:16:51] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db1105:3311 T85757 (duration: 00m 46s) [15:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:53] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [15:16:55] !log Deploy schema change on db1105:3311 - T85757 [15:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:25] PROBLEM - MariaDB Slave Lag: m5 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.72 seconds [15:25:18] ^ checking [15:25:29] I was too [15:26:11] it is gone now [15:26:14] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484241 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [15:26:28] it comes and goes [15:26:40] the master is delayed [15:27:10] QPS has gone from 0 to 1000 [15:27:35] at around 14:40-14:43 [15:27:59] lots of updates [15:28:37] maybe something from the cloud team? as it is m5? [15:28:48] gtirloni: ^ anything that might be hitting m5 dbs? 
[15:28:57] I think I know what it is [15:29:00] wikitech upgrade [15:29:10] MigrateActors::migrate [15:29:14] maybe andrewbogott [15:29:32] that is from anomie [15:29:41] oh [15:29:41] ˜/logmsgbot 15:42> !log anomie@mwmaint1002 Running migrateActors.php on wikitech for T188327. This may cause lag in codfw. [15:29:42] T188327: Deploy refactored actor storage - https://phabricator.wikimedia.org/T188327 [15:30:01] well, it makes sense as wait for replication doesn't really work there [15:30:08] maybe we could make it work? [15:30:31] we need to move wikitech to s5 :) [15:30:52] "wikitech for T188327. This may cause lag in codfw." [15:31:01] so it is expected, there is not much to do [15:31:20] yep [15:31:24] sorry, gtirloni andrewbogott it was not your maintenance [15:31:41] np [15:31:46] Oops, I forgot to update the Deployments page on wikitech. I'll go do that now. [15:32:14] it is ok, you logged, it is just I didn't see it because I just returned to my seat [15:32:49] andrewbogott: there was some monitoring about wikitech-static some time ago [15:33:01] not sure if you saw it [15:33:33] i didn't but I'll make a note to look later. 
Often mu.tante is also on top of those [15:33:40] ok, sorry [15:33:48] not very urgent anyway [15:33:48] !log rolling restart of cp1076-cp1090 to upgrade to kernel 4.9.144 - T203194 [15:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:51] T203194: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 [15:35:53] RECOVERY - MariaDB Slave Lag: m5 on db2078 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [15:36:51] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.21 seconds [15:42:01] (03PS1) 10Mathew.onipe: elasticsearch: mask default exporter service [puppet] - 10https://gerrit.wikimedia.org/r/484243 (https://phabricator.wikimedia.org/T210592) [15:44:18] 10Operations, 10monitoring: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) The list of production Prometheus instances as of today is (gathered from grafana datasources) http://prometheus-labmon.eqiad.wmnet/labs http://prometheus.svc... [15:44:28] !log downscaling old zotero-production-645dccfb64 replicaset on eqiad [15:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:56] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.19 seconds [15:45:10] PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.24 seconds [15:45:14] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.62 seconds [15:45:32] PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.49 seconds [15:45:58] !log Running cleanupUsersWithNoIds.php on labswiki and labtestwiki, apparently they were left out when that was done for all other wikis (and so caused issues with the migrateActors.php run). 
[15:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:53] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T206965 (10Marostegui) >>! In T206965#4694827, @Cmjohnson wrote: > @elukey dbstore1002 is out of warranty and has 1.2T disks. I don't have disks this size but can replace with a 2TB disk.. Let's do it Th... [15:52:43] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T206965 (10elukey) a:03Cmjohnson [15:53:21] (03CR) 10Mathew.onipe: "PCC Output looks good: https://puppet-compiler.wmflabs.org/compiler1002/14323/" [puppet] - 10https://gerrit.wikimedia.org/r/484243 (https://phabricator.wikimedia.org/T210592) (owner: 10Mathew.onipe) [15:53:31] (03PS2) 10Jcrespo: mariadb: Depool db1081 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484213 (https://phabricator.wikimedia.org/T213664) [15:54:50] (03PS1) 10Addshore: Introduce wmgWikibaseMaxItemIdForNewPropertyIdHtmlFormatter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484245 (https://phabricator.wikimedia.org/T201831) [15:55:58] (03PS1) 10Gehel: wdqs: prometheus-blazegraph-exporter supports multi instances [puppet] - 10https://gerrit.wikimedia.org/r/484246 (https://phabricator.wikimedia.org/T213234) [15:56:53] (03CR) 10jerkins-bot: [V: 04-1] wdqs: prometheus-blazegraph-exporter supports multi instances [puppet] - 10https://gerrit.wikimedia.org/r/484246 (https://phabricator.wikimedia.org/T213234) (owner: 10Gehel) [15:57:14] !log akosiaris@deploy1001 scap-helm zotero [namespace: zotero, clusters: eqiad] [15:57:14] !log akosiaris@deploy1001 scap-helm zotero cluster eqiad completed [15:57:14] !log akosiaris@deploy1001 scap-helm zotero finished [15:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:17] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:18] (03PS1) 10Addshore: wmgWikibaseMaxItemIdForNewPropertyIdHtmlFormatter 3000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484247 (https://phabricator.wikimedia.org/T201831) [15:57:33] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 445.06 seconds [15:58:24] (03PS1) 10Addshore: wmgWikibaseMaxItemIdForNewPropertyIdHtmlFormatter fully on [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484248 (https://phabricator.wikimedia.org/T201831) [15:58:33] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484249 [15:58:35] ? [15:58:41] that's wrong... I did nothing [15:58:53] scap-helm should not have logged anything [15:58:57] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [15:59:23] PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 461.95 seconds [15:59:25] PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 462.32 seconds [15:59:31] jouncebot: now [15:59:31] No deployments scheduled for the next 2 hour(s) and 0 minute(s) [15:59:34] jouncebot: next [15:59:34] In 2 hour(s) and 0 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190114T1800) [16:00:05] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 64 ESP OK [16:00:07] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484249 (owner: 10Marostegui) [16:01:21] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484249 (owner: 10Marostegui) [16:01:25] jouncebot reload [16:01:33] jouncebot update [16:01:43] oh, i never remember... 
[16:01:56] jouncebot: refresh [16:01:57] I refreshed my knowledge about deployments. [16:01:57] no? [16:01:59] :D [16:02:01] there you go! [16:02:07] jouncebot: next [16:02:07] In 162 hour(s) and 27 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190121T1030) [16:02:18] hmmmp, did I break it now >.> [16:02:21] * addshore looks back at the diff [16:02:58] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db1105:3311 T85757 (duration: 00m 46s) [16:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:00] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [16:03:50] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1105:3311 T85757 (duration: 00m 45s) [16:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:36] jouncebot: refresh [16:04:38] I refreshed my knowledge about deployments. [16:04:39] jouncebot: next [16:04:39] In 0 hour(s) and 55 minute(s): Wikidata: Deploy property link formatter that uses cache instead of wb_terms DB table (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190114T1700) [16:04:42] thats better [16:05:20] (03PS4) 10AndyRussG: Give protect right to centralnoticeadmin on Meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483044 (https://phabricator.wikimedia.org/T209873) [16:06:25] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484249 (owner: 10Marostegui) [16:12:55] PROBLEM - puppet last run on matomo1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tshark] [16:18:36] (03CR) 10Mobrovac: "Hm, but as you noted, stretch-backports gives us node 8, while we will be skipping node 8 and go directly to node 10. 
Would that still be " [puppet] - 10https://gerrit.wikimedia.org/r/483891 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [16:18:45] ^ matomo1001 is me, should recover soon [16:25:09] (03CR) 10Muehlenhoff: "Yeah, the nodejs 10 package from the component also no longer builds a -legacy package." [puppet] - 10https://gerrit.wikimedia.org/r/483891 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [16:26:34] (03PS1) 10Volans: sre.switchdc.mediawiki: fix update tendril [cookbooks] - 10https://gerrit.wikimedia.org/r/484255 [16:36:30] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 218.29 seconds [16:37:05] (03CR) 10Cwhite: [C: 03+2] hiera: add cluster definition to dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/483602 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [16:37:12] (03PS5) 10Cwhite: hiera: add cluster definition to dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/483602 (https://phabricator.wikimedia.org/T210486) [16:37:23] * elukey looks at moritzm trying to break our dear Matomo [16:37:26] :D [16:39:07] (03CR) 10Cwhite: [C: 03+2] hiera: add cluster definition to syslog servers [puppet] - 10https://gerrit.wikimedia.org/r/483612 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [16:43:30] RECOVERY - puppet last run on matomo1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:43:36] !log mobrovac@deploy1001 scap-helm -h [namespace: -h, clusters: eqiad,codfw] [16:43:36] !log mobrovac@deploy1001 scap-helm -h cluster eqiad completed [16:43:36] !log mobrovac@deploy1001 scap-helm -h cluster codfw completed [16:43:37] !log mobrovac@deploy1001 scap-helm -h finished [16:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:45] lol [16:44:01] akosiaris: bug or feature ^ ? [16:44:02] :P [16:44:17] (03PS2) 10Cwhite: hiera: add cluster definition to syslog servers [puppet] - 10https://gerrit.wikimedia.org/r/483612 (https://phabricator.wikimedia.org/T210486) [16:44:24] mobrovac: I bet feature, public help :-P [16:45:54] (03PS2) 10Nuria: Adding default granularities for monthly datasets [puppet] - 10https://gerrit.wikimedia.org/r/483888 (https://phabricator.wikimedia.org/T209103) [16:46:24] RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 9.86 seconds [16:46:28] RECOVERY - MariaDB Slave Lag: s2 on db2041 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [16:46:30] RECOVERY - MariaDB Slave Lag: s2 on db2063 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [16:46:50] RECOVERY - MariaDB Slave Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 0.30 seconds [16:46:58] (03CR) 10Cwhite: [C: 03+2] hiera: add cluster definition to syslog servers [puppet] - 10https://gerrit.wikimedia.org/r/483612 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [16:47:17] RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 0.40 seconds [16:47:31] (03PS3) 10Elukey: turnilo: add default granularities for monthly datasets [puppet] - 10https://gerrit.wikimedia.org/r/483888 (https://phabricator.wikimedia.org/T209103) (owner: 10Nuria) [16:47:40] (03PS4) 10Elukey: turnilo: add default granularities for monthly datasets [puppet] - 10https://gerrit.wikimedia.org/r/483888 (https://phabricator.wikimedia.org/T209103) (owner: 10Nuria) [16:50:37] (03CR) 10Elukey: [C: 03+2] turnilo: add default granularities for monthly datasets [puppet] - 10https://gerrit.wikimedia.org/r/483888 (https://phabricator.wikimedia.org/T209103) (owner: 10Nuria) [16:51:05] PROBLEM - puppet last run on cloudvirt1030 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:51:22] mobrovac: bug for sure [16:56:07] RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 0.42 seconds [16:57:57] RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 0.24 seconds [16:58:15] (03PS1) 10Huji: Add new synonyms for namespaces in Persian (fa) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484256 (https://phabricator.wikimedia.org/T213733) [16:59:47] RECOVERY - MariaDB Slave Lag: s2 on db2091 is OK: OK slave_sql_lag Replication lag: 0.39 seconds [16:59:57] (03PS2) 10Gehel: wdqs: prometheus-blazegraph-exporter supports multi instances [puppet] - 10https://gerrit.wikimedia.org/r/484246 (https://phabricator.wikimedia.org/T213234) [17:00:04] addshore: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata: Deploy property link formatter that uses cache instead of wb_terms DB table deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190114T1700). [17:00:04] Addshore: A patch you scheduled for Wikidata: Deploy property link formatter that uses cache instead of wb_terms DB table is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:13] 10Operations, 10ops-eqiad, 10DBA: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues - https://phabricator.wikimedia.org/T196726 (10Cmjohnson) @Marostegui I have to do move the DIMM to another slot and see if the error corrects itself moves with the DIMM or remains the same. Can you... 
[17:00:34] (03PS1) 10GTirloni: wmcs::nfs::misc - Configure nsswitch.conf [puppet] - 10https://gerrit.wikimedia.org/r/484257 (https://phabricator.wikimedia.org/T209527) [17:00:55] 10Operations, 10ops-eqiad, 10DBA: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues - https://phabricator.wikimedia.org/T196726 (10Marostegui) Yep, we can do that! Just ping us when you are ready for it Thanks! [17:01:00] (03CR) 10jerkins-bot: [V: 04-1] wdqs: prometheus-blazegraph-exporter supports multi instances [puppet] - 10https://gerrit.wikimedia.org/r/484246 (https://phabricator.wikimedia.org/T213234) (owner: 10Gehel) [17:01:03] (03PS2) 10Huji: Add new synonyms for namespaces in Persian (fa) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484256 (https://phabricator.wikimedia.org/T213733) [17:01:33] 10Operations, 10ops-eqiad, 10RESTBase, 10RESTBase-Cassandra, and 3 others: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 (10Cmjohnson) The log remains clear and no errors have returned. I will give it another 24 hours and if no change then it can go back into service. 
[17:03:07] o/ [17:03:14] * addshore is going to go ahead with the slot :) [17:04:00] 10Operations, 10media-storage: Lost file Juan_Guaidó.jpg - https://phabricator.wikimedia.org/T213655 (10CDanis) p:05Triage→03Normal a:03jcrespo [17:04:16] (03PS2) 10Addshore: Introduce wmgWikibaseMaxItemIdForNewPropertyIdHtmlFormatter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484245 (https://phabricator.wikimedia.org/T201831) [17:04:21] (03PS2) 10Addshore: wmgWikibaseMaxItemIdForNewPropertyIdHtmlFormatter 3000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484247 (https://phabricator.wikimedia.org/T201831) [17:04:28] (03PS2) 10Addshore: wmgWikibaseMaxItemIdForNewPropertyIdHtmlFormatter fully on [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484248 (https://phabricator.wikimedia.org/T201831) [17:04:47] 10Operations, 10media-storage: Lost file Juan_Guaidó.jpg - https://phabricator.wikimedia.org/T213655 (10CDanis) @jcrespo and @fgiunchedi are going to take a look at what happened to the file in Swift. 
[17:04:49] (03PS3) 10Gehel: wdqs: prometheus-blazegraph-exporter supports multi instances [puppet] - 10https://gerrit.wikimedia.org/r/484246 (https://phabricator.wikimedia.org/T213234) [17:04:59] (03CR) 10Addshore: [C: 03+2] Introduce wmgWikibaseMaxItemIdForNewPropertyIdHtmlFormatter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484245 (https://phabricator.wikimedia.org/T201831) (owner: 10Addshore) [17:05:25] (03CR) 10Ottomata: admin: allow users to be deployed without ssh keys configured (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey) [17:06:09] (03Merged) 10jenkins-bot: Introduce wmgWikibaseMaxItemIdForNewPropertyIdHtmlFormatter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484245 (https://phabricator.wikimedia.org/T201831) (owner: 10Addshore) [17:08:42] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T201831 T201838 Introduce wmgWikibaseMaxItemIdForNewPropertyIdHtmlFormatter PT 1/2 (duration: 00m 47s) [17:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:46] T201831: Deploy item/property link formatter that uses cache instead of wb_terms DB table - https://phabricator.wikimedia.org/T201831 [17:08:46] T201838: Use link formatter that uses cache instead of wb_terms for all wikidatawiki properties - https://phabricator.wikimedia.org/T201838 [17:09:44] !log addshore@deploy1001 Synchronized wmf-config/Wikibase.php: T201831 T201838 Introduce wmgWikibaseMaxItemIdForNewPropertyIdHtmlFormatter PT 2/2 (duration: 00m 45s) [17:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:55] (03CR) 10Addshore: [C: 03+2] wmgWikibaseMaxItemIdForNewPropertyIdHtmlFormatter 3000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484247 (https://phabricator.wikimedia.org/T201831) (owner: 10Addshore) [17:11:08] (03Merged) 10jenkins-bot: wmgWikibaseMaxItemIdForNewPropertyIdHtmlFormatter 3000 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/484247 (https://phabricator.wikimedia.org/T201831) (owner: 10Addshore) [17:11:25] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.67 seconds [17:11:27] PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.12 seconds [17:11:27] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.04 seconds [17:11:45] !log addshore@deploy1001 sync-file aborted: T201831 T201838 wmgWikibaseMaxItemIdForNewPropertyIdHtmlFormatter 3000 (duration: 00m 01s) [17:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:51] PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.72 seconds [17:11:55] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 316.10 seconds [17:12:05] (03CR) 10jenkins-bot: Introduce wmgWikibaseMaxItemIdForNewPropertyIdHtmlFormatter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484245 (https://phabricator.wikimedia.org/T201831) (owner: 10Addshore) [17:12:05] PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.96 seconds [17:12:08] (03CR) 10jenkins-bot: wmgWikibaseMaxItemIdForNewPropertyIdHtmlFormatter 3000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484247 (https://phabricator.wikimedia.org/T201831) (owner: 10Addshore) [17:12:13] PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 321.78 seconds [17:12:17] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 322.26 seconds [17:13:05] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T201831 T201838 wmgWikibaseMaxItemIdForNewPropertyIdHtmlFormatter 3000 (duration: 00m 46s) [17:13:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:14] 10Operations, 10ops-codfw, 10decommission, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Papaul) a:05Papaul→03RobH This is complete. All servers ready to be ship out. [17:13:21] * addshore will now watch some graphs for a few mins [17:13:29] 10Operations, 10Certcentral, 10Traffic: Allow specifying a custom period of time before deploying a newly issued certificate - https://phabricator.wikimedia.org/T213737 (10Vgutierrez) p:05Triage→03Normal [17:14:01] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission of restbase200[1-6] (lease return in December 2018) - https://phabricator.wikimedia.org/T211070 (10Papaul) a:05Papaul→03RobH This is complete. All servers ready to be ship out. [17:17:44] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 426.69 seconds [17:19:49] (03CR) 10Addshore: [C: 03+2] wmgWikibaseMaxItemIdForNewPropertyIdHtmlFormatter fully on [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484248 (https://phabricator.wikimedia.org/T201831) (owner: 10Addshore) [17:20:59] (03Merged) 10jenkins-bot: wmgWikibaseMaxItemIdForNewPropertyIdHtmlFormatter fully on [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484248 (https://phabricator.wikimedia.org/T201831) (owner: 10Addshore) [17:21:46] RECOVERY - puppet last run on cloudvirt1030 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:21:57] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T201831 T201838 wmgWikibaseMaxItemIdForNewPropertyIdHtmlFormatter fully on (duration: 00m 46s) [17:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:01] T201831: Deploy item/property link formatter that uses cache instead of wb_terms DB table - https://phabricator.wikimedia.org/T201831 [17:22:02] T201838: 
Use link formatter that uses cache instead of wb_terms for all wikidatawiki properties - https://phabricator.wikimedia.org/T201838 [17:24:54] (03CR) 10GTirloni: [C: 03+2] wmcs::nfs::misc - Configure nsswitch.conf [puppet] - 10https://gerrit.wikimedia.org/r/484257 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [17:25:00] (03CR) 10jenkins-bot: wmgWikibaseMaxItemIdForNewPropertyIdHtmlFormatter fully on [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484248 (https://phabricator.wikimedia.org/T201831) (owner: 10Addshore) [17:25:02] !log deploy slot done [17:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:28] (03CR) 10Mobrovac: [C: 03+1] "kk, lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/483891 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [17:34:20] (03PS1) 10GTirloni: wmcs::nfs::misc - Second attempt to fix nsswitch.conf [puppet] - 10https://gerrit.wikimedia.org/r/484258 (https://phabricator.wikimedia.org/T209527) [17:36:18] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1089_v4, cp1089_v6 [17:36:26] 10Operations, 10Patch-For-Review, 10User-Marostegui, 10User-fgiunchedi: Audit "misc" cluster hosts - https://phabricator.wikimedia.org/T210486 (10colewhite) a:03colewhite [17:37:28] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 52 ESP OK [17:42:10] PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:43:27] cp2023 was me upgrading the kernel in cp1089, it's up again already [17:43:42] (03CR) 10GTirloni: [C: 03+2] wmcs::nfs::misc - Second attempt to fix nsswitch.conf [puppet] - 10https://gerrit.wikimedia.org/r/484258 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [17:44:36] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:44:47] 10Puppet, 
10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: ORES services should bind to ores config files - https://phabricator.wikimedia.org/T210719 (10Halfak) Maybe we should have a script and a process instead for manually restarting ORES nodes in a safe way. See {T213743} [17:47:02] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:48:36] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (10Vgutierrez) kernel upgraded successfully in cp1075-cp1090: ` vgutierrez@cumin1001:~$ sudo cumin cp[1075-1090].eqiad.wmnet 'uname -v' 16 hosts will be targeted: cp[1075... [17:57:20] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:00:04] gehel and onimisionipe: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190114T1800). [18:00:04] onimisionipe: A patch you scheduled for Wikidata Query Service weekly deploy is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:22] here here [18:01:46] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:03:40] (03PS1) 10GTirloni: wmcs::nfs::misc - Remove wmcs-root from admin groups [puppet] - 10https://gerrit.wikimedia.org/r/484260 (https://phabricator.wikimedia.org/T209527) [18:05:12] (03PS4) 10Dzahn: admins: add Greg to phabricator-admins [puppet] - 10https://gerrit.wikimedia.org/r/483623 (https://phabricator.wikimedia.org/T213569) [18:05:56] PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.56 seconds [18:05:56] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 352.61 seconds [18:06:16] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.83 seconds [18:06:18] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.49 seconds [18:06:22] PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.12 seconds [18:06:30] Hauskatze: re: " looks like more than 'phabricator-admins' are entitled this access" it's because there is also phabricator-roots [18:06:46] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 316.31 seconds [18:06:59] mutante: ack, but I was referring to the fact that the compiler listed users with "absent" status [18:07:04] (03CR) 10Dzahn: [C: 03+1] "approved in SRE meeting" [puppet] - 10https://gerrit.wikimedia.org/r/483623 (https://phabricator.wikimedia.org/T213569) (owner: 10Dzahn) [18:07:54] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@f71131e]: Category script and GUI updates, blazegraph launcher updates and moved RWStore from scap to puppet [18:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:56] Hauskatze: ah, that's the special group for absent users.. 
absent doesn't mean literally not existing, it means "member of a special group" [18:08:01] be back in a while [18:08:30] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:14:48] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.14 seconds [18:16:44] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:18:40] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:18:50] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@f71131e]: Category script and GUI updates, blazegraph launcher updates and moved RWStore from scap to puppet (duration: 10m 56s) [18:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:32] PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 494.70 seconds [18:19:34] PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 496.53 seconds [18:19:42] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 22.42 seconds [18:21:38] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:22:18] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:26:08] 10Operations, 10Citoid, 10serviceops, 10Kubernetes, 10Wikimedia-Incident: Zotero service crashes and pages multiple times. - https://phabricator.wikimedia.org/T213693 (10greg) Meta: Reading "This task is sort of an umbrella task for zotero latest incidents, it should be closed when we dont receive multip... 
[18:30:58] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.10 seconds [18:33:39] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10akosiaris) Some numbers to help inform the decision = Graph usage = Using WMCS resources I extracted th... [18:38:16] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10akosiaris) >>! In T211881#4820828, @Milimetric wrote: > The reason Graphoid was initially developed was t... [18:40:28] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10akosiaris) [18:40:48] (03PS3) 10Jcrespo: mariadb: Depool db1081 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484213 (https://phabricator.wikimedia.org/T213664) [18:42:03] (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool db1081 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484213 (https://phabricator.wikimedia.org/T213664) (owner: 10Jcrespo) [18:42:07] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10akosiaris) >>! In T211881#4821719, @Yurik wrote: > @akosiaris also, please add usage before the Varnish -... [18:42:58] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10akosiaris) >>! In T211881#4822247, @Tgr wrote: >>>! 
In T211881#4820828, @Milimetric wrote: >> The reason... [18:43:09] (03Merged) 10jenkins-bot: mariadb: Depool db1081 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484213 (https://phabricator.wikimedia.org/T213664) (owner: 10Jcrespo) [18:43:52] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Smalyshev) I've tried to read all of it and maybe I've missed something, but I am still not sure what added value having such separate serv... [18:43:59] 10Operations, 10ops-eqiad, 10Analytics: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10RobH) p:05Triage→03High [18:44:03] (03CR) 10jenkins-bot: mariadb: Depool db1081 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484213 (https://phabricator.wikimedia.org/T213664) (owner: 10Jcrespo) [18:45:25] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: depool db1081 (duration: 00m 46s) [18:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:10] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1081 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484267 [18:48:12] (03PS3) 10Huji: Add new synonyms for namespaces in Persian (fa) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484256 (https://phabricator.wikimedia.org/T213733) [18:48:27] !log stop upgrade and restart db1081 [18:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:35] 10Operations, 10ops-eqiad, 10Analytics: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10RobH) [18:53:02] (03CR) 10Smalyshev: [C: 03+1] "lgtm, +some random notes" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/484246 (https://phabricator.wikimedia.org/T213234) (owner: 10Gehel) [18:54:46] 10Operations, 10ops-eqiad, 10Analytics: swap a2-eqiad PDU with on-site 
spare - https://phabricator.wikimedia.org/T213748 (10RobH) [18:55:29] 10Operations: Offboard Balazs - https://phabricator.wikimedia.org/T213703 (10jbond) [18:57:29] 10Operations, 10ops-eqiad, 10Analytics: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10RobH) [18:58:55] 10Operations: Offboard Balazs - https://phabricator.wikimedia.org/T213703 (10jbond) [19:00:04] Deploy window Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190114T1900) [19:00:04] MaxSem, dcausse, stephanebisson, and James_F: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:06] * James_F waves for jouncebot. [19:00:15] o/ [19:00:19] hello [19:00:22] Anyone planning to SWAT, or should I? [19:00:51] I can SWAT if there's nothing crazy to deploy :) [19:00:58] dcausse: Go for it. :-) [19:00:59] 10Operations, 10ops-eqiad, 10Analytics: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10elukey) [19:01:42] (03PS3) 10DCausse: Remove old ArticleCreationWorkflows config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462041 (https://phabricator.wikimedia.org/T204016) (owner: 10MaxSem) [19:03:44] 10Operations, 10ops-eqiad, 10Analytics: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10elukey) [19:04:39] MaxSem: around? 
[19:05:21] looks like it's just a cleanup I guess it's fine to deploy [19:05:35] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462041 (https://phabricator.wikimedia.org/T204016) (owner: 10MaxSem) [19:06:39] (03Merged) 10jenkins-bot: Remove old ArticleCreationWorkflows config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462041 (https://phabricator.wikimedia.org/T204016) (owner: 10MaxSem) [19:08:18] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10akosiaris) > @akosiaris, the logic in is fundament... [19:09:28] 10Operations, 10ops-eqiad, 10Analytics: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10RobH) [19:09:36] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T204016: Remove old ArticleCreationWorkflows config (duration: 00m 46s) [19:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:39] 10Operations, 10ops-eqiad, 10Analytics: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10RobH) [19:10:24] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483645 (owner: 10Jforrester) [19:10:34] James_F: is there something to test with this patch? ^ [19:10:43] (03CR) 10jenkins-bot: Remove old ArticleCreationWorkflows config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462041 (https://phabricator.wikimedia.org/T204016) (owner: 10MaxSem) [19:10:57] dcausse: It's fine. 
[19:11:35] 10Operations, 10ops-eqiad, 10Analytics: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10RobH) [19:13:23] (03PS2) 10DCausse: Clean-up: Explain why WBMI wikis don't need wmgWikibaseRepoEntityNamespaces set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483645 (owner: 10Jforrester) [19:13:45] 10Operations, 10ops-eqiad, 10Analytics, 10DBA: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10RobH) [19:13:49] (03CR) 10Dzahn: [C: 03+2] admins: add Greg to phabricator-admins [puppet] - 10https://gerrit.wikimedia.org/r/483623 (https://phabricator.wikimedia.org/T213569) (owner: 10Dzahn) [19:14:33] stephanebisson: hey, is there something you would like to test (should I deploy it on mwdebug1002 first?) [19:14:43] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: add Greg Grossmeier to Phabricator admins group - https://phabricator.wikimedia.org/T213569 (10Dzahn) [19:14:45] dcausse: yes please, I can test it [19:14:53] (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb: Depool db1081 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484267 (owner: 10Jcrespo) [19:15:02] ok [19:15:09] 10Operations, 10ops-eqiad, 10Analytics, 10DBA: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10RobH) [19:16:18] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1081 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484267 (owner: 10Jcrespo) [19:16:45] stephanebisson: it's live on mwdebug1002 [19:17:51] dcausse: testing now [19:18:21] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: add Greg Grossmeier to Phabricator admins group - https://phabricator.wikimedia.org/T213569 (10Dzahn) 05Open→03Resolved a:03Dzahn ` [phab1001:~] $ id gjg uid=2890(gjg) gid=500(wikidev) groups=500(wikidev),746(phabricator-admin) ` @greg Should w... 
[19:18:31] 10Operations, 10SRE-Access-Requests: add Greg Grossmeier to Phabricator admins group - https://phabricator.wikimedia.org/T213569 (10Dzahn) [19:18:37] 10Operations, 10ops-eqiad, 10Analytics, 10DBA: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10RobH) [19:18:47] godog: re T213081, why do you want to add more partitions? [19:18:47] T213081: Consider increasing kafka logging topic partitions - https://phabricator.wikimedia.org/T213081 [19:19:00] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1081 with low load (duration: 00m 47s) [19:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:07] good to have greg-g added mutante :) [19:19:20] dcausse: Looks good, you can deploy [19:19:28] Check endpoints for mwdebug1002.eqiad.wmnet' failed: /wiki/{title} (Main Page) timed out before a response was received; /wiki/{title} (Special Version) timed out before a response was received; /w/api.php (Main Page pageprops) timed out before a response was received [19:19:29] stephanebisson: deploying [19:20:01] dcausse: ^ [19:20:06] jynus: looking [19:20:09] (03PS3) 10Dzahn: doc: grant doc-uploader access to contint users [puppet] - 10https://gerrit.wikimedia.org/r/480798 (https://phabricator.wikimedia.org/T213169) (owner: 10Hashar) [19:21:20] (03PS6) 10MarcoAurelio: [WIP] mediawiki: Stop logging each run of purge_abusefilter.pp [puppet] - 10https://gerrit.wikimedia.org/r/483876 (https://phabricator.wikimedia.org/T213591) [19:21:35] (03CR) 10MarcoAurelio: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/483876 (https://phabricator.wikimedia.org/T213591) (owner: 10MarcoAurelio) [19:23:14] (03CR) 10Dzahn: [C: 03+2] "this has been approved in SRE meeting but with the additional comment that this should not stay manual in the long-term and probably needs" [puppet] - 10https://gerrit.wikimedia.org/r/480798 (https://phabricator.wikimedia.org/T213169) (owner: 10Hashar) 
[19:23:27] jynus: I don't see anything wrong, replaying these requests is working well, I wonder if it's related to T204871 [19:23:28] T204871: Investigate the spikes of "web request took longer than 60 seconds and timed out" during deployments - https://phabricator.wikimedia.org/T204871 [19:23:59] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1081 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484267 (owner: 10Jcrespo) [19:24:01] can someone confirm that it's "normal" to see some timeouts on mwdebug1002 after running scap pull? [19:24:04] dcausse: just got that and was the only thing related I could think about [19:24:10] thcipriani: perhaps? ^ [19:24:24] and the rule #1 is to speak up just in case [19:24:30] sure [19:24:43] I don't see any error in the logs either [19:24:49] yes me neither [19:25:31] dcausse: I have noticed that on occasion when hhvm load becomes high on a particular machine after a pull [19:25:33] dcausse: remember it is only when one says "it is nothing" when issues happen, and the other way round [19:25:46] :-) [19:25:48] :) [19:26:11] thcipriani: ok thanks, I guess we're in this situation [19:28:08] !log re-activate BGP to Zayo on cr1-eqiad - T212791 [19:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:11] T212791: Interface errors on cr1-eqiad:xe-3/3/1 - https://phabricator.wikimedia.org/T212791 [19:29:15] !log dcausse@deploy1001 Synchronized php-1.33.0-wmf.12/extensions/GrowthExperiments/includes/WelcomeSurvey.php: Welcome survey: ignore check confirmed email (duration: 00m 45s) [19:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:28] 10Operations, 10Continuous-Integration-Infrastructure, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Grant sudo access for CI admins to doc.wikimedia.org publishing user - https://phabricator.wikimedia.org/T213169 (10Dzahn) 05Open→03Resolved The request has been appr... 
[19:29:46] stephanebisson: should be live [19:30:08] James_F: back to your patch, sorry for the delay [19:30:35] No worries. [19:31:09] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483645 (owner: 10Jforrester) [19:32:48] !log re-deactivate BGP to Zayo on cr1-eqiad - T212791 [19:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:53] (03Merged) 10jenkins-bot: Clean-up: Explain why WBMI wikis don't need wmgWikibaseRepoEntityNamespaces set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483645 (owner: 10Jforrester) [19:34:49] (03CR) 10Volans: [C: 04-1] "I'm sorry for the placeholder -1, but I have to run now. I'll add all the related comments later today." [software/certcentral] - 10https://gerrit.wikimedia.org/r/483163 (https://phabricator.wikimedia.org/T213301) (owner: 10Vgutierrez) [19:35:23] 10Operations, 10ops-eqiad, 10Analytics, 10DBA: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10RobH) [19:35:55] 10Operations, 10ops-eqiad, 10Analytics, 10DBA: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10RobH) [19:37:10] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Clean-up: Explain why WBMI wikis don't need wmgWikibaseRepoEntityNamespaces set (duration: 00m 46s) [19:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:13] (03CR) 10Dzahn: [C: 03+1] "yea, this is just a revert of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/410072/ which i merged. so if it's not WIP anymore it" [puppet] - 10https://gerrit.wikimedia.org/r/483876 (https://phabricator.wikimedia.org/T213591) (owner: 10MarcoAurelio) [19:37:16] (03CR) 10jenkins-bot: Clean-up: Explain why WBMI wikis don't need wmgWikibaseRepoEntityNamespaces set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483645 (owner: 10Jforrester) [19:37:36] James_F: done [19:37:49] Thanks! [19:37:51] yw! 
[19:38:26] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [19:41:07] (03PS20) 10DCausse: [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) [19:41:09] (03PS20) 10DCausse: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) [19:41:11] (03PS22) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [19:41:32] (03CR) 10Dzahn: "the quotes i left were from a discussion on #httpd the freenode channel about using protocol in server name but they must have been talkin" [puppet] - 10https://gerrit.wikimedia.org/r/483775 (https://phabricator.wikimedia.org/T95164) (owner: 10Hashar) [19:43:46] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [19:43:51] dcausse: sorry I wasn't around. Thanks for deploying! [19:43:57] MaxSem: np! [19:45:06] dcausse: All done? I've just realised I didn't schedule a core back-port. :-( I can deploy it if you're busy. [19:45:13] (03Merged) 10jenkins-bot: [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [19:45:34] James_F: I'm testing my patch but it's [19:45:44] going to take some time to test :( [19:47:14] dcausse: Oh, no worries, I can do it whenever. 
[19:47:34] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [19:50:26] (03CR) 10jenkins-bot: [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [19:51:41] 10Operations, 10ops-eqiad, 10Analytics, 10DBA: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10RobH) Please note the work has now been scheduled for Thursday, 2019-01-17 @ 07:00 EST (12:00 GMT). As both the #dba team and the #analytics team have expressed interest in st... [19:53:33] (03PS1) 10DCausse: Revert "[cirrus] Start writing to psi & omega" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484273 [19:55:22] (03PS4) 10Ottomata: [WIP] Helm chart for eventgate-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) [19:55:54] (03CR) 10DCausse: [C: 03+2] "SWAT, reverted patch failed testing on mwdebug1002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484273 (owner: 10DCausse) [19:57:05] (03Merged) 10jenkins-bot: Revert "[cirrus] Start writing to psi & omega" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484273 (owner: 10DCausse) [19:57:07] (03PS1) 10Jbond: update the offboard-user script so that it also checks absent users [puppet] - 10https://gerrit.wikimedia.org/r/484276 [19:57:47] James_F: I'm done [19:57:50] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T213397 (10RStallman-legalteam) This is fully signed and filed. Thanks! 
[19:58:03] 10Operations, 10ops-eqiad, 10Analytics, 10DBA: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10RobH) [19:58:27] !log Morning SWAT done [19:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:45] 10Operations, 10ops-eqiad, 10Analytics, 10DBA: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10Marostegui) @robh what's your plan with db1075 (the db master)? [19:59:25] (03CR) 10Gehel: wdqs: prometheus-blazegraph-exporter supports multi instances (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/484246 (https://phabricator.wikimedia.org/T213234) (owner: 10Gehel) [19:59:48] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:59:52] jouncebot: next [19:59:52] In 1 hour(s) and 0 minute(s): Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190114T2100) [19:59:58] 10Operations, 10ops-eqiad, 10Analytics, 10DBA: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10RobH) >>! In T213748#4878612, @Marostegui wrote: > @robh what's your plan with db1075 (the db master)? @cmjohnson will take 1 of the 2 power supplies and cross-cable it into... [20:03:26] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:03:58] (03CR) 10jenkins-bot: Revert "[cirrus] Start writing to psi & omega" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484273 (owner: 10DCausse) [20:04:19] 10Operations, 10ops-eqiad, 10Analytics, 10DBA: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10Marostegui) Awesome! Thanks for clarifying! 
[20:05:02] (03PS13) 10Gehel: Create second Blazegraph instance for categories [puppet] - 10https://gerrit.wikimedia.org/r/483628 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [20:08:30] (03CR) 10Anomie: [C: 03+1] "That's a lot of patches that all do basically the same thing though." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483910 (owner: 10MaxSem) [20:08:54] !log disabling puppet on all wdqs servers to deploy T213234 [20:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:57] T213234: Create puppet config to run two instances of Blazegraph - https://phabricator.wikimedia.org/T213234 [20:09:26] (03CR) 10MaxSem: "Yep, and I prefer to do it granularly, even if they'll be eventually deployed in one batch." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483910 (owner: 10MaxSem) [20:09:33] (03CR) 10Samwilson: [C: 03+1] [labs] Remove $wmgUseTemplateWizard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483900 (owner: 10MaxSem) [20:10:51] (03CR) 10Gehel: [C: 03+2] Create second Blazegraph instance for categories [puppet] - 10https://gerrit.wikimedia.org/r/483628 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [20:15:56] (03PS5) 10Ottomata: [WIP] Helm chart for eventgate-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) [20:17:47] (03PS6) 10Ottomata: [WIP] Helm chart for eventgate-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) [20:19:17] (03CR) 10Dzahn: doc: fix Apache redirects to use https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483775 (https://phabricator.wikimedia.org/T95164) (owner: 10Hashar) [20:27:34] !log gehel@deploy1001 Started deploy [wdqs/wdqs@f71131e]: upgrading wdqs1010 to latest version [20:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:58] !log 
gehel@deploy1001 Finished deploy [wdqs/wdqs@f71131e]: upgrading wdqs1010 to latest version (duration: 00m 24s) [20:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:30] Operations, Code-Stewardship-Reviews, Graphoid, Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (Tgr) Are those numbers reliable? Arabic Wikipedia gets about 5M pageviews a day, and it sounds like almos... [20:37:20] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.12/resources/Resources.php: Hot-deploy I18193b19 to add missing message for OOUI v0.30.0 (duration: 00m 47s) [20:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:03] (CR) Dzahn: "Could you give an example for URLs that are currently broken? I think there is a lot of explanation here but maybe not the actual problem " [puppet] - https://gerrit.wikimedia.org/r/483775 (https://phabricator.wikimedia.org/T95164) (owner: Hashar) [20:44:34] Operations, Citoid, SRE-Access-Requests: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (CDanis) [20:44:52] (PS1) Gehel: wdqs: make GC log file configurable per blazegraph instance [puppet] - https://gerrit.wikimedia.org/r/484288 (https://phabricator.wikimedia.org/T213234) [20:45:04] (CR) Dzahn: "ok, i see they are described on https://phabricator.wikimedia.org/T213509 gotcha.. let me run some tests with apache-fast-test from deplo" [puppet] - https://gerrit.wikimedia.org/r/483775 (https://phabricator.wikimedia.org/T95164) (owner: Hashar) [20:45:17] Operations, Citoid, SRE-Access-Requests: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (CDanis) This was approved at the meeting today. 
I'm happy to review a patchset adding your keys :) [20:49:38] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.57 seconds [20:51:07] (CR) Mathew.onipe: [C: +1] wdqs: make GC log file configurable per blazegraph instance [puppet] - https://gerrit.wikimedia.org/r/484288 (https://phabricator.wikimedia.org/T213234) (owner: Gehel) [20:51:18] does anyone know what's up with db2057? some job running or...? [20:51:38] I'm kinda hoping it's known or someone in sf tz can look (it's 11 pm here) [20:52:17] (PS1) Kosta Harlan: EditorJourney: Enable data collection for viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/484289 (https://phabricator.wikimedia.org/T213348) [20:52:27] (PS2) CDanis: Add bmansunov to deploy-service and recommendation-admin groups [puppet] - https://gerrit.wikimedia.org/r/482312 (https://phabricator.wikimedia.org/T212945) (owner: Mobrovac) [20:52:35] (CR) Gehel: [C: +2] wdqs: make GC log file configurable per blazegraph instance [puppet] - https://gerrit.wikimedia.org/r/484288 (https://phabricator.wikimedia.org/T213234) (owner: Gehel) [20:52:45] (CR) CDanis: [C: +2] Add bmansunov to deploy-service and recommendation-admin groups [puppet] - https://gerrit.wikimedia.org/r/482312 (https://phabricator.wikimedia.org/T212945) (owner: Mobrovac) [20:53:09] (PS3) CDanis: Add bmansunov to deploy-service and recommendation-admin groups [puppet] - https://gerrit.wikimedia.org/r/482312 (https://phabricator.wikimedia.org/T212945) (owner: Mobrovac) [20:54:28] Operations, Recommendation-API, Research, SRE-Access-Requests, and 3 others: Add Baha as a deployer for Recommendation API - https://phabricator.wikimedia.org/T212945 (CDanis) [20:54:48] Operations, Recommendation-API, Research, SRE-Access-Requests, and 3 others: Add Baha as a deployer for Recommendation API - https://phabricator.wikimedia.org/T212945 (CDanis) Open→Resolved 
[20:54:56] (CR) Kosta Harlan: "Waiting for Marshall's feedback on when this would get deployed." [mediawiki-config] - https://gerrit.wikimedia.org/r/484289 (https://phabricator.wikimedia.org/T213348) (owner: Kosta Harlan) [20:58:09] Operations, Traffic, Wikidata, serviceops, and 2 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (CRoslof) Transferring the domain name from WMDE to the Foundation requires that WMDE complete an ownership change form. I emailed with @Abraha... [20:59:08] apergos: it's a slave in codfw and slow queries from labsdb host. that combo should mean it doesn't need immediate action [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190114T2100). [21:00:12] no, I doubt it, but it should be flagged for review if we know there's not a slow predictable job on it [21:00:15] no parsoid deploy today [21:01:00] RECOVERY - MariaDB Slave Lag: s6 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 290.41 seconds [21:01:22] (CR) Volans: [C: +1] "LGTM, two minor optional comments inline." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/484276 (owner: Jbond) [21:05:50] Operations, LDAP-Access-Requests, WMF-Legal, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T213397 (CDanis) Open→Resolved [21:06:06] (CR) MarcoAurelio: "Thanks, Daniel. I am still waiting for someone with access to 'grep' the logs (mwmaint1002 & mwmaint2001, cfr. 
Task) and see if there has " [puppet] - https://gerrit.wikimedia.org/r/483876 (https://phabricator.wikimedia.org/T213591) (owner: MarcoAurelio) [21:09:34] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:12:08] Operations, Citoid, SRE-Access-Requests: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (CDanis) Ah, just realized I probably need to prepare the patch myself. (Sorry, first time on clinic duty.) Doing so shortly. [21:14:00] apergos: saved the slow query / user / client in a private paste.. pinged m.arostegui [21:14:16] Operations, Continuous-Integration-Infrastructure, SRE-Access-Requests, Patch-For-Review, Release-Engineering-Team (Kanban): Grant sudo access for CI admins to doc.wikimedia.org publishing user - https://phabricator.wikimedia.org/T213169 (hashar) I can confirm it is working fine. Thank you.... [21:14:21] ok, well we'll see what's said about it eventually [21:14:29] thank you for having a look [21:15:42] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:16:48] (wikibase) [21:17:13] (PS1) CDanis: add user mvolz to citoid-admin/deployment/deploy-service [puppet] - https://gerrit.wikimedia.org/r/484295 (https://phabricator.wikimedia.org/T213269) [21:18:33] (CR) MarcoAurelio: "> if you want give it a try to convert it to use the new mediawiki" [puppet] - https://gerrit.wikimedia.org/r/483876 (https://phabricator.wikimedia.org/T213591) (owner: MarcoAurelio) [21:20:10] (CR) Volans: [C: -1] "I think there is an issue (actually also with current code). See also a couple of suggestions inline." 
(6 comments) [software/certcentral] - https://gerrit.wikimedia.org/r/483163 (https://phabricator.wikimedia.org/T213301) (owner: Vgutierrez) [21:22:00] (CR) CDanis: [C: +2] add user mvolz to citoid-admin/deployment/deploy-service [puppet] - https://gerrit.wikimedia.org/r/484295 (https://phabricator.wikimedia.org/T213269) (owner: CDanis) [21:23:08] Operations, Citoid, SRE-Access-Requests, Patch-For-Review: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (CDanis) Open→Resolved [21:26:24] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@89c4d8d]: Update mobileapps to f2658de (fix ITN explore feed for dawiki) [21:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:15] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@89c4d8d]: Update mobileapps to f2658de (fix ITN explore feed for dawiki) (duration: 03m 51s) [21:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:00] (PS7) Ottomata: [WIP] Helm chart for eventgate-analytics deployment [deployment-charts] - https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) [21:45:48] RECOVERY - MariaDB Slave Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 12.16 seconds [21:46:14] RECOVERY - MariaDB Slave Lag: s2 on db2041 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [21:46:16] RECOVERY - MariaDB Slave Lag: s2 on db2091 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [21:46:16] RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [21:46:20] RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [21:46:26] RECOVERY - MariaDB Slave Lag: s2 on db2063 is OK: OK slave_sql_lag Replication lag: 0.15 seconds [21:46:44] RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 0.22 seconds 
[21:51:18] (PS1) Andrew Bogott: proxyleaks.py: update for multi-region and other issues [puppet] - https://gerrit.wikimedia.org/r/484303 [21:54:09] (PS1) Hashar: rsync: readd incoming and outgoing chmod [puppet] - https://gerrit.wikimedia.org/r/484304 [21:55:18] (CR) jerkins-bot: [V: -1] rsync: readd incoming and outgoing chmod [puppet] - https://gerrit.wikimedia.org/r/484304 (owner: Hashar) [21:59:04] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 291 bytes in 0.107 second response time [21:59:14] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 291 bytes in 0.108 second response time [21:59:32] PROBLEM - Wikitech-static main page has content on labtestweb2001 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 291 bytes in 0.100 second response time [22:00:04] bawolff and Reedy: I, the Bot under the Fountain, allow thee, The Deployer, to do Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190114T2200). 
[22:03:04] I'm working on wikitech-static, sorry for the noise [22:04:35] (PS2) Hashar: rsync: readd incoming and outgoing chmod [puppet] - https://gerrit.wikimedia.org/r/484304 (https://phabricator.wikimedia.org/T137890) [22:04:37] (PS1) Hashar: doc: make published files group writable [puppet] - https://gerrit.wikimedia.org/r/484308 (https://phabricator.wikimedia.org/T137890) [22:05:28] (CR) jerkins-bot: [V: -1] rsync: readd incoming and outgoing chmod [puppet] - https://gerrit.wikimedia.org/r/484304 (https://phabricator.wikimedia.org/T137890) (owner: Hashar) [22:06:43] (CR) Hashar: "We got sudo for doc-publisher, might as well make sure rsync set received files/dirs group writable?" [puppet] - https://gerrit.wikimedia.org/r/484308 (https://phabricator.wikimedia.org/T137890) (owner: Hashar) [22:08:00] RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 23.25 seconds [22:08:07] (CR) Hashar: "Seems modules/rsync does not pass rubocop which should not happen (TM). Will dig into the issue eventually and fix it in another change." [puppet] - https://gerrit.wikimedia.org/r/484304 (https://phabricator.wikimedia.org/T137890) (owner: Hashar) [22:10:16] PROBLEM - Blazegraph process on wdqs1010 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 499 (blazegraph), regex args ^java .* blazegraph-service-.*war [22:12:25] ^ downtime expired [22:20:49] Operations, ops-eqiad: Interface errors on cr1-eqiad:xe-3/3/1 - https://phabricator.wikimedia.org/T212791 (ayounsi) Equinix cleaned and tested the X-connect, but the issue persists. Next step is to do another round of testing/swapping on our side and follow up with Zayo if no resolution. 
[22:21:00] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 26.06 seconds [22:25:50] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 27832 bytes in 0.284 second response time [22:26:00] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 27832 bytes in 0.269 second response time [22:26:18] RECOVERY - Wikitech-static main page has content on labtestweb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 27832 bytes in 0.240 second response time [22:36:16] bblack: Don't suppose you've had time to do any of the Zero VCL removal you mentioned last week? Should I create a task under T187716? [22:36:17] T187716: Sunset Wikipedia Zero - https://phabricator.wikimedia.org/T187716 [22:39:47] !log upgraded packages and MW version on wikitech-static [22:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:34] (PS1) Gehel: wdqs: monitor blazegraph process per instance [puppet] - https://gerrit.wikimedia.org/r/484314 (https://phabricator.wikimedia.org/T213234) [22:43:09] (CR) jerkins-bot: [V: -1] wdqs: monitor blazegraph process per instance [puppet] - https://gerrit.wikimedia.org/r/484314 (https://phabricator.wikimedia.org/T213234) (owner: Gehel) [22:53:01] (CR) Dzahn: "i'm somewhat skeptical about adding several "hacks" to do things manual and prevent breakage when doing things manual when at the same tim" [puppet] - https://gerrit.wikimedia.org/r/484308 (https://phabricator.wikimedia.org/T137890) (owner: Hashar) [22:54:57] James_F: please do, it's been a pretty swampy january so far :) [22:57:29] bblack: Kk. 
[22:57:55] (PS3) Dzahn: doc: force users umask for wikidev group [puppet] - https://gerrit.wikimedia.org/r/484194 (https://phabricator.wikimedia.org/T137890) (owner: Hashar) [22:58:11] Operations, ExternalGuidance, Traffic: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (BBlack) I don't have any suggestions, no. Develop a straw-patch which at least serves in code terms to document the intent (e.g. the explicit header and URI val... [22:58:55] (PS2) Smalyshev: wdqs: monitor blazegraph process per instance [puppet] - https://gerrit.wikimedia.org/r/484314 (https://phabricator.wikimedia.org/T213234) (owner: Gehel) [22:59:23] Operations, Traffic, Zero: Zero VCL removal - https://phabricator.wikimedia.org/T213769 (Jdforrester-WMF) p:Triage→Normal [22:59:30] (CR) Jforrester: [C: -2] "Specifically, blocked on T213769." [mediawiki-config] - https://gerrit.wikimedia.org/r/483193 (owner: Jforrester) [22:59:56] (CR) jerkins-bot: [V: -1] wdqs: monitor blazegraph process per instance [puppet] - https://gerrit.wikimedia.org/r/484314 (https://phabricator.wikimedia.org/T213234) (owner: Gehel) [23:00:11] (PS3) Jforrester: Revert "Block WP Zero users from accessing Phabricator uploads" [puppet] - https://gerrit.wikimedia.org/r/479399 (https://phabricator.wikimedia.org/T213769) (owner: MaxSem) [23:00:35] (CR) Dzahn: "> We can fix the permissions ourselves once we are granted sudo as the doc-publisher user T213169" [puppet] - https://gerrit.wikimedia.org/r/484194 (https://phabricator.wikimedia.org/T137890) (owner: Hashar) [23:00:50] (CR) Dzahn: [C: +2] doc: force users umask for wikidev group [puppet] - https://gerrit.wikimedia.org/r/484194 (https://phabricator.wikimedia.org/T137890) (owner: Hashar) [23:02:13] (CR) Dzahn: "adding Moritz for general rsync module changes" [puppet] - https://gerrit.wikimedia.org/r/484304 
(https://phabricator.wikimedia.org/T137890) (owner: Hashar) [23:03:04] (CR) Dzahn: "depends on getting support for it back into rsync module.. so stalled for a moment" [puppet] - https://gerrit.wikimedia.org/r/484308 (https://phabricator.wikimedia.org/T137890) (owner: Hashar) [23:05:11] Operations, Continuous-Integration-Infrastructure, SRE-Access-Requests, Patch-For-Review, Release-Engineering-Team (Kanban): Grant sudo access for CI admins to doc.wikimedia.org publishing user - https://phabricator.wikimedia.org/T213169 (Dzahn) >>! In T213169#4878825, @hashar wrote: >just no... [23:08:33] (PS3) Smalyshev: wdqs: monitor blazegraph process per instance [puppet] - https://gerrit.wikimedia.org/r/484314 (https://phabricator.wikimedia.org/T213234) (owner: Gehel) [23:12:05] (PS4) Gehel: wdqs: monitor blazegraph process per instance [puppet] - https://gerrit.wikimedia.org/r/484314 (https://phabricator.wikimedia.org/T213234) [23:13:05] (PS4) Gergő Tisza: Improve list of privileged groups [mediawiki-config] - https://gerrit.wikimedia.org/r/483022 [23:13:28] (CR) Smalyshev: [C: +1] wdqs: monitor blazegraph process per instance [puppet] - https://gerrit.wikimedia.org/r/484314 (https://phabricator.wikimedia.org/T213234) (owner: Gehel) [23:15:23] (PS1) Dzahn: doc: ferm, allow http connections from deployment hosts [puppet] - https://gerrit.wikimedia.org/r/484317 [23:15:59] (PS5) Gehel: wdqs: monitor blazegraph process per instance [puppet] - https://gerrit.wikimedia.org/r/484314 (https://phabricator.wikimedia.org/T213234) [23:17:13] (PS2) Dzahn: doc: ferm, allow http connections from deployment hosts [puppet] - https://gerrit.wikimedia.org/r/484317 (https://phabricator.wikimedia.org/T137890) [23:17:38] (PS6) Gehel: wdqs: monitor blazegraph process per instance [puppet] - https://gerrit.wikimedia.org/r/484314 (https://phabricator.wikimedia.org/T213234) [23:18:29] (CR) Dzahn: [C: +2] 
"https://puppet-compiler.wmflabs.org/compiler1002/14327/doc1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/484317 (https://phabricator.wikimedia.org/T137890) (owner: Dzahn) [23:18:42] (CR) Gehel: [C: +2] wdqs: monitor blazegraph process per instance [puppet] - https://gerrit.wikimedia.org/r/484314 (https://phabricator.wikimedia.org/T213234) (owner: Gehel) [23:19:06] mutante: damn, you were faster than me! [23:19:14] (PS7) Gehel: wdqs: monitor blazegraph process per instance [puppet] - https://gerrit.wikimedia.org/r/484314 (https://phabricator.wikimedia.org/T213234) [23:20:29] gehel: oops. a constant race. your time now before i touch the next :) [23:20:54] mutante: thanks! [23:25:57] PROBLEM - testing a script [23:27:13] PROBLEM - testing a script [23:27:54] (PS4) Smalyshev: wdqs: prometheus-blazegraph-exporter supports multi instances [puppet] - https://gerrit.wikimedia.org/r/484246 (https://phabricator.wikimedia.org/T213234) (owner: Gehel) [23:28:03] hmm, isn't john bond our new employee? 
[23:28:13] yes sorry guys that was me [23:29:02] (PS1) Gehel: wdqs: removed unused port parameter on wdqs::monitor::blazegraph [puppet] - https://gerrit.wikimedia.org/r/484319 (https://phabricator.wikimedia.org/T213234) [23:29:44] (CR) Gehel: [C: +2] wdqs: removed unused port parameter on wdqs::monitor::blazegraph [puppet] - https://gerrit.wikimedia.org/r/484319 (https://phabricator.wikimedia.org/T213234) (owner: Gehel) [23:30:19] !log doc1001 - disabling puppet, testing apache config change 483775 [23:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:33] jbond42: I recommend applying for an irc cloak, this can be done by "/msg wmopbot cloak" and following the prompts, lot easier to tell when trying to hit spammers [23:30:37] apologies for the removal [23:31:02] p858snake: no worries i should have used a dev room either way [23:31:21] also i did request a cloak for this account just waiting for approval [23:33:09] the other account can remain as a standard account or blocked :) [23:34:13] (CR) Dzahn: [C: +1] "i opened a firewall hole to allow using apache-fast-test from deploy1001 against doc1001 (https://gerrit.wikimedia.org/r/#/c/operations/pu" [puppet] - https://gerrit.wikimedia.org/r/483775 (https://phabricator.wikimedia.org/T95164) (owner: Hashar) [23:34:23] (PS3) Dzahn: doc: fix Apache redirects to use https [puppet] - https://gerrit.wikimedia.org/r/483775 (https://phabricator.wikimedia.org/T95164) (owner: Hashar) [23:34:50] (CR) Dzahn: [C: +2] "editing the old config didn't really do anything anymore and we should delete it but shrug :)" [puppet] - https://gerrit.wikimedia.org/r/483775 (https://phabricator.wikimedia.org/T95164) (owner: Hashar) [23:35:07] Operations, netops, Performance-Team (Radar): Stop prioritizing peering over transit - https://phabricator.wikimedia.org/T204281 (ayounsi) [23:39:51] !log gehel@deploy1001 Started deploy [wdqs/wdqs@59d5f40]: New wdqs 
startup script for multi-instance [23:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:06] (PS1) Dzahn: contint: delete unused doc.wikimedia.org site config [puppet] - https://gerrit.wikimedia.org/r/484321 (https://phabricator.wikimedia.org/T137890) [23:47:20] Operations, MediaWiki Language Extension Bundle, MediaWiki-Cache, Language-Team (Language-2019-January-March), and 5 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (aaron) >>! In T203786#487722... [23:47:26] Operations, CirrusSearch, Discovery-Search (Current work): Add chi, psi and omega selector to the elasticsearch dashboards in grafana - https://phabricator.wikimedia.org/T211956 (debt) Open→Resolved [23:49:43] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@59d5f40]: New wdqs startup script for multi-instance (duration: 09m 53s) [23:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:45] PROBLEM - Blazegraph process on wdqs2001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 499 (blazegraph), regex args ^java .* blazegraph-service-.*war [23:57:05] PROBLEM - Blazegraph process on wdqs2005 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 499 (blazegraph), regex args ^java .* blazegraph-service-.*war [23:57:11] PROBLEM - Blazegraph process on wdqs2004 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 499 (blazegraph), regex args ^java .* blazegraph-service-.*war [23:57:13] PROBLEM - Blazegraph process on wdqs2003 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 499 (blazegraph), regex args ^java .* blazegraph-service-.*war [23:57:51] ^ transient failure, icinga needs to be updated as well for new wdqs instances, sorry for the noise [23:57:51] i know what this will be :) [23:57:53] PROBLEM - Blazegraph process on wdqs1006 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 499 (blazegraph), 
regex args ^java .* blazegraph-service-.*war [23:58:03] Matt announced running 2 blazegraphs per instance in the meeting today [23:58:03] PROBLEM - Blazegraph process on wdqs1008 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 499 (blazegraph), regex args ^java .* blazegraph-service-.*war [23:58:07] PROBLEM - Blazegraph process on wdqs2002 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 499 (blazegraph), regex args ^java .* blazegraph-service-.*war [23:58:22] and the check isn't "> 0" but " == 1" [23:58:41] mutante: at least someone listens in those meetings! [23:58:43] haha, it looked a lot like "something bad happened" [23:58:49] PROBLEM - Blazegraph process on wdqs2006 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 499 (blazegraph), regex args ^java .* blazegraph-service-.*war [23:58:51] mutante: left some comments on the purge_abusefilter.pp. Going to bed now. [23:59:12] nah, the correction to the check is in the same patch as the new instance, I should have split those for deployment [23:59:31] Hauskatze: thanks and good night
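Editor's note on the Blazegraph alerts above: the check matched process command lines against a regex and expected exactly one match, so hosts that started a second Blazegraph instance went CRITICAL with "2 processes". A minimal sketch of that counting logic follows. It is only an illustration, not the actual Icinga check or the puppet change: `check_procs`, its `minimum`/`maximum` parameters, and the sample command lines are hypothetical, and only the regex is taken from the alert text.

```python
import re

def check_procs(cmdlines, pattern, minimum, maximum):
    """Hypothetical Nagios-style process count check.

    Counts command lines matching `pattern` and reports OK only when
    the count lies within [minimum, maximum].
    """
    regex = re.compile(pattern)
    count = sum(1 for cmd in cmdlines if regex.search(cmd))
    if minimum <= count <= maximum:
        return "OK", f"PROCS OK: {count} processes"
    return "CRITICAL", f"PROCS CRITICAL: {count} processes"

# Regex copied from the alerts; the command lines are made up.
PATTERN = r"^java .* blazegraph-service-.*war"
cmdlines = [
    "java -server -jar blazegraph-service-a.war",
    "java -server -jar blazegraph-service-b.war",
    "/usr/sbin/sshd -D",
]

# An "== 1" check alerts once a second instance is running.
print(check_procs(cmdlines, PATTERN, 1, 1))
# Widening the accepted range to "at least one" accepts both instances.
print(check_procs(cmdlines, PATTERN, 1, float("inf")))
```

Monitoring each instance with its own pattern, as the "monitor blazegraph process per instance" patch title suggests, would keep the strict "== 1" semantics per instance; the sketch does not attempt to reproduce that change.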