[00:00:20] (03PS1) 10Dzahn: webperf: silence output of curl in cron job [puppet] - 10https://gerrit.wikimedia.org/r/453307 [00:02:03] (03CR) 10Dzahn: [C: 032] "curl -s for silent mode" [puppet] - 10https://gerrit.wikimedia.org/r/453307 (owner: 10Dzahn) [00:12:25] PROBLEM - Disk space on elastic1027 is CRITICAL: DISK CRITICAL - free space: /srv 52284 MB (10% inode=99%) [00:14:34] RECOVERY - Disk space on elastic1027 is OK: DISK OK [00:21:25] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 [00:21:25] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 [00:26:29] * Krinkle staging on mwdebug1002 / deploy1001 [00:26:41] (03PS3) 10Krinkle: Remove StartProfiler.php (no longer used) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453063 (https://phabricator.wikimedia.org/T201782) [00:34:54] (03CR) 10Krinkle: [C: 032] "On deploy1001, mwdebug1002 and a random app server:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453063 (https://phabricator.wikimedia.org/T201782) (owner: 10Krinkle) [00:36:09] (03Merged) 10jenkins-bot: Remove StartProfiler.php (no longer used) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453063 (https://phabricator.wikimedia.org/T201782) (owner: 10Krinkle) [00:39:28] !log krinkle@deploy1001 Synchronized docroot/noc/: Ia83751695f35def (duration: 00m 50s) [00:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:45] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /srv 52300 MB (10% inode=99%) [00:42:09] !log krinkle@deploy1001 Synchronized wmf-config/: rm StartProfiler.php / Ia83751695f35def / T201782 (duration: 00m 50s) [00:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:16] T201782: Remove use of StartProfiler.php in wmf production - 
https://phabricator.wikimedia.org/T201782 [00:45:55] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /srv 51925 MB (10% inode=99%) [00:55:25] RECOVERY - Disk space on elastic1019 is OK: DISK OK [00:57:06] (03CR) 10jenkins-bot: Remove StartProfiler.php (no longer used) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453063 (https://phabricator.wikimedia.org/T201782) (owner: 10Krinkle) [00:58:54] 10Operations, 10LDAP-Access-Requests, 10User-Addshore: Give access to graphite and grafana-admin to Aleksey Bekh-Ivanov (WMDE) - https://phabricator.wikimedia.org/T199233 (10Dzahn) I was able to find out the user name is "albe" by searching for all wikimedia.de email addresses. [01:00:07] 10Operations, 10LDAP-Access-Requests, 10User-Addshore: Give access to graphite and grafana-admin to Aleksey Bekh-Ivanov (WMDE) - https://phabricator.wikimedia.org/T199233 (10Dzahn) 05Open>03Resolved Done! user "albe" has been added to group "nda". [01:08:04] 10Operations, 10Analytics, 10vm-requests: eqiad: (3) VM %request for internal analytics web sites - https://phabricator.wikimedia.org/T202013 (10Dzahn) Hi, thanks for filling out the template. Did you have specific names in mind? I know .. naming discussion, heh.. but we need to start by adding them to DNS b... [01:08:59] 10Operations, 10Analytics, 10vm-requests: eqiad: (3) VM %request for internal analytics web sites - https://phabricator.wikimedia.org/T202013 (10Dzahn) "yarn", "superset" and "turnilo" each with the standard "1001" at the end? 
[01:13:20] 10Operations, 10Analytics, 10vm-requests: eqiad: (3) VM %request for internal analytics web sites - https://phabricator.wikimedia.org/T202013 (10Dzahn) alternative: analytics-tools1001, 1002 and 1003 [01:42:24] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0 [01:43:04] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [01:44:05] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [01:44:25] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 [01:46:15] PROBLEM - Disk space on elastic1027 is CRITICAL: DISK CRITICAL - free space: /srv 48829 MB (10% inode=99%) [01:47:24] RECOVERY - Disk space on elastic1027 is OK: DISK OK [01:49:24] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [01:49:44] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0 [01:56:54] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [01:57:15] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 [02:13:05] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0 [02:14:45] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [02:16:05] RECOVERY - Router interfaces on cr2-eqiad is 
OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 [02:16:45] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [02:19:34] PROBLEM - HHVM rendering on mw2223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:25] RECOVERY - HHVM rendering on mw2223 is OK: HTTP OK: HTTP/1.1 200 OK - 79205 bytes in 0.391 second response time [03:29:55] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 866.23 seconds [03:42:34] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 291.64 seconds [04:44:04] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [04:46:05] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [05:01:34] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0 [05:04:45] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 [05:25:05] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [05:25:15] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0 [05:26:22] (03CR) 10Krinkle: mediawiki: move php to a profile, use the php class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/453093 (https://phabricator.wikimedia.org/T201140) (owner: 
10Giuseppe Lavagetto) [05:27:14] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [05:27:19] (03CR) 10Krinkle: mediawiki: move php to a profile, use the php class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/453093 (https://phabricator.wikimedia.org/T201140) (owner: 10Giuseppe Lavagetto) [05:27:25] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 [05:29:26] (03CR) 10Giuseppe Lavagetto: mediawiki: move php to a profile, use the php class (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/453093 (https://phabricator.wikimedia.org/T201140) (owner: 10Giuseppe Lavagetto) [05:39:24] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [05:40:25] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. [06:29:54] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/mediawiki_apache] [06:32:45] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/RapidSSL_SHA256_CA_-_G3.crt] [06:38:24] (03PS1) 10Jcrespo: toolsdb: Ignore s51290__dpl_p replication on toolsdb replica [puppet] - 10https://gerrit.wikimedia.org/r/453355 (https://phabricator.wikimedia.org/T202055) [06:57:55] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [07:00:14] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:08:42] (03PS1) 10Muehlenhoff: Extend MOU for Shilad Sen [puppet] - 10https://gerrit.wikimedia.org/r/453357 [07:12:48] (03CR) 10Muehlenhoff: [C: 032] Extend MOU for Shilad Sen [puppet] - 10https://gerrit.wikimedia.org/r/453357 (owner: 10Muehlenhoff) [07:41:09] !log rebooting mw2220-mw2242 for kernel security update (also bundling wikidiff and apache updates) [07:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:25] !log stopping db2085 for upgrade [07:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:35] PROBLEM - Check systemd state on elastic1027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:55:54] PROBLEM - Check systemd state on elastic1028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:56:04] PROBLEM - Check systemd state on elastic1048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:56:15] * gehel is checking elastic [07:56:15] PROBLEM - Check systemd state on elastic1034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:56:24] PROBLEM - Check systemd state on elastic1032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[07:56:34] PROBLEM - Check systemd state on elastic1049 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:56:35] PROBLEM - Check systemd state on elastic1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:56:35] PROBLEM - Check systemd state on elastic1038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:56:35] PROBLEM - Check systemd state on elastic1029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:56:44] PROBLEM - Check systemd state on elastic1050 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:56:45] PROBLEM - Check systemd state on elastic1017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:56:53] seems to be mjolnir failing, no user facing impact [07:56:54] PROBLEM - Check systemd state on elastic1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:56:54] PROBLEM - Check systemd state on elastic1044 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:56:55] PROBLEM - Check systemd state on elastic1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:57:04] PROBLEM - Check systemd state on elastic1035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:57:04] PROBLEM - Check systemd state on elastic1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:57:05] PROBLEM - Check systemd state on elastic1046 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:57:26] do you know what unit is failing? 
[07:57:45] RECOVERY - Check systemd state on elastic1050 is OK: OK - running: The system is fully operational [07:57:51] jynus: yep, mjolnir (daemon used to transfer stuff from analytics) [07:57:55] PROBLEM - Check systemd state on elastic1023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:58:04] PROBLEM - Check systemd state on elastic1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:58:04] PROBLEM - Check systemd state on elastic1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:58:04] PROBLEM - Check systemd state on elastic1031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:58:25] PROBLEM - Check systemd state on elastic1019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:58:25] PROBLEM - Check systemd state on elastic1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:58:45] PROBLEM - Check systemd state on elastic1026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:59:04] PROBLEM - Check systemd state on elastic1051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:59:15] RECOVERY - Check systemd state on elastic1048 is OK: OK - running: The system is fully operational [07:59:19] !log restarting mjolnir on all elastic / cirrus nodes [07:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:25] RECOVERY - Check systemd state on elastic1034 is OK: OK - running: The system is fully operational [07:59:34] RECOVERY - Check systemd state on elastic1019 is OK: OK - running: The system is fully operational [07:59:34] RECOVERY - Check systemd state on elastic1020 is OK: OK - running: The system is fully operational [07:59:35] RECOVERY - Check systemd state on elastic1032 is OK: OK - running: The system is fully operational [07:59:44] RECOVERY - Check systemd state on elastic1049 is OK: OK - running: The system is fully operational [07:59:45] RECOVERY - Check systemd state on elastic1026 is OK: OK - running: The system is fully operational [07:59:45] RECOVERY - Check systemd state on elastic1039 is OK: OK - running: The system is fully operational [07:59:45] RECOVERY - Check systemd state on elastic1038 is OK: OK - running: The system is fully operational [07:59:46] RECOVERY - Check systemd state on elastic1029 is OK: OK - running: The system is fully operational [07:59:52] Oh, systemd was already doing this same restart :) [07:59:55] RECOVERY - Check systemd state on elastic1017 is OK: OK - running: The system is fully operational [07:59:55] RECOVERY - Check systemd state on elastic1027 is OK: OK - running: The system is fully operational [08:00:04] RECOVERY - Check systemd state on elastic1051 is OK: OK - running: The system is fully operational [08:00:04] RECOVERY - Check systemd state on elastic1036 is OK: OK - running: The system is fully operational [08:00:05] RECOVERY - Check systemd state on elastic1023 is OK: OK - running: The system is fully operational [08:00:05] RECOVERY - Check systemd state on elastic1018 is OK: OK - running: The system is fully operational [08:00:05] RECOVERY - 
Check systemd state on elastic1044 is OK: OK - running: The system is fully operational [08:00:05] RECOVERY - Check systemd state on elastic1024 is OK: OK - running: The system is fully operational [08:00:05] RECOVERY - Check systemd state on elastic1028 is OK: OK - running: The system is fully operational [08:00:06] RECOVERY - Check systemd state on elastic1031 is OK: OK - running: The system is fully operational [08:00:06] RECOVERY - Check systemd state on elastic1030 is OK: OK - running: The system is fully operational [08:00:15] RECOVERY - Check systemd state on elastic1035 is OK: OK - running: The system is fully operational [08:00:15] RECOVERY - Check systemd state on elastic1022 is OK: OK - running: The system is fully operational [08:00:15] RECOVERY - Check systemd state on elastic1046 is OK: OK - running: The system is fully operational [08:02:34] PROBLEM - Check systemd state on elastic1048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:02:45] PROBLEM - Check systemd state on elastic1052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:03:05] PROBLEM - Check systemd state on elastic1029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:03:15] PROBLEM - Check systemd state on elastic1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:03:16] again? [08:03:24] PROBLEM - Check systemd state on elastic1044 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:03:34] PROBLEM - Check systemd state on elastic1045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:03:34] PROBLEM - Check systemd state on elastic1046 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[08:03:54] PROBLEM - Check systemd state on elastic1019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:03:54] PROBLEM - Check systemd state on elastic1037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:03:55] PROBLEM - Check systemd state on elastic1032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:04:14] PROBLEM - Check systemd state on elastic1038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:04:15] PROBLEM - Check systemd state on elastic1050 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:04:15] PROBLEM - Check systemd state on elastic1027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:04:25] PROBLEM - Check systemd state on elastic1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[08:05:24] (03PS1) 10Jcrespo: mariadb: Depool db1085, both s4 and s5, for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453364 [08:05:24] RECOVERY - Check systemd state on elastic1027 is OK: OK - running: The system is fully operational [08:06:03] (03PS2) 10Jcrespo: mariadb: Depool db1097, both s4 and s5, for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453364 [08:06:18] !log silencing systemd check as mjolnir is flapping [08:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:54] RECOVERY - Check systemd state on elastic1046 is OK: OK - running: The system is fully operational [08:08:04] RECOVERY - Check systemd state on elastic1052 is OK: OK - running: The system is fully operational [08:08:25] RECOVERY - Check systemd state on elastic1029 is OK: OK - running: The system is fully operational [08:09:15] RECOVERY - Check systemd state on elastic1032 is OK: OK - running: The system is fully operational [08:10:26] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1097, both s4 and s5, for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453364 (owner: 10Jcrespo) [08:11:49] (03Merged) 10jenkins-bot: mariadb: Depool db1097, both s4 and s5, for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453364 (owner: 10Jcrespo) [08:11:55] RECOVERY - Check systemd state on elastic1044 is OK: OK - running: The system is fully operational [08:13:34] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1097 (duration: 00m 54s) [08:13:34] RECOVERY - Check systemd state on elastic1037 is OK: OK - running: The system is fully operational [08:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:49] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1097, both s4 and s5, for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453365 [08:13:59] (03CR) 10jenkins-bot: mariadb: Depool db1097, both s4 and s5, for maintenance 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/453364 (owner: 10Jcrespo) [08:16:15] RECOVERY - Check systemd state on elastic1030 is OK: OK - running: The system is fully operational [08:18:05] RECOVERY - Check systemd state on elastic1038 is OK: OK - running: The system is fully operational [08:19:24] RECOVERY - Check systemd state on elastic1036 is OK: OK - running: The system is fully operational [08:22:04] RECOVERY - Check systemd state on elastic1019 is OK: OK - running: The system is fully operational [08:27:05] RECOVERY - Check systemd state on elastic1045 is OK: OK - running: The system is fully operational [08:27:45] RECOVERY - Check systemd state on elastic1050 is OK: OK - running: The system is fully operational [08:29:15] RECOVERY - Check systemd state on elastic1048 is OK: OK - running: The system is fully operational [08:29:30] !log stopping db2073 for upgrade [08:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:15] PROBLEM - MariaDB Slave IO: s4 on db2095 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2073.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2073.codfw.wmnet (111 Connection refused) [08:32:32] ^no problem [08:32:40] I am restarting its master, see log [08:35:45] 10Operations, 10Discovery-Search (Current work): mjolnir-kafka-bulk-daemon failed on all elastic / eqiad nodes - https://phabricator.wikimedia.org/T202120 (10Gehel) There seem to be some correlation with a high number of failed relocations that happened just before mjolnir failing (see [[ URL | logstash ]]). N... 
[08:39:35] RECOVERY - MariaDB Slave IO: s4 on db2095 is OK: OK slave_io_state Slave_IO_Running: Yes [09:03:57] !log depool wdqs2003 to catch up on updates [09:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:14] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational [09:35:18] (03Abandoned) 10Volans: Doc: uniform docstrings [software/spicerack] - 10https://gerrit.wikimedia.org/r/451537 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:37:10] (03PS5) 10Volans: Add confctl module to interact with conftool [software/spicerack] - 10https://gerrit.wikimedia.org/r/451254 (https://phabricator.wikimedia.org/T199079) [09:37:12] (03PS3) 10Volans: Add remote module to interact with Cumin [software/spicerack] - 10https://gerrit.wikimedia.org/r/451538 (https://phabricator.wikimedia.org/T199079) [09:37:14] (03PS1) 10Volans: config: directly inject global config path [software/spicerack] - 10https://gerrit.wikimedia.org/r/453371 (https://phabricator.wikimedia.org/T199079) [09:37:16] (03PS1) 10Volans: log: directly inject running user [software/spicerack] - 10https://gerrit.wikimedia.org/r/453372 (https://phabricator.wikimedia.org/T199079) [09:37:18] (03PS1) 10Volans: Add service locator class Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/453373 (https://phabricator.wikimedia.org/T199079) [09:37:53] (03CR) 10jerkins-bot: [V: 04-1] Add confctl module to interact with conftool [software/spicerack] - 10https://gerrit.wikimedia.org/r/451254 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:37:56] (03CR) 10jerkins-bot: [V: 04-1] log: directly inject running user [software/spicerack] - 10https://gerrit.wikimedia.org/r/453372 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:37:59] (03CR) 10jerkins-bot: [V: 04-1] Add remote module to interact with Cumin [software/spicerack] - 10https://gerrit.wikimedia.org/r/451538 
(https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:38:05] (03CR) 10jerkins-bot: [V: 04-1] config: directly inject global config path [software/spicerack] - 10https://gerrit.wikimedia.org/r/453371 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:38:07] (03CR) 10jerkins-bot: [V: 04-1] Add service locator class Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/453373 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:42:52] (03CR) 10MarcoAurelio: [C: 031] "We already have some requests on the Phabricator for this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450450 (owner: 10Gergő Tisza) [09:47:03] !log rebooting mw2243-mw2290 for kernel security update (also bundling wikidiff and apache updates) [09:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:57] (03CR) 10Gehel: [C: 04-1] "Mostly good, still a few comments inline. I really like the introduction of the RemoteExecution class (just not its name :)." (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/451538 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:56:22] (03CR) 10Gehel: "LGTM, trivial enough" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/453371 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:58:35] (03CR) 10Gehel: "LGTM, trivial enough" [software/spicerack] - 10https://gerrit.wikimedia.org/r/453372 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [10:04:51] (03CR) 10Gehel: "Nice! Minor comments inline." 
(031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/453373 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [10:25:11] (03PS5) 10Vgutierrez: Replace acme_tiny with acme_requests [software/certcentral] - 10https://gerrit.wikimedia.org/r/451866 [10:28:19] (03CR) 10Vgutierrez: "Fixed in PS5" [software/certcentral] - 10https://gerrit.wikimedia.org/r/451866 (owner: 10Vgutierrez) [10:35:20] !log powercycling mw2286 [10:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:56] !log rebooting mw2152-mw2162 for kernel security update (also bundling wikidiff and apache updates) [10:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:14] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): jessie support for QLogic FastLinQ 41112 Dual Port 10Gb SFP+ Adapter - https://phabricator.wikimedia.org/T201942 (10MoritzMuehlenhoff) This card isn't supported in stretch either, see T196477. The available options for jessie are even... 
[11:54:50] (03PS4) 10Ema: ATS: storage configuration [puppet] - 10https://gerrit.wikimedia.org/r/453164 (https://phabricator.wikimedia.org/T199720) [11:55:51] (03CR) 10jerkins-bot: [V: 04-1] ATS: storage configuration [puppet] - 10https://gerrit.wikimedia.org/r/453164 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [11:57:25] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1097, both s4 and s5, for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453365 (owner: 10Jcrespo) [11:59:02] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1097, both s4 and s5, for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453365 (owner: 10Jcrespo) [12:00:13] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1097, both s4 and s5, for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453365 (owner: 10Jcrespo) [12:01:00] (03PS5) 10Ema: ATS: storage configuration [puppet] - 10https://gerrit.wikimedia.org/r/453164 (https://phabricator.wikimedia.org/T199720) [12:01:46] (03CR) 10jerkins-bot: [V: 04-1] ATS: storage configuration [puppet] - 10https://gerrit.wikimedia.org/r/453164 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [12:02:54] jerkins what is it that you want from my life [12:03:20] (03PS1) 10Wangql: Adding Chinese Wikiversity's logos: * zh-hant (default), with 1.5x and 2x * zh-hans, with 1.5x and 2x [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453383 [12:04:45] (03PS6) 10Ema: ATS: storage configuration [puppet] - 10https://gerrit.wikimedia.org/r/453164 (https://phabricator.wikimedia.org/T199720) [12:07:50] !log repooling wdqs2003, caught up on updates [12:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:25] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10aborrero) Ok, @RobH let's assume we won't be using the 2x10G NICs in the short-mid term.
How many 1G NICs do these servers have? are they disabled in B... [12:10:41] (03PS7) 10Ema: ATS: storage configuration [puppet] - 10https://gerrit.wikimedia.org/r/453164 (https://phabricator.wikimedia.org/T199720) [12:10:56] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): jessie support for QLogic FastLinQ 41112 Dual Port 10Gb SFP+ Adapter - https://phabricator.wikimedia.org/T201942 (10aborrero) [12:11:18] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10aborrero) [12:11:21] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): jessie support for QLogic FastLinQ 41112 Dual Port 10Gb SFP+ Adapter - https://phabricator.wikimedia.org/T201942 (10aborrero) 05Open>03declined Ok, then closing this task now. It seems we won't see support in Jessie in the short-mi... [12:13:45] PROBLEM - HHVM rendering on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:14:44] RECOVERY - HHVM rendering on mw1235 is OK: HTTP OK: HTTP/1.1 200 OK - 79219 bytes in 0.102 second response time [12:15:06] (03CR) 10Ema: "pcc is pleased https://puppet-compiler.wmflabs.org/compiler02/12125/cp2009.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/453164 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [12:30:30] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1097 (duration: 00m 53s) [12:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:38] (03CR) 10星耀晨曦: [C: 04-1] "The commit message missing "Bug: Txxx". Insert it immediately above the Change-Id: line. 
See https://www.mediawiki.org/wiki/Gerrit/Commit" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453383 (owner: 10Wangql) [12:34:39] (03PS2) 10Wangql: Adding Chinese Wikiversity's logos: * zh-hant (default), with 1.5x and 2x * zh-hans, with 1.5x and 2x Bug: T202127 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453383 (https://phabricator.wikimedia.org/T202127) [12:36:26] !log installing openjdk-8 security updates on elastic* in codfw (eqiad already has it via the recent reimage to stretch) [12:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:02] (03CR) 10星耀晨曦: [C: 04-1] "Keep a line between the Bug and the body. Without blank line between Bug and Change-Id." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453383 (https://phabricator.wikimedia.org/T202127) (owner: 10Wangql) [12:44:02] (03PS12) 10Vgutierrez: Refactor certcentral.certificate_management() [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 [12:46:14] !log rebooting relforge for JVM and kernel upgrade [12:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:35] !log imarlier@deploy1001 Started deploy [performance/coal@aff3793]: (no justification provided) [12:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:21] 10Operations: Onboarding Cole White - https://phabricator.wikimedia.org/T202136 (10MoritzMuehlenhoff) [12:49:42] !log imarlier@deploy1001 Finished deploy [performance/coal@aff3793]: (no justification provided) (duration: 01m 06s) [12:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:46] (03CR) 10星耀晨曦: [C: 04-1] "Also, separate the body from the subject with an empty line. 
About commit message standard, place see https://www.mediawiki.org/wiki/Gerri" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453383 (https://phabricator.wikimedia.org/T202127) (owner: 10Wangql) [12:53:47] (03PS3) 10Wangql: Adding Chinese Wikiversity's logos: * zh-hant (default), with 1.5x and 2x * zh-hans, with 1.5x and 2x [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453383 (https://phabricator.wikimedia.org/T202127) [12:57:34] (03PS4) 10星耀晨曦: Adding Chinese Wikiversity's logos: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453383 (https://phabricator.wikimedia.org/T202127) (owner: 10Wangql) [12:59:38] (03PS3) 10Vgutierrez: Implement different Certificate.save() modes [software/certcentral] - 10https://gerrit.wikimedia.org/r/453124 [13:01:44] (03CR) 10星耀晨曦: "This patch looks good now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453383 (https://phabricator.wikimedia.org/T202127) (owner: 10Wangql) [13:02:13] !log stopping db2084 for upgrade [13:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:22] (03CR) 10Muehlenhoff: [C: 031] "Looks good" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/453164 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [13:15:51] 10Operations, 10TCB-Team, 10wikidiff2, 10WMDE-QWERTY-Sprint-2018-08-14: Update wikidiff2 library on the WMF production cluster to v1.7.2 - https://phabricator.wikimedia.org/T199801 (10MoritzMuehlenhoff) 05Open>03Resolved @WMDE-Fisch Wikidiff 1.7.2 has been deployed in production (also for the inactive... [13:16:47] is someone testing importin on test2wiki? 
[13:47:54] (03PS8) 10Ema: ATS: storage configuration [puppet] - 10https://gerrit.wikimedia.org/r/453164 (https://phabricator.wikimedia.org/T199720) [14:19:52] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10Andrew) Agreed that 1G is fine [14:27:54] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10Dzahn) Hello @Rossi.dario.g I see you have provided all the info and a key, thanks for that. I just can't find your Wikitech user yet. Have you created it or did you just speci... [14:38:18] !log rebooting elnath for some kernel tests [14:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:22] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting Access to view EventLogging data - https://phabricator.wikimedia.org/T202072 (10Dzahn) Hi, You need to (digitally) sign the NDA, please write to Rachel Stallman and she'll prepare it for you. This also needs to be si... [14:50:46] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data - https://phabricator.wikimedia.org/T202069 (10Dzahn) Hi, You need to (digitally) sign the NDA, please write to Rachel Stallman (@RStallman-legalteam ) and she'll prepare it for you.... [14:51:00] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data - https://phabricator.wikimedia.org/T202063 (10Dzahn) Hi, You need to (digitally) sign the NDA, please write to Rachel Stallman (@RStallman-legalteam ) and she'll prepare it for you.... [14:52:58] 10Operations, 10Puppet: Stop introducing new code expanded from erb templates - https://phabricator.wikimedia.org/T200984 (10Dzahn) This seems more of a discussion / should be added to the wikitech page on Puppet coding standards.. but as a ticket i'm not sure how actionable it is. 
It would be resolved once... [14:53:16] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting Access to view EventLogging data - https://phabricator.wikimedia.org/T202072 (10Addshore) analytics-wmde-users only provides a very limited set of access, and that does not include access to EventLogging data afaik. They need something else li... [14:53:23] 10Operations, 10Puppet: Stop introducing new code expanded from erb templates - https://phabricator.wikimedia.org/T200984 (10Dzahn) p:05Triage>03Normal [14:55:14] 10Operations: Onboarding Cole White - https://phabricator.wikimedia.org/T202136 (10Dzahn) [14:55:48] (03PS1) 10Andrew Bogott: Neutron policy changes: [puppet] - 10https://gerrit.wikimedia.org/r/453407 [14:56:20] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp4025.ulsfo.wmnet', 'cp2004.codfw.wmnet'] ``` The log can be found in `/var/l... [14:58:33] 10Operations: Onboarding Cole White - https://phabricator.wikimedia.org/T202136 (10Dzahn) It all starts with having an email address and that is done by OIT. I don't see that yet. cwhite@ doesn't exist yet. [14:58:41] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data - https://phabricator.wikimedia.org/T202069 (10Addshore) Followed up in T202072#4510001, lets have the group conversation there. [14:58:44] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data - https://phabricator.wikimedia.org/T202063 (10Addshore) Followed up in T202072#4510001, lets have the group conversation there. 
[14:59:08] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting Access to view EventLogging data for gabriel-wmde - https://phabricator.wikimedia.org/T202072 (10Addshore) [14:59:11] (03PS2) 10Andrew Bogott: Neutron policy changes: [puppet] - 10https://gerrit.wikimedia.org/r/453407 [14:59:24] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data for Tonina WMDE - https://phabricator.wikimedia.org/T202069 (10Addshore) [14:59:31] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data for Tim WMDE - https://phabricator.wikimedia.org/T202063 (10Addshore) [14:59:51] (03CR) 10Andrew Bogott: [C: 032] Neutron policy changes: [puppet] - 10https://gerrit.wikimedia.org/r/453407 (owner: 10Andrew Bogott) [15:04:38] (03PS13) 10Vgutierrez: Refactor certcentral.certificate_management() [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 [15:04:40] (03PS4) 10Vgutierrez: Implement different Certificate.save() modes [software/certcentral] - 10https://gerrit.wikimedia.org/r/453124 [15:08:12] (03PS1) 10Andrew Bogott: Horizon: use neutron_policy.json [puppet] - 10https://gerrit.wikimedia.org/r/453412 [15:09:00] (03CR) 10Andrew Bogott: [C: 032] Horizon: use neutron_policy.json [puppet] - 10https://gerrit.wikimedia.org/r/453412 (owner: 10Andrew Bogott) [15:09:53] (03CR) 10Alex Monk: Replace acme_tiny with acme_requests (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/451866 (owner: 10Vgutierrez) [15:16:13] Is Gerrit's UI unbearably slow for anyone else? 
[15:16:37] well not so much the UI itself but networking to the web interface [15:17:41] my ping to it is all over the place [15:18:07] (03CR) 10Alex Monk: Replace acme_tiny with acme_requests (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/451866 (owner: 10Vgutierrez) [15:18:18] but bloody O2 seems to be breaking traceroute [15:20:33] (03PS1) 10Andrew Bogott: Neutron policy: forbid most things to project owner [puppet] - 10https://gerrit.wikimedia.org/r/453414 [15:23:12] (03CR) 10Alex Monk: [C: 032] Replace acme_tiny with acme_requests [software/certcentral] - 10https://gerrit.wikimedia.org/r/451866 (owner: 10Vgutierrez) [15:24:01] (03PS2) 10Andrew Bogott: Neutron policy: forbid most things to project owner [puppet] - 10https://gerrit.wikimedia.org/r/453414 [15:24:03] (03PS1) 10Andrew Bogott: neutron policy: Make a bunch of read-only calls public [puppet] - 10https://gerrit.wikimedia.org/r/453415 [15:24:14] (03PS7) 10Ayounsi: [WIP] Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 [15:24:23] (03Merged) 10jenkins-bot: Replace acme_tiny with acme_requests [software/certcentral] - 10https://gerrit.wikimedia.org/r/451866 (owner: 10Vgutierrez) [15:24:43] (03CR) 10Andrew Bogott: [C: 032] Neutron policy: forbid most things to project owner [puppet] - 10https://gerrit.wikimedia.org/r/453414 (owner: 10Andrew Bogott) [15:24:59] (03CR) 10Andrew Bogott: [C: 032] neutron policy: Make a bunch of read-only calls public [puppet] - 10https://gerrit.wikimedia.org/r/453415 (owner: 10Andrew Bogott) [15:25:47] (03CR) 10jenkins-bot: Replace acme_tiny with acme_requests [software/certcentral] - 10https://gerrit.wikimedia.org/r/451866 (owner: 10Vgutierrez) [15:28:48] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2004.codfw.wmnet', 'cp4025.ulsfo.wmnet'] ``` and were **ALL** successful. 
[15:29:32] (03PS1) 10Andrew Bogott: neutron policy.json: Restrict some default-permissive actions [puppet] - 10https://gerrit.wikimedia.org/r/453417 [15:30:23] (03CR) 10Andrew Bogott: [C: 032] neutron policy.json: Restrict some default-permissive actions [puppet] - 10https://gerrit.wikimedia.org/r/453417 (owner: 10Andrew Bogott) [15:35:22] (03PS1) 10Andrew Bogott: neutron policy.json: Fix up floating IP rules [puppet] - 10https://gerrit.wikimedia.org/r/453418 [15:36:48] (03CR) 10Andrew Bogott: [C: 032] neutron policy.json: Fix up floating IP rules [puppet] - 10https://gerrit.wikimedia.org/r/453418 (owner: 10Andrew Bogott) [15:37:55] (03PS1) 10Andrew Bogott: neutron policy.json: restrict router creation to admins [puppet] - 10https://gerrit.wikimedia.org/r/453420 [15:38:36] (03CR) 10Andrew Bogott: [C: 032] neutron policy.json: restrict router creation to admins [puppet] - 10https://gerrit.wikimedia.org/r/453420 (owner: 10Andrew Bogott) [15:43:21] (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451823 (https://phabricator.wikimedia.org/T200201) (owner: 10Tulsi Bhagat) [15:46:20] (03CR) 10Ayounsi: "Replied to comments inline." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [16:01:24] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data for Tonina WMDE - https://phabricator.wikimedia.org/T202069 (10RStallman-legalteam) We have an active NDA on file for Tonina. Thanks! [16:09:09] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting Access to view EventLogging data for gabriel-wmde - https://phabricator.wikimedia.org/T202072 (10RStallman-legalteam) If this is for Tim Fabian Eulitz, we have a current NDA on file. [16:11:07] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting Access to view EventLogging data for gabriel-wmde - https://phabricator.wikimedia.org/T202072 (10Addshore) >>! 
In T202072#4510187, @RStallman-legalteam wrote: > If this is for Tim Fabian Eulitz, we have a current NDA on file. I think your loo... [16:12:15] (03CR) 10Thcipriani: [C: 031] releases/mediawiki: proper Icinga monitoring for both Apache vhosts [puppet] - 10https://gerrit.wikimedia.org/r/453267 (owner: 10Dzahn) [16:13:22] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data for Tim WMDE - https://phabricator.wikimedia.org/T202063 (10RStallman-legalteam) We have an NDA on file for Tim Fabian Eulitz :) [16:14:48] 10Operations, 10Scap: Intermittent git-fat failure during deploy - https://phabricator.wikimedia.org/T202100 (10thcipriani) Adding @Ottomata since they were the person who initially helped me get git-fat packaged after my tweak. [16:16:26] PROBLEM - Check systemd state on elastic1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:16:30] (03PS8) 10Ayounsi: [WIP] Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 [16:16:46] PROBLEM - Check systemd state on elastic1037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:17:06] PROBLEM - Check systemd state on elastic1045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:17:06] PROBLEM - Check systemd state on elastic1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:17:09] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting Access to view EventLogging data for gabriel-wmde - https://phabricator.wikimedia.org/T202072 (10RStallman-legalteam) Sorry, I responded to the wrong ticket for Tim! I can't find a Gabriel in our NDA records for WMDE staff. [16:17:16] PROBLEM - Check systemd state on elastic1019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:17:25] PROBLEM - Check systemd state on elastic1025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:17:26] PROBLEM - Check systemd state on elastic1027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:17:26] PROBLEM - Check systemd state on elastic1029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:17:35] PROBLEM - Check systemd state on elastic1033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:17:36] PROBLEM - Check systemd state on elastic1050 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:17:45] PROBLEM - Check systemd state on elastic1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:17:55] PROBLEM - Check systemd state on elastic1044 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:18:05] PROBLEM - Check systemd state on elastic1047 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:18:05] PROBLEM - Check systemd state on elastic1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:18:16] PROBLEM - Check systemd state on elastic1026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:18:17] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting Access to view EventLogging data for gabriel-wmde / gbirke - https://phabricator.wikimedia.org/T202072 (10Addshore) [16:18:25] PROBLEM - Check systemd state on elastic1052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:18:25] PROBLEM - Check systemd state on elastic1031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:18:26] PROBLEM - Check systemd state on elastic1051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:18:35] PROBLEM - Check systemd state on elastic1032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:18:35] PROBLEM - Check systemd state on elastic1049 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:18:35] PROBLEM - Check systemd state on elastic1034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:19:05] PROBLEM - Check systemd state on elastic1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:19:06] PROBLEM - Check systemd state on elastic1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:19:06] PROBLEM - Check systemd state on elastic1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:19:06] PROBLEM - Check systemd state on elastic1023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:19:15] PROBLEM - Check systemd state on elastic1048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:19:15] PROBLEM - Check systemd state on elastic1046 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:19:29] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting Access to view EventLogging data for gabriel-wmde / gbirke - https://phabricator.wikimedia.org/T202072 (10Addshore) [16:19:45] PROBLEM - Check systemd state on elastic1028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:19:49] ^ is this expected ^ [16:19:51] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data for Tonina WMDE - https://phabricator.wikimedia.org/T202069 (10Addshore) [16:20:06] RECOVERY - Check systemd state on elastic1036 is OK: OK - running: The system is fully operational [16:20:38] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data for Tim WMDE - https://phabricator.wikimedia.org/T202063 (10Addshore) [16:21:06] RECOVERY - Check systemd state on elastic1022 is OK: OK - running: The system is fully operational [16:21:28] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10RobH) I've emailed Dell to see what our other 10G network card options are: > Dell Team, > > We are experiencing a driver support issue on the 10G N... [16:22:35] RECOVERY - Check systemd state on elastic1019 is OK: OK - running: The system is fully operational [16:22:53] robh: not entirely expected, but not entirely surprising either :/ [16:23:04] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10RobH) >>! In T199125#4509644, @aborrero wrote: > Ok, > > @RobH let's assume we won't be using the 2x10G NICs in the short-mid term. > How many 1G NICs... 
[16:23:25] gehel: as long as i wasnt the ony person aware since i have no idea how to fix it ;D [16:23:52] i havent known how to fix search since it was the old pre-elastic days ;D [16:24:00] T202120 for the details, it is going to come back up on its own, but we have a monitoring issue, or a systemd issue [16:24:02] T202120: mjolnir-kafka-bulk-daemon failed on all elastic / eqiad nodes - https://phabricator.wikimedia.org/T202120 [16:24:09] I'm going to disable that check for the moment [16:24:36] RECOVERY - Check systemd state on elastic1051 is OK: OK - running: The system is fully operational [16:24:46] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting Access to view EventLogging data for gabriel-wmde / gbirke - https://phabricator.wikimedia.org/T202072 (10RStallman-legalteam) Will send an email to gabriel.birke@wikimedia.de about signing the NDA. Thanks. [16:24:48] !log disabling systemd state check for elastic eqiad until T202120 is fixed [16:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:16] RECOVERY - Check systemd state on elastic1047 is OK: OK - running: The system is fully operational [16:26:26] RECOVERY - Check systemd state on elastic1045 is OK: OK - running: The system is fully operational [16:26:56] RECOVERY - Check systemd state on elastic1033 is OK: OK - running: The system is fully operational [16:27:25] RECOVERY - Check systemd state on elastic1024 is OK: OK - running: The system is fully operational [16:28:06] RECOVERY - Check systemd state on elastic1050 is OK: OK - running: The system is fully operational [16:28:45] RECOVERY - Check systemd state on elastic1026 is OK: OK - running: The system is fully operational [16:29:15] RECOVERY - Check systemd state on elastic1018 is OK: OK - running: The system is fully operational [16:29:35] RECOVERY - Check systemd state on elastic1048 is OK: OK - running: The system is fully operational [16:30:55] RECOVERY - Check systemd state on elastic1025 is 
OK: OK - running: The system is fully operational [16:34:05] RECOVERY - Check systemd state on elastic1031 is OK: OK - running: The system is fully operational [16:34:55] RECOVERY - Check systemd state on elastic1020 is OK: OK - running: The system is fully operational [16:35:15] RECOVERY - Check systemd state on elastic1027 is OK: OK - running: The system is fully operational [16:36:25] RECOVERY - Check systemd state on elastic1039 is OK: OK - running: The system is fully operational [16:37:26] RECOVERY - Check systemd state on elastic1028 is OK: OK - running: The system is fully operational [16:38:05] RECOVERY - Check systemd state on elastic1046 is OK: OK - running: The system is fully operational [16:38:15] RECOVERY - Check systemd state on elastic1052 is OK: OK - running: The system is fully operational [16:38:22] 10Operations, 10netops: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (10ayounsi) JTAC dug down into the VCF to confirm that it was a miss-programming issue. It's also non-trivial to list all hosts potentially having the same issue. To fix it, other than... 
[16:38:25] RECOVERY - Check systemd state on elastic1029 is OK: OK - running: The system is fully operational [16:39:26] RECOVERY - Check systemd state on elastic1032 is OK: OK - running: The system is fully operational [16:41:36] RECOVERY - Check systemd state on elastic1034 is OK: OK - running: The system is fully operational [16:43:06] RECOVERY - Check systemd state on elastic1044 is OK: OK - running: The system is fully operational [16:43:06] RECOVERY - Check systemd state on elastic1037 is OK: OK - running: The system is fully operational [16:43:45] RECOVERY - Check systemd state on elastic1049 is OK: OK - running: The system is fully operational [16:44:25] RECOVERY - Check systemd state on elastic1023 is OK: OK - running: The system is fully operational [16:46:26] RECOVERY - Check systemd state on elastic1030 is OK: OK - running: The system is fully operational [16:47:47] (03PS1) 10EBernhardson: Enable logstash for mjolnir [puppet] - 10https://gerrit.wikimedia.org/r/453431 [16:48:49] (03PS2) 10EBernhardson: Enable logstash for mjolnir [puppet] - 10https://gerrit.wikimedia.org/r/453431 [16:55:38] 10Operations, 10DBA: rack/setup/install dbproxy101[2-7].eqiad.wmnet - https://phabricator.wikimedia.org/T196690 (10RobH) a:05RobH>03None These systems are now ready for the #DBA team to take over and press into service. This can be taken over by @jcrespo or @Marostegui. I've not assigned to either since... 
[16:59:16] (03PS3) 10Dzahn: releases/mediawiki: proper Icinga monitoring for both Apache vhosts [puppet] - 10https://gerrit.wikimedia.org/r/453267 [17:00:04] (03CR) 10Dzahn: [C: 032] releases/mediawiki: proper Icinga monitoring for both Apache vhosts [puppet] - 10https://gerrit.wikimedia.org/r/453267 (owner: 10Dzahn) [17:01:08] (03PS3) 10Gehel: Enable logstash for mjolnir [puppet] - 10https://gerrit.wikimedia.org/r/453431 (owner: 10EBernhardson) [17:01:16] (03PS1) 10Dzahn: admins: create user for Dario Rossi [puppet] - 10https://gerrit.wikimedia.org/r/453433 (https://phabricator.wikimedia.org/T201196) [17:02:01] (03CR) 10Gehel: [C: 032] Enable logstash for mjolnir [puppet] - 10https://gerrit.wikimedia.org/r/453431 (owner: 10EBernhardson) [17:04:56] (03PS9) 10Ema: ATS: storage configuration [puppet] - 10https://gerrit.wikimedia.org/r/453164 (https://phabricator.wikimedia.org/T199720) [17:05:52] (03CR) 10Ema: [C: 032] ATS: storage configuration [puppet] - 10https://gerrit.wikimedia.org/r/453164 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [17:08:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labstore1003: more SMART failures - https://phabricator.wikimedia.org/T199780 (10Bstorm) This is an older host, is it in warranty, I wonder? I am also wondering if recent disk orders will include the one in this. There's maintenance to be done... [17:12:18] (03PS9) 10Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 [17:14:31] (03CR) 10Dzahn: [C: 032] "https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=releases1001&service=HTTP+releases-jenkins.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/453267 (owner: 10Dzahn) [17:19:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labstore1003: more SMART failures - https://phabricator.wikimedia.org/T199780 (10Cmjohnson) labstore1003 is 7.5 years old and well beyond it's warranty expirations (4.5 years). 
This host really needs to be decommissioned. [17:20:54] (03CR) 10Urbanecm: [C: 031] Allow all bureaucrats to remove interface-admin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450450 (owner: 10Gergő Tisza) [17:22:47] (03CR) 10Ayounsi: "Last outstanding issue fixed." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [17:26:58] (03CR) 10Dzahn: [C: 032] "works :)" [puppet] - 10https://gerrit.wikimedia.org/r/453267 (owner: 10Dzahn) [17:27:13] (03CR) 10Dzahn: [C: 04-2] "need UID" [puppet] - 10https://gerrit.wikimedia.org/r/453433 (https://phabricator.wikimedia.org/T201196) (owner: 10Dzahn) [17:35:37] (03PS1) 10Andrew Bogott: wmcs puppetmaster: profile::openstack::main::second_region_designate_host: 'cloudservices1003.wikimedia.org' [puppet] - 10https://gerrit.wikimedia.org/r/453442 [17:35:56] (03CR) 10jerkins-bot: [V: 04-1] wmcs puppetmaster: profile::openstack::main::second_region_designate_host: 'cloudservices1003.wikimedia.org' [puppet] - 10https://gerrit.wikimedia.org/r/453442 (owner: 10Andrew Bogott) [17:37:36] (03PS2) 10Andrew Bogott: wmcs: set main second_region_designate_host to cloudservices1003.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/453442 [17:38:25] (03CR) 10Andrew Bogott: [C: 032] wmcs: set main second_region_designate_host to cloudservices1003.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/453442 (owner: 10Andrew Bogott) [17:45:06] (03PS1) 10Andrew Bogott: wmcs puppetmasters: another ferm fix to allow puppet cert cleaning from eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/453444 [17:48:20] (03CR) 10Andrew Bogott: [C: 032] wmcs puppetmasters: another ferm fix to allow puppet cert cleaning from eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/453444 (owner: 10Andrew Bogott) [17:53:16] (03PS18) 10Bstorm: WIP toolforge: write/move a sonofgridengine module and toolforge profile [puppet] - 10https://gerrit.wikimedia.org/r/448791 (https://phabricator.wikimedia.org/T200557) [18:00:17] 
10Operations, 10Discovery-Search (Current work): mjolnir-kafka-bulk-daemon failed on all elastic / eqiad nodes - https://phabricator.wikimedia.org/T202120 (10EBernhardson) Looking at elastic1020 we have in `journalctl -u mjolnir-kafka-bulk-daemon` ``` Aug 17 16:14:14 elastic1020 systemd[1]: mjolnir-kafka-bulk... [18:01:39] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labstore1003: more SMART failures - https://phabricator.wikimedia.org/T199780 (10Bstorm) 100% agreed. It's replacement is at T193655. I last recall they needed to be plugged in or something switch side. I may want to try to change their names w... [18:04:11] (03PS1) 10EBernhardson: Mjolnir daemons should run with Restart=always [puppet] - 10https://gerrit.wikimedia.org/r/453450 (https://phabricator.wikimedia.org/T202120) [18:05:13] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655 (10Bstorm) These should honestly also be renamed/relabeled before they are put in production to cloudstore1008/9. [18:06:07] (03CR) 10EBernhardson: "tested after deploy and messages are now going through to logstash" [puppet] - 10https://gerrit.wikimedia.org/r/453431 (owner: 10EBernhardson) [18:06:25] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655 (10Bstorm) a:05Bstorm>03None [18:06:58] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: mjolnir-kafka-bulk-daemon failed on all elastic / eqiad nodes - https://phabricator.wikimedia.org/T202120 (10EBernhardson) a:03EBernhardson [18:11:02] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655 (10Bstorm) >>! 
In T193655#4306018, @Bstorm wrote: > Current status: labstore1009 appears to not be plugged in on any port.... [18:11:17] (03PS2) 10Legoktm: php72: Add more missing extensions that php5.6 had [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/453064 (https://phabricator.wikimedia.org/T188318) [18:14:58] (03PS3) 10Legoktm: php72: Add more missing extensions that php5.6 had [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/453064 (https://phabricator.wikimedia.org/T188318) [18:20:40] 10Operations, 10Packaging, 10Toolforge, 10Patch-For-Review: Please add php-imagick and php-redis packages to apt.wikimedia.org thirdparty/php72 - https://phabricator.wikimedia.org/T200666 (10Legoktm) 05Open>03stalled This is stalled on reprepro not working, so the existing packages can't be updated, le... [18:22:18] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655 (10Bstorm) No, this is an onboard 10G card. It doesn't seem to be able to reach install1002 for DHCP only, and otherwise... [18:45:48] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting Access to view EventLogging data for gabriel-wmde / gbirke - https://phabricator.wikimedia.org/T202072 (10Dzahn) p:05Triage>03Normal [18:45:58] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. 
[18:46:03] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data for Tonina WMDE - https://phabricator.wikimedia.org/T202069 (10Dzahn) p:05Triage>03Normal [18:46:05] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data for Tim WMDE - https://phabricator.wikimedia.org/T202063 (10Dzahn) p:05Triage>03Normal [18:46:28] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission radon - https://phabricator.wikimedia.org/T202040 (10Dzahn) p:05Triage>03Normal [18:48:08] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. [18:53:43] 10Operations, 10Wikimedia-Mailing-lists: wikimedia-us-mn administration password reset - https://phabricator.wikimedia.org/T201920 (10Dzahn) Done by running: ``` [fermium:~] $ sudo /var/lib/mailman/bin/change_pw -l wikimedia-us-mn -p $(pwgen -c1 -s 12) New wikimedia-us-mn password: ... ``` The script will... [18:54:11] 10Operations, 10Wikimedia-Mailing-lists: wikimedia-us-mn administration password reset - https://phabricator.wikimedia.org/T201920 (10Dzahn) 05Open>03Resolved a:03Dzahn [19:04:27] 10Operations, 10Wikimedia-Mailing-lists: Wikimedia Community User Group Albania mailing list request - https://phabricator.wikimedia.org/T201670 (10Dzahn) 05Open>03Resolved a:03Dzahn You have successfully created the mailing list wikimedia-wcuga and notification has been sent to the list owner silva@arap... [19:08:22] 10Operations, 10Wikimedia-Mailing-lists: Mailing list for Wikimedians of Tamazight User Group - https://phabricator.wikimedia.org/T201929 (10Dzahn) You have successfully created the mailing list wikimedia-tamazight and notification has been sent to the list owner vikoula5@yahoo.fr. You can now: [[ https://lis... 
[19:08:33] 10Operations, 10Wikimedia-Mailing-lists: Mailing list for Wikimedians of Tamazight User Group - https://phabricator.wikimedia.org/T201929 (10Dzahn) 05Open>03Resolved a:03Dzahn [19:09:28] 10Operations: Feedback Appreciatted: Use of HTTP Without TLS - https://phabricator.wikimedia.org/T202033 (10Dzahn) It's true that it's probably different for each case, but since we have to go through it case-by-case, having a list like this to check-off is actually useful to me. Sending it to mailing list would... [19:12:01] 10Operations: Feedback Appreciatted: Use of HTTP Without TLS - https://phabricator.wikimedia.org/T202033 (10Dzahn) Also having the relevant full target URLs used in these files and sorting / uniq'ing them would be useful. One common example is use of the proxy "http://url-downloader." . [19:18:55] 10Operations, 10Wikimedia-Mailing-lists, 10User-herron: Growth Team Mailing List - https://phabricator.wikimedia.org/T201467 (10Dzahn) Error: List already exists: growth-team [19:19:59] 10Operations, 10Wikimedia-Mailing-lists, 10User-herron: Growth Team Mailing List - https://phabricator.wikimedia.org/T201467 (10Dzahn) Looks like somebody created the list without using a ticket and the list is hidden from listinfo view. Please don't hide the sheer fact the list exists, even if you want only... [19:20:58] 10Operations, 10Wikimedia-Mailing-lists, 10User-herron: Growth Team Mailing List - https://phabricator.wikimedia.org/T201467 (10Dzahn) https://lists.wikimedia.org/mailman/listinfo/growth-team exists but no admin is set "Growth-team list run by .." .. adding the admin [19:21:43] 10Operations, 10Wikimedia-Mailing-lists, 10User-herron: Growth Team Mailing List - https://phabricator.wikimedia.org/T201467 (10Dzahn) "Emergency moderation of all list traffic is enabled" what happened here? 
[19:27:11] 10Operations, 10Wikimedia-Mailing-lists, 10User-herron: Growth Team Mailing List - https://phabricator.wikimedia.org/T201467 (10Dzahn) 05Open>03Resolved - re-enabled the list by running `[fermium:~] $ sudo /usr/local/sbin/disable_list -e growth-team` - added Jazmin and Ryan as admins - reset the passwor... [20:23:31] 10Operations, 10Wikimedia-Mailing-lists, 10User-herron: Growth Team Mailing List - https://phabricator.wikimedia.org/T201467 (10Legoktm) >>! In T201467#4510805, @Dzahn wrote: > Looks like somebody created the list without using a ticket and the list is hidden from listinfo view. Please don't hide the sheer f... [20:27:07] 10Operations, 10Wikimedia-Mailing-lists, 10User-herron: Growth Team Mailing List - https://phabricator.wikimedia.org/T201467 (10Aklapper) >>! In T201467#4510805, @Dzahn wrote: > Looks like somebody created the list without using a ticket Cannot find a ticket about its previous creation but the ticket about... [20:43:03] 08Warning Alert for device cr2-codfw.wikimedia.org - Juniper environment status [20:46:46] 10Operations, 10Wikimedia-Mailing-lists, 10User-herron: Growth Team Mailing List - https://phabricator.wikimedia.org/T201467 (10Dzahn) Oooh, this explains it all. Thank you legoktm and andre! Well.. i think "recycling" it was ok (i just silently unsubscribed the old list members). Unless you _really_ don't... [20:47:49] 10Operations, 10Wikimedia-Mailing-lists, 10User-herron: Growth Team Mailing List - https://phabricator.wikimedia.org/T201467 (10Dzahn) So i removed all users from the list to make sure people don't get unexpected mails. My apologies if i removed anyone who already signed up for this NEW list. You can just r... [20:48:04] 08Warning Alert for device cr2-codfw.wikimedia.org - Juniper environment status got acknowledged [20:49:34] papaul: yo! [20:49:44] papaul: are you in the DC? 
[20:49:53] nope [20:51:27] 10Operations, 10ops-codfw: Check/replace PEM2 on cr2-codfw - https://phabricator.wikimedia.org/T202166 (10ayounsi) p:05Triage>03Normal [20:53:21] papaul: https://phabricator.wikimedia.org/T202166 [20:53:26] (03CR) 10Dzahn: [C: 032] restbase: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/451819 (owner: 10Dzahn) [20:53:53] (03PS6) 10Volans: Add confctl module to interact with conftool [software/spicerack] - 10https://gerrit.wikimedia.org/r/451254 (https://phabricator.wikimedia.org/T199079) [20:53:55] (03PS4) 10Volans: Add remote module to interact with Cumin [software/spicerack] - 10https://gerrit.wikimedia.org/r/451538 (https://phabricator.wikimedia.org/T199079) [20:53:57] (03PS2) 10Volans: config: directly inject global config path [software/spicerack] - 10https://gerrit.wikimedia.org/r/453371 (https://phabricator.wikimedia.org/T199079) [20:53:59] (03PS2) 10Volans: log: directly inject running user [software/spicerack] - 10https://gerrit.wikimedia.org/r/453372 (https://phabricator.wikimedia.org/T199079) [20:54:02] (03PS2) 10Volans: Add service locator class Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/453373 (https://phabricator.wikimedia.org/T199079) [20:54:13] (03CR) 10Volans: "addressed comments" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/451254 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [20:54:36] XioNoX: don't think we have spare PSU for cr* [20:54:39] papaul: do you have access to the Juniper portal to create RMA? 
[20:54:42] (03CR) 10jerkins-bot: [V: 04-1] Add remote module to interact with Cumin [software/spicerack] - 10https://gerrit.wikimedia.org/r/451538 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [20:54:44] (03CR) 10jerkins-bot: [V: 04-1] Add confctl module to interact with conftool [software/spicerack] - 10https://gerrit.wikimedia.org/r/451254 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [20:54:46] (03CR) 10jerkins-bot: [V: 04-1] config: directly inject global config path [software/spicerack] - 10https://gerrit.wikimedia.org/r/453371 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [20:54:52] XioNoX: no access [20:55:14] (03CR) 10jerkins-bot: [V: 04-1] log: directly inject running user [software/spicerack] - 10https://gerrit.wikimedia.org/r/453372 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [20:55:22] (03CR) 10jerkins-bot: [V: 04-1] Add service locator class Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/453373 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [20:55:35] (03CR) 10Volans: "addressed comments" (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/451538 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [20:56:04] @seen andrewbogott [20:56:04] mutante: andrewbogott is in here, right now [20:56:12] (03CR) 10Volans: "partially addressed comment, see inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/453373 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [20:56:23] mutante: what's up? [20:57:19] andrewbogott: hi! i would like to merge changes that (seemingly) affect "labs" and "toollabs". i promise they are 100% noop but i also didn't want to just do it without any comment because it looks like i'm touching firewalls [20:57:33] (it's just refactoring though) [20:57:41] mutante: ok :) Got links? 
[20:58:29] andrewbogott: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/451818/ https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/448785/ [20:59:18] (03CR) 10Andrew Bogott: [C: 031] "Seems harmless :)" [puppet] - 10https://gerrit.wikimedia.org/r/451818 (owner: 10Dzahn) [20:59:48] mutante: as long as it's just renaming that seems fine. As far as I know that firewall doesn't actually apply on VMs. [21:00:20] andrewbogott: that profile includes the same code as before.. it's been noop on dozens of other places.. ack [21:00:30] thanks [21:00:31] ok :) [21:00:55] (03CR) 10Dzahn: [C: 032] labs: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/451818 (owner: 10Dzahn) [21:01:03] (03PS2) 10Dzahn: labs: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/451818 [21:01:59] (03Abandoned) 10Thcipriani: Pipeline: setup minikube in CI [puppet] - 10https://gerrit.wikimedia.org/r/428010 (https://phabricator.wikimedia.org/T188936) (owner: 10Thcipriani) [21:02:57] (03PS2) 10Dzahn: toollabs: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448785 [21:03:09] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:04:44] zooming out of that graph it looks like having those spikes is not unusual (since a couple days at least) [21:13:11] this looks like memcached errors coming mostly from labtestwiki / labtestweb2001 https://logstash.wikimedia.org/app/kibana#/dashboard/memcached?_g=h@66534ad&_a=h@c822620 [21:13:58] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:14:35] (03CR) 10Dzahn: [C: 032] toollabs: use ::profile::base::firewall [puppet] - 
10https://gerrit.wikimedia.org/r/448785 (owner: 10Dzahn) [21:37:05] how quickly do recently merged patches get deployed to en.wikipedia.beta.wmflabs.org? (or, where would i go to find that out?) [21:38:15] 10Operations, 10ops-eqiad, 10Discovery, 10Discovery-Search, 10Elasticsearch: check elastic1022 power supply redundancy - https://phabricator.wikimedia.org/T177631 (10Volans) I'm not sure it's a case of IPMI false positive, I can get the failure status from `racadm` too, see the line with `cfgServerPowerS... [21:42:25] MatmaRex: 10-15 minutes [21:42:39] https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/ [21:43:11] 10Operations, 10ops-eqiad, 10Discovery, 10Discovery-Search, 10Elasticsearch: check elastic1022 power supply redundancy - https://phabricator.wikimedia.org/T177631 (10Volans) It's also interesting to note how both of the PSUs are reported failing from `ipmi-oem`: ```lang=shell $ sudo ipmi-oem Dell power-s... [22:34:50] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data for Tim WMDE - https://phabricator.wikimedia.org/T202063 (10Dzahn) Hi @Tim_WMDE we'll need an SSH key from you. Could you create one and paste it here? Also, could you read and sign L3 unless you already have i... [22:36:03] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data for Tonina WMDE - https://phabricator.wikimedia.org/T202069 (10Dzahn) Hi @Tonina_Zhelyazkova_WMDE ! We'll need an SSH key from you. Could you create one and paste it here? Also, could you read and sign L3 unles... [22:36:59] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting Access to view EventLogging data for gabriel-wmde / gbirke - https://phabricator.wikimedia.org/T202072 (10Dzahn) Hi @gabriel-wmde ! We'll need an SSH key from you. Could you create one and paste it here? Also, could you read and sign L3 unless... 
[22:37:45] 10Operations: Onboarding Cole White - https://phabricator.wikimedia.org/T202136 (10Dzahn) p:05Triage>03High [22:43:47] 10Operations, 10WMF-Communications, 10Wikimedia-Apache-configuration, 10wikimediafoundation.org: Update redirect for jobs.wikimedia.org - https://phabricator.wikimedia.org/T200951 (10Dzahn) It looks like this has been done or is already the case: ``` curl -vvv https://jobs.wikimedia.org

The document... [22:46:36] 10Operations, 10WMF-Communications, 10Wikimedia-Apache-configuration, 10wikimediafoundation.org: Update redirect for jobs.wikimedia.org - https://phabricator.wikimedia.org/T200951 (10Dzahn) 05Open>03Resolved a:03Dzahn was done in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/450232/ i'm bei... [22:51:21] 10Operations, 10WMF-Communications, 10Wikimedia-Apache-configuration, 10wikimediafoundation.org: Update redirect for jobs.wikimedia.org - https://phabricator.wikimedia.org/T200951 (10Varnent) Thank you @Dzahn! [23:01:56] (03CR) 10Gehel: Add confctl module to interact with conftool (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/451254 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [23:21:42] (03PS1) 10Dzahn: replace a couple http links with https where possible [puppet] - 10https://gerrit.wikimedia.org/r/453541 [23:34:29] (03PS1) 10Dzahn: dumps: add https and ftp urls to mirror list [puppet] - 10https://gerrit.wikimedia.org/r/453543 [23:41:10] (03PS1) 10Dzahn: wikistats(vps): convert apache to httpd module [puppet] - 10https://gerrit.wikimedia.org/r/453544 [23:41:52] (03CR) 10jerkins-bot: [V: 04-1] wikistats(vps): convert apache to httpd module [puppet] - 10https://gerrit.wikimedia.org/r/453544 (owner: 10Dzahn) [23:42:46] (03PS2) 10Dzahn: wikistats(vps): convert apache to httpd module [puppet] - 10https://gerrit.wikimedia.org/r/453544 [23:43:22] (03CR) 10jerkins-bot: [V: 04-1] wikistats(vps): convert apache to httpd module [puppet] - 10https://gerrit.wikimedia.org/r/453544 (owner: 10Dzahn) [23:44:56] (03PS3) 10Dzahn: wikistats(vps): convert apache to httpd module [puppet] - 10https://gerrit.wikimedia.org/r/453544 [23:51:55] (03PS1) 10Dzahn: piwik: convert apache to httpd module [puppet] - 10https://gerrit.wikimedia.org/r/453546