[00:04:10] Operations, Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3689330 (Dzahn) Only a single host is left to upgrade and reboot, baham. But this is a bit more complicated to coordinate.
[00:04:44] (PS1) Legoktm: admin: Add legoktm's new ed25519 key [puppet] - https://gerrit.wikimedia.org/r/384634
[00:16:48] (PS1) Dzahn: screen-monitor: raise WARN to 4 days, lower CRIT to 20 days [puppet] - https://gerrit.wikimedia.org/r/384637 (https://phabricator.wikimedia.org/T165348)
[01:04:08] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1508202246 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4179090 keys, up 4 minutes 3 seconds - replication_delay is 1508202246
[01:04:38] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1508202275 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4177016 keys, up 4 minutes 32 seconds - replication_delay is 1508202275
[01:05:08] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6481
[01:05:17] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4178709 keys, up 5 minutes 6 seconds - replication_delay is 0
[01:05:47] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4175975 keys, up 5 minutes 34 seconds - replication_delay is 0
[01:06:08] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4175681 keys, up 6 minutes 1 seconds - replication_delay is 0
[02:25:33] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.3) (duration: 08m 23s)
[02:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:31] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Oct 17 02:32:30 UTC 2017 (duration 6m 57s)
[02:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:53:47] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.129 second response time
[03:20:57] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.152 second response time
[03:27:07] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 620.39 seconds
[04:24:39] Operations, Collaboration-Team-Triage, Notifications, Anti-Harassment (AHT Sprint 7), Patch-For-Review: Recover Echo Notification Blacklist from Backup - https://phabricator.wikimedia.org/T178313#3689472 (dbarratt) >>! In T178313#3688554, @jcrespo wrote: > Yeah, all modifications of echo-noti...
[04:25:18] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 284.42 seconds
[05:25:04] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - https://gerrit.wikimedia.org/r/384649
[05:25:07] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - https://gerrit.wikimedia.org/r/384649
[05:27:20] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - https://gerrit.wikimedia.org/r/384649 (owner: Marostegui)
[05:28:39] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - https://gerrit.wikimedia.org/r/384649 (owner: Marostegui)
[05:28:49] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - https://gerrit.wikimedia.org/r/384649 (owner: Marostegui)
[05:29:42] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1073 - T174509 (duration: 00m 46s)
[05:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:50] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[05:31:03] (PS1) Marostegui: db-eqiad.php: Depool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/384650 (https://phabricator.wikimedia.org/T174509)
[05:32:35] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/384650 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[05:32:35] Operations, Collaboration-Team-Triage, Notifications, Anti-Harassment (AHT Sprint 7), Patch-For-Review: Recover Echo Notification Blacklist from Backup - https://phabricator.wikimedia.org/T178313#3689502 (dbarratt) In https://gerrit.wikimedia.org/r/#/c/374361/2 @Anomie said: > You'll want to...
[05:32:48] !log Optimize pagelinks and templatelinks on db1090 - T174509
[05:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:44] (Merged) jenkins-bot: db-eqiad.php: Depool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/384650 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[05:34:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1090 - T174509 (duration: 00m 45s)
[05:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:53] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[05:35:59] (CR) jenkins-bot: db-eqiad.php: Depool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/384650 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[05:36:20] (PS1) Marostegui: db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/384651 (https://phabricator.wikimedia.org/T174509)
[05:37:41] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/384651 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[05:37:53] !log Optimize pagelinks and templatelinks on db1081 - T174509
[05:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:53] !log Optimize recentchanges on db1081 - T177772
[05:38:57] (Merged) jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/384651 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[05:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:01] T177772: Purge 90% of rows from recentchanges (and posibly defragment) from commonswiki and ruwiki (the ones with source:wikidata) - https://phabricator.wikimedia.org/T177772
[05:39:09] (CR) jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/384651 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[05:39:25] Operations, Collaboration-Team-Triage, Notifications, Anti-Harassment (AHT Sprint 7), Patch-For-Review: Recover Echo Notification Blacklist from Backup - https://phabricator.wikimedia.org/T178313#3689511 (dbarratt) At least in debug on production, it seems to work: ``` hphpd> $lookup = Centra...
[05:40:01] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1081 - T174509 (duration: 00m 45s)
[05:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:09] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[05:42:36] !log Optimize pagelinks, templatelinks and ores_classification on db1095 - T174509 T159753
[05:42:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:45] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[05:44:48] (PS1) Marostegui: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/384652 (https://phabricator.wikimedia.org/T174509)
[05:45:27] Anyone else wanna take a look at https://phabricator.wikimedia.org/T178313 ? I believe i have hit a dead-end
[05:46:30] !log Optimize pagelinks, templatelinks and recentchanges on db1085 T177772 T174509
[05:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:39] T177772: Purge 90% of rows from recentchanges (and posibly defragment) from commonswiki and ruwiki (the ones with source:wikidata) - https://phabricator.wikimedia.org/T177772
[05:46:39] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[05:47:54] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/384652 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[05:48:59] (Merged) jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/384652 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[05:49:11] (CR) jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/384652 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[05:50:10] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1085 - T174509 T177772 (duration: 00m 45s)
[05:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:43] (PS1) Marostegui: db-eqiad.php: Depool db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/384653 (https://phabricator.wikimedia.org/T174509)
[05:54:13] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/384653 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[05:54:23] !log Optimize pagelinks and templatelinks on db1034 - T174509
[05:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:30] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[05:55:30] (Merged) jenkins-bot: db-eqiad.php: Depool db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/384653 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[05:55:57] (CR) jenkins-bot: db-eqiad.php: Depool db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/384653 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[05:57:07] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1034 - T174509 (duration: 00m 45s)
[05:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:07] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[06:05:38] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0
[06:07:33] !log Stop replication in sync on db1072 and db1103 for data drift fixing - T164488
[06:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:40] T164488: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488
[06:18:46] Operations, Collaboration-Team-Triage, Notifications, Anti-Harassment (AHT Sprint 7), Patch-For-Review: Recover Echo Notification Blacklist from Backup - https://phabricator.wikimedia.org/T178313#3689553 (dbarratt) Is it possible.... however unlikely... that these rows were //already// zeros?...
[06:21:57] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0
[06:22:17] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[06:25:53] (PS7) Giuseppe Lavagetto: [WIP]: Support multiinstance in core servers [puppet] - https://gerrit.wikimedia.org/r/384452 (owner: Marostegui)
[06:28:07] PROBLEM - graphite.wikimedia.org on graphite1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.002 second response time
[06:29:07] RECOVERY - graphite.wikimedia.org on graphite1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.014 second response time
[06:31:27] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[06:32:07] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0
[06:33:07] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0
[06:33:27] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[06:43:07] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.128 second response time
[06:43:19] (PS8) Marostegui: [WIP]: Support multiinstance in core servers [puppet] - https://gerrit.wikimedia.org/r/384452
[06:47:32] (CR) Marostegui: "All the changes looking good: https://puppet-compiler.wmflabs.org/compiler02/8350/" [puppet] - https://gerrit.wikimedia.org/r/384452 (owner: Marostegui)
[06:55:21] !log installing libffi security updates on trusty (Debian already fixed)
[06:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:38] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[06:57:38] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[06:58:30] Operations, Continuous-Integration-Infrastructure (shipyard), Patch-For-Review, User-Joe: Unify production and CI docker image build process - https://phabricator.wikimedia.org/T177276#3689558 (Joe) I have a proposal: what about controlling semantic versioning via the changelog but allowing peopl...
[06:59:46] (PS5) Giuseppe Lavagetto: Port docker builder [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276)
[07:09:43] (PS1) Marostegui: db-eqiad.php: Depool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/384656 (https://phabricator.wikimedia.org/T174509)
[07:10:49] !log Optimize enwiki.ores_classification on db1067 - T159753
[07:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:56] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[07:11:08] !log Optimize templatelinks and pagelinks on db1067 - T174509
[07:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:15] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[07:11:16] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/384656 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[07:12:32] (Merged) jenkins-bot: db-eqiad.php: Depool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/384656 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[07:12:44] (CR) jenkins-bot: db-eqiad.php: Depool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/384656 (https://phabricator.wikimedia.org/T174509) (owner: Marostegui)
[07:13:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1067 - T174509 (duration: 00m 45s)
[07:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:44] (PS1) Ema: cache: upgrade misc_eqiad to Varnish 5 [puppet] - https://gerrit.wikimedia.org/r/384657 (https://phabricator.wikimedia.org/T177233)
[07:18:55] Operations, wikidiff2, User-Addshore, WMDE-QWERTY-Team-Board: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3689571 (Tobi_WMDE_SW) @MoritzMuehlenhoff Great! Yes, please keep us updated for when the app servers caught up. I think it makes sense then...
[07:30:08] (CR) Ema: [C: 2] cache: upgrade misc_eqiad to Varnish 5 [puppet] - https://gerrit.wikimedia.org/r/384657 (https://phabricator.wikimedia.org/T177233) (owner: Ema)
[07:31:17] !log upgrade misc_eqiad to varnish 5 T177233
[07:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:23] T177233: Upgrade cache_misc to Varnish 5 - https://phabricator.wikimedia.org/T177233
[07:33:48] PROBLEM - Varnish HTTP misc-frontend - port 80 on cp1045 is CRITICAL: connect to address 10.64.32.97 and port 80: Connection refused
[07:33:58] that's me, the host is depooled ^
[07:34:07] PROBLEM - Varnish HTTP misc-backend - port 3128 on cp1045 is CRITICAL: connect to address 10.64.32.97 and port 3128: Connection refused
[07:34:07] PROBLEM - Varnish HTTP misc-frontend - port 3122 on cp1045 is CRITICAL: connect to address 10.64.32.97 and port 3122: Connection refused
[07:34:48] RECOVERY - Varnish HTTP misc-frontend - port 80 on cp1045 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.001 second response time
[07:35:07] RECOVERY - Varnish HTTP misc-backend - port 3128 on cp1045 is OK: HTTP OK: HTTP/1.1 200 OK - 177 bytes in 0.001 second response time
[07:35:07] RECOVERY - Varnish HTTP misc-frontend - port 3122 on cp1045 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.001 second response time
[07:35:38] (PS1) Marostegui: mariadb: Add db2086 to s5 and later s8 [puppet] - https://gerrit.wikimedia.org/r/384659 (https://phabricator.wikimedia.org/T170662)
[07:36:24] (PS1) Marostegui: s5.hosts: Add db2086 to s5 [software] - https://gerrit.wikimedia.org/r/384660 (https://phabricator.wikimedia.org/T170662)
[07:36:33] !log rebooting boron for kernel update
[07:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:43] (CR) Marostegui: [C: 2] s5.hosts: Add db2086 to s5 [software] - https://gerrit.wikimedia.org/r/384660 (https://phabricator.wikimedia.org/T170662) (owner: Marostegui)
[07:39:30] (Merged) jenkins-bot: s5.hosts: Add db2086 to s5 [software] - https://gerrit.wikimedia.org/r/384660 (https://phabricator.wikimedia.org/T170662) (owner: Marostegui)
[07:40:27] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.298 second response time
[07:40:39] (CR) Marostegui: [C: 2] "Looks good https://puppet-compiler.wmflabs.org/compiler02/8352/" [puppet] - https://gerrit.wikimedia.org/r/384659 (https://phabricator.wikimedia.org/T170662) (owner: Marostegui)
[07:42:29] !log Stop MySQL on db2079 to clone db2086 - T170662
[07:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:36] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662
[07:46:51] Operations, Traffic, Patch-For-Review, Performance-Team (Radar): Upgrade to Varnish 5 - https://phabricator.wikimedia.org/T168529#3689628 (ema)
[07:46:54] Operations, Traffic, Patch-For-Review, Performance-Team (Radar): Upgrade cache_misc to Varnish 5 - https://phabricator.wikimedia.org/T177233#3689625 (ema) Open>Resolved a:ema Done.
[07:47:14] (PS1) Marostegui: db-eqiad,db-codfw.php: Add db2086 to the config [mediawiki-config] - https://gerrit.wikimedia.org/r/384661 (https://phabricator.wikimedia.org/T170662)
[07:49:05] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Add db2086 to the config [mediawiki-config] - https://gerrit.wikimedia.org/r/384661 (https://phabricator.wikimedia.org/T170662) (owner: Marostegui)
[07:50:16] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Add db2086 to the config [mediawiki-config] - https://gerrit.wikimedia.org/r/384661 (https://phabricator.wikimedia.org/T170662) (owner: Marostegui)
[07:50:25] (CR) jenkins-bot: db-eqiad,db-codfw.php: Add db2086 to the config [mediawiki-config] - https://gerrit.wikimedia.org/r/384661 (https://phabricator.wikimedia.org/T170662) (owner: Marostegui)
[07:51:15] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add db2086 to the config - T170662 (duration: 00m 45s)
[07:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:23] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662
[07:52:39] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add db2086 to the config - T170662 (duration: 00m 45s)
[07:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:40] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add db2086 to the config - T170662 (duration: 00m 45s)
[07:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:27] !log Stop MySQL and reboot db2081 to see if it works fine - T178140
[07:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:34] T178140: db2081 unreachable - https://phabricator.wikimedia.org/T178140
[07:57:15] (CR) Hashar: [C: 1] Deploy Compact Language Links on the German Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/384527 (https://phabricator.wikimedia.org/T177836) (owner: KartikMistry)
[08:00:55] Operations, ops-codfw, DBA: db2081 unreachable - https://phabricator.wikimedia.org/T178140#3689646 (Marostegui) Open>Resolved I am going to close this for now as resolved. Rebooted the host twice without any issues. So far it looks good, if it happens again, we can reopen. Thanks @Papaul for...
[08:05:34] (CR) Gehel: Backends: add support to external backends plugins (1 comment) [software/cumin] - https://gerrit.wikimedia.org/r/384616 (https://phabricator.wikimedia.org/T178342) (owner: Volans)
[08:11:21] Operations: Integrate jessie 8.9 point release - https://phabricator.wikimedia.org/T171452#3689670 (MoritzMuehlenhoff) These are fully rolled out: libapache2-mod-perl2 gtk+2.0 gnutls28 libonig
[08:11:45] (CR) Gehel: [C: -1] "We're almost there! Some more comments inline..." (6 comments) [puppet] - https://gerrit.wikimedia.org/r/383916 (https://phabricator.wikimedia.org/T178096) (owner: Bearloga)
[08:11:49] Operations: Integrate stretch 9.1 point release - https://phabricator.wikimedia.org/T171453#3689671 (MoritzMuehlenhoff) Open>Resolved This is complete
[08:12:07] Operations, User-fgiunchedi: Integrate stretch 9.2 point release - https://phabricator.wikimedia.org/T177739#3689673 (MoritzMuehlenhoff) These are fully rolled out: linux 4.9.51 (with a few reboots pending)
[08:16:02] (CR) Ema: "Thanks Alex!" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/384520 (owner: Ema)
[08:18:02] (PS2) Ema: varnish: execution environment configuration [puppet] - https://gerrit.wikimedia.org/r/384520
[08:20:45] (CR) Ema: varnish: execution environment configuration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/384520 (owner: Ema)
[08:27:25] (PS1) Gehel: wdqs: cleanup JVM options [puppet] - https://gerrit.wikimedia.org/r/384663 (https://phabricator.wikimedia.org/T175919)
[08:36:25] Operations, Collaboration-Team-Triage, Notifications, Anti-Harassment (AHT Sprint 7), Patch-For-Review: Recover Echo Notification Blacklist from Backup - https://phabricator.wikimedia.org/T178313#3689690 (jcrespo) I've checked other wikis, the pattern repeats: * The only mention of echo-noti...
[08:39:28] !log mobrovac@tin Started deploy [restbase/deploy@3bcd1b7]: Remove /page/revision and add history metrics end points - T158100 T175805
[08:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:36] T158100: Deprecate and remove the public title/{title} endpoint - https://phabricator.wikimedia.org/T158100
[08:39:36] T175805: Add mediawiki-history metrics to AQS - https://phabricator.wikimedia.org/T175805
[08:40:31] !log mobrovac@tin Finished deploy [restbase/deploy@3bcd1b7]: Remove /page/revision and add history metrics end points - T158100 T175805 (duration: 01m 05s)
[08:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:20] !log mobrovac@tin Started deploy [restbase/deploy@3bcd1b7]: Remove /page/revision and add history metrics end points, part #2
[08:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:30] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 404 (expecting: 200)
[08:43:11] ignore ^
[08:45:55] !log restarting blazegraph on wdqs1004 for GC tuning (adding -XX:+G1PrintRegionLivenessInfo) - T175919
[08:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:03] T175919: investigate GC times on wikidata query service - https://phabricator.wikimedia.org/T175919
[08:55:42] (CR) Marostegui: [C: 2] "Going to start testing this on db2084" [puppet] - https://gerrit.wikimedia.org/r/384452 (owner: Marostegui)
[09:08:16] !log mobrovac@tin Finished deploy [restbase/deploy@3bcd1b7]: Remove /page/revision and add history metrics end points, part #2 (duration: 25m 57s)
[09:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:29] !log mobrovac@tin Started deploy [restbase/deploy@7f228d4]: Remove /page/revision and add history metrics end points, part #3
[09:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:59] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[09:16:03] !log mobrovac@tin Finished deploy [restbase/deploy@7f228d4]: Remove /page/revision and add history metrics end points, part #3 (duration: 07m 34s)
[09:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:28] (PS9) Marostegui: [WIP]: Support multiinstance in core servers [puppet] - https://gerrit.wikimedia.org/r/384452
[09:20:53] (CR) Marostegui: [C: 2] [WIP]: Support multiinstance in core servers [puppet] - https://gerrit.wikimedia.org/r/384452 (owner: Marostegui)
[09:23:09] !log Stop MySQL on db2084 for multi-instance testing
[09:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:09] (PS1) Volans: ssh-keys: remove roca-vulnerable keys [puppet] - https://gerrit.wikimedia.org/r/384670
[09:25:11] (CR) Ema: [C: 1] ssh-keys: remove roca-vulnerable keys [puppet] - https://gerrit.wikimedia.org/r/384670 (owner: Volans)
[09:25:49] (PS2) Volans: ssh-keys: remove roca-vulnerable keys [puppet] - https://gerrit.wikimedia.org/r/384670
[09:27:07] (CR) Volans: [C: 2] ssh-keys: remove roca-vulnerable keys [puppet] - https://gerrit.wikimedia.org/r/384670 (owner: Volans)
[09:30:58] (CR) Alexandros Kosiaris: varnish: execution environment configuration (3 comments) [puppet] - https://gerrit.wikimedia.org/r/384520 (owner: Ema)
[09:35:55] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - https://gerrit.wikimedia.org/r/384674
[09:35:59] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - https://gerrit.wikimedia.org/r/384674
[09:37:54] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - https://gerrit.wikimedia.org/r/384674 (owner: Marostegui)
[09:38:43] (CR) Jcrespo: [C: 1] Enable description usage tracking on a few test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/384003 (https://phabricator.wikimedia.org/T177155) (owner: Hoo man)
[09:39:05] (CR) Jcrespo: [C: 1] Re-enable Statement usage tracking on cawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/384592 (https://phabricator.wikimedia.org/T151717) (owner: Hoo man)
[09:39:08] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - https://gerrit.wikimedia.org/r/384674 (owner: Marostegui)
[09:39:15] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - https://gerrit.wikimedia.org/r/384674 (owner: Marostegui)
[09:40:35] (CR) Ema: varnish: execution environment configuration (2 comments) [puppet] - https://gerrit.wikimedia.org/r/384520 (owner: Ema)
[09:46:30] (PS3) Ema: varnish: execution environment configuration [puppet] - https://gerrit.wikimedia.org/r/384520
[09:46:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1085 - T174509 T177772 (duration: 00m 46s)
[09:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:42] T177772: Purge 90% of rows from recentchanges (and posibly defragment) from commonswiki and ruwiki (the ones with source:wikidata) - https://phabricator.wikimedia.org/T177772
[09:46:42] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[09:48:25] PROBLEM - High lag on wdqs1004 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0]
[09:48:44] (PS4) Ema: varnish: execution environment configuration [puppet] - https://gerrit.wikimedia.org/r/384520
[09:49:17] (CR) Ema: varnish: execution environment configuration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/384520 (owner: Ema)
[09:52:55] (CR) Giuseppe Lavagetto: Port docker builder (16 comments) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) (owner: Giuseppe Lavagetto)
[09:57:16] (PS6) Giuseppe Lavagetto: Port docker builder [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276)
[09:58:24] Operations, Goal: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208#3689860 (jcrespo)
[10:11:14] Operations, Traffic, Wikimedia-Logstash, Patch-For-Review, Services (watching): RESTBase logs disappeared from logstash - https://phabricator.wikimedia.org/T178078#3689875 (Pchelolo) The logs are back where they belong, so I guess the ticket can be resolved. Thank you @fgiunchedi
[10:16:36] !log removing akosiaris's yubikey from network devices
[10:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:54] Operations, Continuous-Integration-Infrastructure (shipyard), Patch-For-Review, User-Joe: Unify production and CI docker image build process - https://phabricator.wikimedia.org/T177276#3689890 (Addshore) 1 more thing to throw into the mix. Right now we have a mediawiki-phan image, and I want to...
[10:26:35] !log start of cleaning up ores_classification in cswiki (T159753)
[10:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:42] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[10:36:32] !log end of cleaning up ores_classification table in cswiki, start of etwiki (T159753)
[10:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:40] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[10:39:05] (CR) Alexandros Kosiaris: varnish: execution environment configuration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/384520 (owner: Ema)
[10:40:37] !log end of cleaning up ores_classification table in etwiki, start of fawiki (T159753)
[10:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:00] !log restarting wdqs-updater on wdqs1004 - T175919
[11:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:07] T175919: investigate GC times on wikidata query service - https://phabricator.wikimedia.org/T175919
[11:24:06] !log end of cleaning up ores_classification table in fawiki, start of fiwiki (T159753)
[11:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:13] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[11:32:18] !log test-installing proxysql on wasat T175672 [11:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:24] T175672: Make apache/maintenance hosts TLS connections to mariadb work - https://phabricator.wikimedia.org/T175672 [11:35:06] (03PS2) 10Volans: PuppetDB backend: Class, Roles and Profiles shortcuts [software/cumin] - 10https://gerrit.wikimedia.org/r/384547 (https://phabricator.wikimedia.org/T178279) [11:36:39] (03PS2) 10Volans: Backends: add support to external backends plugins [software/cumin] - 10https://gerrit.wikimedia.org/r/384616 (https://phabricator.wikimedia.org/T178342) [11:36:48] (03CR) 10Volans: "done" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/384616 (https://phabricator.wikimedia.org/T178342) (owner: 10Volans) [11:39:43] !log end of cleaning up ores_classification table in fiwiki, start of hewiki (T159753) [11:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:50] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753 [11:45:11] !log end of cleaning up ores_classification table in hewiki, start of nlwiki (T159753) [11:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:18] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753 [11:47:02] !log Optimize templatelinks and pagelinks on db1102 for s4 - T174509 [11:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:09] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [11:47:44] !log Optimize recentchanges on db1102 for s4 - T174509 [11:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:29] (03PS1) 10Marostegui: install_server: Migrate db2084 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/384690 
(https://phabricator.wikimedia.org/T178359) [11:56:07] (03PS2) 10Marostegui: install_server: Reinstall db2084 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/384690 (https://phabricator.wikimedia.org/T178359) [12:40:22] !log upgrading mwdebug* to hhvm-wikidiff 1.5.1 [12:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:54] (03CR) 10Aklapper: "Works like a charm! Thanks a lot!" [puppet] - 10https://gerrit.wikimedia.org/r/380959 (owner: 10Aklapper) [12:42:54] RECOVERY - High lag on wdqs1004 is OK: OK: Less than 30.00% above the threshold [600.0] [12:45:35] PROBLEM - Host db1101 is DOWN: PING CRITICAL - Packet loss = 100% [12:46:18] checking that [12:46:21] it is depooled anyways [12:49:58] 10Operations, 10ops-eqiad, 10DBA: db1101 crashed - https://phabricator.wikimedia.org/T178383#3690357 (10Marostegui) [12:50:08] 10Operations, 10ops-eqiad, 10DBA: db1101 crashed - https://phabricator.wikimedia.org/T178383#3690369 (10Marostegui) p:05Triage>03Normal [12:50:27] ha [12:50:39] is that the one that crashed while I was on holidays? [12:50:40] maybe a duplicate? [12:50:43] or was it db1100? [12:50:48] checking [12:50:56] i cannot find a ticket for db1101 :( [12:51:11] it was 1100 [12:51:15] awesome... [12:51:17] but too close to be a coincidence [12:51:27] i am checking the hw logs [12:51:33] but yes, it doesn't smell good [12:51:34] was maintenance ongoing there? [12:51:37] you have the ticket for db1100? 
[12:51:38] or just depooled [12:51:39] yep, alter tables [12:51:42] https://phabricator.wikimedia.org/T175973 [12:51:43] it was depooled + alters [12:51:44] mmm [12:51:46] thanks [12:52:06] it was IME last time [12:53:16] !log upgrading cumin masters to cumin 1.2.2-1 [12:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:08] (03PS1) 10Jcrespo: proxysql: Setup proxysql on terbium/wasat as a test [puppet] - 10https://gerrit.wikimedia.org/r/384695 (https://phabricator.wikimedia.org/T175672) [12:55:51] (03PS2) 10Jcrespo: proxysql: Setup proxysql on terbium/wasat as a test [puppet] - 10https://gerrit.wikimedia.org/r/384695 (https://phabricator.wikimedia.org/T175672) [12:56:25] jouncebot: next [12:56:26] In 0 hour(s) and 3 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171017T1300) [12:56:52] 10Operations, 10ops-eqiad, 10DBA: db1101 crashed - memory errors - https://phabricator.wikimedia.org/T178383#3690380 (10Marostegui) [12:56:54] 10Operations, 10ops-eqiad, 10DBA: db1101 crashed - memory errors - https://phabricator.wikimedia.org/T178383#3690357 (10Marostegui) [12:56:57] 10Operations, 10ops-eqiad, 10DBA: db1101 crashed - memory errors - https://phabricator.wikimedia.org/T178383#3690357 (10Marostegui) [12:57:25] (03PS3) 10Jcrespo: proxysql: Setup proxysql on terbium/wasat as a test [puppet] - 10https://gerrit.wikimedia.org/r/384695 (https://phabricator.wikimedia.org/T175672) [12:59:39] (03PS2) 10Zfilipin: Deploy Compact Language Links on the German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384527 (https://phabricator.wikimedia.org/T177836) (owner: 10KartikMistry) [12:59:59] (03PS2) 10Zfilipin: [cirrus] Fix typo in UserTesting config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384529 (https://phabricator.wikimedia.org/T177502) (owner: 10DCausse) [13:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, 
RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171017T1300). [13:00:05] kart_ and dcausse: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:15] I can SWAT today! [13:00:17] o/ [13:00:25] * kart_ here [13:00:41] (03PS4) 10Jcrespo: proxysql: Setup proxysql on terbium/wasat as a test [puppet] - 10https://gerrit.wikimedia.org/r/384695 (https://phabricator.wikimedia.org/T175672) [13:00:50] zeljkof: for me, the first patch need to deploy first and then I'll run the script and then deploy of config. [13:01:22] kart_: ok, want to deploy everything yourself? or should I deploy some parts? or everything? [13:02:01] 10Operations, 10ops-eqiad, 10DBA: db1101 crashed - memory errors - https://phabricator.wikimedia.org/T178383#3690405 (10Marostegui) a:03Cmjohnson @Cmjohnson can we get a new dimm for this host to replaced that one mentioned on the logs? Thanks [13:02:13] zeljkof: please deploy. I'll take care of script part. [13:02:23] cmjohnson1: ^ [13:02:39] zeljkof: deploy first patch, I'll test run + run the script and then deploy config. 
[13:02:49] moarostegui: okay...it's a HP so it will take a little longer than Dell [13:02:52] wmf3+config is for you :) [13:03:17] kart_: ok, merging and deploying 384666 [13:03:24] cmjohnson1: that one is a DEll I believe [13:03:43] cmjohnson1: yep, according to racktables, it is :) [13:04:25] (03PS5) 10Ema: varnish: execution environment configuration [puppet] - 10https://gerrit.wikimedia.org/r/384520 [13:05:12] 10Operations, 10ops-eqiad, 10DBA: db1101 crashed - memory errors - https://phabricator.wikimedia.org/T178383#3690430 (10Marostegui) I have powercycled the host and it came back up fine, but we better replace that DIMM as the server is quite new. Going to execute the alters again to see if it crashes once more. [13:05:27] zeljkof: cool. Let me know when done. I can't test it without dry running the script. [13:05:46] kart_: ok, so no need to deploy to mwdebug? [13:05:54] or should I deploy there first? [13:06:06] zeljkof: no. Just go ahead and deploy to wmf.3 [13:06:10] (03CR) 10Ema: varnish: execution environment configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384520 (owner: 10Ema) [13:06:13] kart_: ok [13:06:43] hi kart_ [13:07:05] hello aharoni [13:07:46] are you deploying the new script? [13:08:10] aharoni: zeljkof is deploying the script, after that I'll dry-run the output [13:08:13] 10Operations, 10Contributors-Team, 10MobileFrontend, 10wikidiff2, and 4 others: Diff page consistently produces 503 on beta cluster on first visit - https://phabricator.wikimedia.org/T176637#3690441 (10Tobi_WMDE_SW) [13:08:17] kart_: goog [13:08:20] good [13:09:16] kart_: The script can be longish. [13:09:34] PROBLEM - HHVM rendering on mw2216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:09:53] aharoni: yes. but we need to run updated script. 
[13:10:11] kart_: yes, of course [13:10:24] RECOVERY - HHVM rendering on mw2216 is OK: HTTP OK: HTTP/1.1 200 OK - 74954 bytes in 0.283 second response time [13:11:34] kart_: can I deploy dcausse's commits while you are running the script? looks like his couple of commits are not related to your commits [13:11:55] zeljkof: Sure [13:12:09] zeljkof: yes [13:12:28] kart_, aharoni: great, thanks, that will speed things up [13:12:30] zeljkof: also, the script cannot really be tested on web [13:12:50] zeljkof: Let me know once you deploy it, I need to quickly start dry-run. [13:12:52] but, come to think of it, kart_ - can you read the script's code before you run it? to check whether it's the new version. [13:13:12] aharoni: yes. I'll confirm it. [13:13:15] good, please do [13:13:27] kart_: the last job is finishing in a few seconds [13:13:28] done [13:13:44] (03PS1) 10Volans: Cumin WMCS: explicitely install OpenStack dependencies [puppet] - 10https://gerrit.wikimedia.org/r/384701 [13:14:33] zeljkof: deployed?
[13:14:41] deploying [13:14:50] sorry, merge is done, deploying now [13:14:53] (03PS4) 10BBlack: browsersec: bump to 100% 2017-10-17, update translations [puppet] - 10https://gerrit.wikimedia.org/r/376316 (https://phabricator.wikimedia.org/T163251) [13:14:54] :D [13:15:17] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/384520 (owner: 10Ema) [13:17:01] (03CR) 10Herron: [C: 032] MX: Change Exim configuration to use letsencrypt certificate [puppet] - 10https://gerrit.wikimedia.org/r/384591 (https://phabricator.wikimedia.org/T174081) (owner: 10Herron) [13:17:09] (03PS3) 10Herron: MX: Change Exim configuration to use letsencrypt certificate [puppet] - 10https://gerrit.wikimedia.org/r/384591 (https://phabricator.wikimedia.org/T174081) [13:17:20] !log zfilipin@tin Synchronized php-1.31.0-wmf.3/extensions/UniversalLanguageSelector/maintenance/ULSCompactLinksDisablePref.php: SWAT: [[gerrit:384666|Remove the 20 edits threshold from ULSCompactLinksDisablePref.php]] (duration: 00m 46s) [13:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:35] kart_: deployed [13:17:35] (03CR) 10Ema: "The only catalog changes are due to using Base::Service_unit instead of Systemd::Unit: https://puppet-compiler.wmflabs.org/compiler02/8353" [puppet] - 10https://gerrit.wikimedia.org/r/384520 (owner: 10Ema) [13:18:06] zeljkof: kart_ is running [13:18:11] kart_: I will start merging and deploying dcausse's commits, let me know when you are done, I can switch back to your second commit [13:18:24] zeljkof: OK. [13:18:34] aharoni: I hope he is running the script, not running ;) [13:18:54] dcausse: reviewing 384529 [13:18:58] ok [13:19:36] zeljkof: haha :) [13:19:37] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384529 (https://phabricator.wikimedia.org/T177502) (owner: 10DCausse) [13:19:38] zeljkof: kart_ is good at running and at running scripts! 
[13:19:48] aharoni: --really? :) [13:20:02] aharoni: but can he run _while_ running the script ;) [13:20:44] aharoni, kart_: lots of those in the logs "at runtime/ext_mysql: slow query: SELECT MASTER_GTID_WAIT" [13:20:52] (03Merged) 10jenkins-bot: [cirrus] Fix typo in UserTesting config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384529 (https://phabricator.wikimedia.org/T177502) (owner: 10DCausse) [13:21:07] (03CR) 10jenkins-bot: [cirrus] Fix typo in UserTesting config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384529 (https://phabricator.wikimedia.org/T177502) (owner: 10DCausse) [13:21:15] (03PS5) 10BBlack: browsersec: bump to 100% 2017-10-17, update translations [puppet] - 10https://gerrit.wikimedia.org/r/376316 (https://phabricator.wikimedia.org/T163251) [13:21:41] zeljkof: which wiki? [13:22:03] kart_: in hhvm.log on mwlog1001 [13:22:05] zeljkof: does it just mean that it's slow? or does it mean that maybe something is lost? [13:22:26] aharoni: I don't really know, I have just noticed many new error messages since the deploy [13:22:29] zeljkof: because I know that it's longish, and that's OK.
[13:22:43] (03CR) 10BBlack: [C: 031] varnish: execution environment configuration [puppet] - 10https://gerrit.wikimedia.org/r/384520 (owner: 10Ema) [13:22:56] zeljkof: shouldn't be since the _deploy_, but not surprising if it's since the _running_ [13:23:02] aharoni: not saying anything is wrong, just noticing [13:23:02] (03CR) 10Muehlenhoff: [C: 031] Cumin WMCS: explicitely install OpenStack dependencies [puppet] - 10https://gerrit.wikimedia.org/r/384701 (owner: 10Volans) [13:23:25] (03CR) 10BBlack: [C: 032] varnish: execution environment configuration [puppet] - 10https://gerrit.wikimedia.org/r/384520 (owner: 10Ema) [13:23:31] zeljkof: yeah, that's good [13:23:35] (03CR) 10BBlack: [C: 031] varnish: execution environment configuration [puppet] - 10https://gerrit.wikimedia.org/r/384520 (owner: 10Ema) [13:23:43] dcausse: 384529 is at mwdebug1002, please check and let me know if I can deploy [13:23:46] (03CR) 10BBlack: [C: 032] browsersec: bump to 100% 2017-10-17, update translations [puppet] - 10https://gerrit.wikimedia.org/r/376316 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [13:25:42] (03PS2) 10Volans: Cumin WMCS: explicitely install OpenStack dependencies [puppet] - 10https://gerrit.wikimedia.org/r/384701 [13:25:42] kart_, aharoni: https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors?_g=h@44136fa&_a=h@f9b0d18 [13:25:52] lots of "Server {host} has {lag} seconds of lag (>= {maxlag})" [13:26:18] there is lag on s4 [13:26:21] what is going on? [13:26:44] jynus: kart_ is running a script, maybe that is causing the lag [13:26:59] kart_ STOP NO [13:27:01] W [13:27:13] <_joe_> why is that not !logged [13:27:14] <_joe_> ? [13:27:31] commons is going to be down or degraded [13:27:38] <_joe_> also, where is that script being run from? [13:27:43] <_joe_> fuck. [13:28:01] aharoni, kart_ ^ [13:28:04] I am checking terbium [13:28:10] <_joe_> not there [13:28:46] <_joe_> zeljkof: where is that script being run? 
[13:28:48] _joe_: script is over few min. back [13:28:50] my guess would be terbium [13:28:58] _joe_: tin. [13:28:59] I found it, killing it [13:29:06] <_joe_> jynus: ? [13:29:12] <_joe_> which one? I didn't find one on tin [13:29:14] jynus: still running? It is done already. [13:29:17] <_joe_> where kartik was [13:29:25] <_joe_> so you're killing something else [13:29:31] !log killing commonswiki script on tin [13:29:34] PROBLEM - puppet last run on cp1063 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:09] <_joe_> jynus: what was the ps line? [13:30:21] kart_: you are done with the script? I am waiting for dcausse to confirm 384529 is ok to deploy, it's already at mwdebug1002 [13:30:23] _joe_: script runs on dewiki, not commons. [13:30:29] "mysql -uwikiadmin -p -h10.64.48.23 commonswiki" [13:30:33] _joe_, jynus: should I stop the SWAT? [13:30:37] <_joe_> so various levels of wtfs [13:30:39] <_joe_> zeljkof: indeed [13:30:43] zeljkof: let's stop for now yes [13:30:47] <_joe_> 1) never ever run scripts from tin [13:31:05] !log stopping EU SWAT [13:31:08] who is mlitn ? [13:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:16] <_joe_> 2) if the script touches the db significantly, tell the dbas [13:31:31] jynus that would be me [13:31:33] jynus: Found the person on the contact list [13:31:37] yep XD [13:31:51] this is the second time it happens this week, maybe we should think of start retiring permissions [13:31:53] _joe_: was that our script causing issue or something else? [13:32:05] <_joe_> 3) any script run in production that is changing anything needs to be properly logged in the SAL [13:32:10] jynus: Matthias Mullie [13:32:13] <_joe_> kart_: not sure [13:32:22] _joe_: OK. Do let me know. 
[13:32:38] 10Operations, 10Mail, 10Patch-For-Review: mail.wikimedia.org SSL cert expiring Mon 23 Oct 2017 - https://phabricator.wikimedia.org/T174081#3690551 (10herron) 05Open>03Resolved LE certs have been deployed to mx1001 and mx2001. ``` Certificate: Data: Version: 3 (0x2) Serial Number:... [13:32:42] <_joe_> if your script was on commons, the issue wasn't caused by your script [13:32:44] matthiasmullie: what were you running? [13:33:52] dcausse: SWAT is stopped for now, still around? [13:34:00] zeljkof: yes [13:34:20] dcausse: did you test 384529 at mwdebug1002? [13:34:25] jynus UPDATE image SET img_media_type="AUDIO", img_major_mime="audio" WHERE img_media_type="VIDEO" AND img_major_mime="video" AND img_minor_mime="webm" AND img_metadata LIKE '%s:9:"mime_type";s:10:"audio/webm";%'; [13:34:34] RECOVERY - puppet last run on cp1063 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:34:40] (+ oldimage + filearchive) [13:34:44] zeljkof: I was testing it, and did not work as expected and was investigating [13:34:56] matthiasmullie: WTF [13:35:11] dcausse: ok, feel free to continue, I did not revert anything [13:35:14] <_joe_> ahahah [13:35:18] zeljkof: thanks [13:35:37] <_joe_> matthiasmullie: please NEVER do an update in production without asking the DBAs first [13:35:53] <_joe_> matthiasmullie: that's gonna cause havoc and outages [13:35:59] were you literally running that? [13:36:01] matthiasmullie: probably doing it during regular SWAT is not a good idea too [13:36:02] <_joe_> it's also very very wrong in terms of process [13:36:13] _joe_ got it, won't happen again [13:36:14] <_joe_> zeljkof: it's not a good idea EVER [13:36:16] from the command line? [13:36:20] in a single transaction? 
[13:36:30] jynus yes, that was the exact query [13:36:34] pffff [13:38:17] (03PS4) 10Aude: Stop using $wgWikibaseSharedCacheKeyPrefix from Wikidata build [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381371 (https://phabricator.wikimedia.org/T176948) [13:38:36] (03PS5) 10Aude: Stop using $wgWikibaseSharedCacheKeyPrefix from Wikidata build [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381371 (https://phabricator.wikimedia.org/T176948) [13:39:47] matthiasmullie: there are _many_ things wrong in doing it that way. First you should have contacted the DBAs explaining what you wanted to do and when [13:39:50] matthiasmullie: what was the plan if the query, aside from blocking all writes on commons, would fail and corrupt the image table? [13:41:28] (03CR) 10Volans: [C: 032] Cumin WMCS: explicitely install OpenStack dependencies [puppet] - 10https://gerrit.wikimedia.org/r/384701 (owner: 10Volans) [13:41:42] marostegui: I'll send mail to DBA+Ops for our script run queries. AFAIK, it was reviewed by DBA, but I'll double check. [13:41:48] _joe_: ^ [13:41:54] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0 [13:41:55] matthiasmullie: Second, NEVER run updates from the command line just like that, use mediawiki.
It will make sure consistency is there, it will throttle the writes, will wait for replication…all the sanity checks needed to perform a safe (and multiple, and not just one) transaction [13:42:05] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 [13:42:11] matthiasmullie: Third, please use !log to log when you start the script and from where [13:42:14] <_joe_> kart_: yeah sorry you were just running the script at the same time as another that was killing commons :) [13:42:23] <_joe_> kart_: still, remember to !log here [13:42:27] and second, did you notice the 200 000 mediawiki errors? https://logstash.wikimedia.org/goto/2f63333fab7e7a078d643c39c21a22fb [13:42:33] Noted _joe_ [13:43:03] matthiasmullie: Fourth, please ALWAYS monitor what you are doing, at least at the start. To make sure things are going as expected (oh. what jynus just wrote) [13:43:13] jynus, marostegui: will I be able to continue with SWAT, or should I start reverting? there are a couple of commits that are in different stages of deployment [13:43:32] zeljkof: let me check that the queries are not still rolling back [13:43:37] and we are back to normal [13:43:38] kart_: are you done with 384666? [13:43:39] _joe_: Yes. zeljkof noticed it, but I wasn't sure since I was using dewiki. [13:43:41] jynus: thanks [13:44:02] zeljkof: It is deployed, right? [13:44:12] marostegui jynus It touched ~100 rows, which I (mistakenly) assumed wouldn't be a problem - this won't happen again [13:44:16] zeljkof: We have to abort further running script today. [13:44:31] kart_: yes, from my side; from your side the script is done? [13:44:33] zeljkof: and so, 384527. [13:44:38] apologies for the troubles! [13:44:41] zeljkof: yes. done. [13:45:02] can we please avoid saying "WTF" or scolding people like that please? [13:45:03] zeljkof: please don't deploy 384527.
[13:45:23] zeljkof: things seem ok [13:45:23] we're all on the same team, an honest mistake was made, let's make sure we communicate what went wrong without assigning blame [13:45:47] and potentially take technical measures in order to avoid this kind of thing in the future [13:45:51] kart_: ok, to summarize, 384666 is deployed, nothing to do, 384527 should not be deployed; so you are done? [13:46:02] zeljkof: yes. [13:46:18] zeljkof: Thanks for (not) deploying with us :D [13:46:22] and just be nice to each other, please [13:46:33] paravoid: +1 [13:46:39] kart_: in that case, thanks for deploying with #releng ;) [13:46:59] jynus: ok, I can continue with SWAT? [13:47:01] * kart_ goes on vacation now [13:47:04] yes [13:47:10] kart_: enjoy! [13:47:17] jynus: great, thanks :) [13:47:23] !log EU SWAT continues [13:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:56] dcausse: did you test 384529? we can continue the swat [13:48:16] kart_: have fun :) [13:48:26] zeljkof: yes and found a problem [13:48:54] dcausse: should I revert 384529? [13:49:46] zeljkof: it's fine to deploy if easier, I have a followup patch to insert if that's ok with you? [13:50:08] dcausse: sure, I can deploy both at the same time, ok? [13:50:24] zeljkof: sure, lemme upload the new patch [13:50:29] dcausse: ok [13:51:52] 10Operations, 10Operations-Software-Development: Upgrade Cumin masters to stretch - https://phabricator.wikimedia.org/T177385#3690615 (10MoritzMuehlenhoff) Actually when looking at Racktables both neodymium and sarin had their warranty expired in January 2016, so they're pretty close to our usual five years li... 
[13:52:04] (03PS1) 10DCausse: [cirrus] Fix typo in UserTesting config var (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384703 (https://phabricator.wikimedia.org/T177502) [13:52:09] (03CR) 10Rush: "I will bring this up in our weekly today" [puppet] - 10https://gerrit.wikimedia.org/r/382917 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [13:52:12] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Fix typo in UserTesting config var (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384703 (https://phabricator.wikimedia.org/T177502) (owner: 10DCausse) [13:52:26] (03PS2) 10DCausse: [cirrus] Fix typo in UserTesting config var (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384703 (https://phabricator.wikimedia.org/T177502) [13:52:46] (03PS9) 10Rush: WIP openstack: dns-floating-ip-updater to module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/384620 (https://phabricator.wikimedia.org/T171583) [13:53:09] zeljkof: https://gerrit.wikimedia.org/r/#/c/384703/ [13:53:10] (03PS10) 10Rush: WIP openstack: dns-floating-ip-updater to module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/384620 (https://phabricator.wikimedia.org/T171583) [13:53:47] (03CR) 10jerkins-bot: [V: 04-1] WIP openstack: dns-floating-ip-updater to module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/384620 (https://phabricator.wikimedia.org/T171583) (owner: 10Rush) [13:53:48] dcausse: please also add to the deployment calendar, reviewing [13:53:53] ok [13:54:39] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384703 (https://phabricator.wikimedia.org/T177502) (owner: 10DCausse) [13:55:07] (03PS11) 10Rush: WIP openstack: dns-floating-ip-updater to module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/384620 (https://phabricator.wikimedia.org/T171583) [13:55:10] (03CR) 10Ema: [C: 04-1] "Do NOT merge. Changing the way the varnish service is defined caused varnish to be restarted." 
[puppet] - 10https://gerrit.wikimedia.org/r/384520 (owner: 10Ema) [13:55:47] (03Merged) 10jenkins-bot: [cirrus] Fix typo in UserTesting config var (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384703 (https://phabricator.wikimedia.org/T177502) (owner: 10DCausse) [13:56:13] (03CR) 10jenkins-bot: [cirrus] Fix typo in UserTesting config var (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384703 (https://phabricator.wikimedia.org/T177502) (owner: 10DCausse) [13:57:02] (03PS12) 10Rush: WIP openstack: dns-floating-ip-updater to module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/384620 (https://phabricator.wikimedia.org/T171583) [13:57:06] dcausse: it's at mwdebug1002, let me know when you test it [13:57:12] zeljkof: works fine now [13:57:20] dcausse: ok, deploying [13:58:17] zeljkof: if we're running out of time I can move the third one to another window [13:58:38] dcausse: please do, we have a couple minutes left, not enough [13:58:43] ok [13:59:00] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:384703|[cirrus] Fix typo in UserTesting config var (take 2) (T177502)]] (duration: 00m 45s) [13:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:07] T177502: Deploy A/B test to test relaxing the retrieval query filter - https://phabricator.wikimedia.org/T177502 [13:59:11] dcausse: it's deployed, please check [13:59:13] zeljkof: thanks! 
[13:59:25] dcausse: and thanks for deploying with #releng ;) [13:59:40] :) [14:00:33] !log EU SWAT finished [14:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:24] 10Operations, 10hardware-requests: Replacement hardware for cumin masters - https://phabricator.wikimedia.org/T178392#3690629 (10MoritzMuehlenhoff) [14:05:03] (03CR) 10Rush: "http://puppet-compiler.wmflabs.org/8354/" [puppet] - 10https://gerrit.wikimedia.org/r/384620 (https://phabricator.wikimedia.org/T171583) (owner: 10Rush) [14:05:05] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384705 [14:05:08] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384705 [14:05:21] !log start of cleaning up ores_classification table in plwiki (T159753) [14:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:29] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753 [14:05:32] (03PS14) 10Rush: openstack: dns-floating-ip-updater to module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/384620 (https://phabricator.wikimedia.org/T171583) [14:05:47] (03PS6) 10Ema: varnish: execution environment configuration [puppet] - 10https://gerrit.wikimedia.org/r/384520 [14:06:58] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384705 (owner: 10Marostegui) [14:08:15] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384705 (owner: 10Marostegui) [14:08:25] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384705 (owner: 10Marostegui) [14:09:23] (03CR) 10Rush: [C: 032] openstack: dns-floating-ip-updater to module/profile/role [puppet] - 
10https://gerrit.wikimedia.org/r/384620 (https://phabricator.wikimedia.org/T171583) (owner: 10Rush) [14:09:34] (03Abandoned) 10Gehel: osm: install prerequisite packages for meddo [puppet] - 10https://gerrit.wikimedia.org/r/328176 (https://phabricator.wikimedia.org/T153289) (owner: 10Gehel) [14:09:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1081 - T174509 T177772 (duration: 00m 45s) [14:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:43] T177772: Purge 90% of rows from recentchanges (and posibly defragment) from commonswiki and ruwiki (the ones with source:wikidata) - https://phabricator.wikimedia.org/T177772 [14:09:44] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [14:10:02] (03Abandoned) 10Gehel: maps - cleartables osm replication [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613) (owner: 10Gehel) [14:17:38] 10Operations, 10Collaboration-Team-Triage, 10Notifications, 10Anti-Harassment (AHT Sprint 7), 10Patch-For-Review: Recover Echo Notification Blacklist from Backup - https://phabricator.wikimedia.org/T178313#3690690 (10Anomie) >>! In T178313#3689502, @dbarratt wrote: > In https://gerrit.wikimedia.org/r/#/c... 
[14:18:07] (03CR) 10Alexandros Kosiaris: [C: 031] varnish: execution environment configuration [puppet] - 10https://gerrit.wikimedia.org/r/384520 (owner: 10Ema) [14:19:25] (03PS1) 10BBlack: browsersec: use status code 403 [puppet] - 10https://gerrit.wikimedia.org/r/384707 (https://phabricator.wikimedia.org/T163251) [14:20:00] (03CR) 10BBlack: [C: 032] browsersec: use status code 403 [puppet] - 10https://gerrit.wikimedia.org/r/384707 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [14:21:24] 10Operations, 10Wikimedia-Fundraising-CiviCRM, 10fundraising-tech-ops: mintaka disk space warning - https://phabricator.wikimedia.org/T177852#3690713 (10Jgreen) a:03Jgreen [14:21:38] (03PS1) 10Rush: labstore: use profile param for observer_pass [puppet] - 10https://gerrit.wikimedia.org/r/384708 (https://phabricator.wikimedia.org/T171494) [14:25:02] !log end of cleaning up ores_classification table in plwiki, start of ruwiki (T159753) [14:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:10] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753 [14:25:55] this is going to take a little bit of time, 8M rows there [14:28:32] (03CR) 10Rush: [C: 032] labstore: use profile param for observer_pass [puppet] - 10https://gerrit.wikimedia.org/r/384708 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [14:39:49] (03PS1) 10Rush: mariadb: remove cruft class for old wikitech setup [puppet] - 10https://gerrit.wikimedia.org/r/384710 (https://phabricator.wikimedia.org/T171494) [14:44:47] (03CR) 10Rush: "http://puppet-compiler.wmflabs.org/8356/" [puppet] - 10https://gerrit.wikimedia.org/r/384710 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [14:45:16] 10Operations, 10Collaboration-Team-Triage, 10Notifications, 10Anti-Harassment (AHT Sprint 7), 10Patch-For-Review: Recover Echo Notification Blacklist from Backup - https://phabricator.wikimedia.org/T178313#3690799 
(10dbarratt) >>! In T178313#3689690, @jcrespo wrote: > I've checked other wikis, the patter... [14:45:20] (03CR) 10Marostegui: [C: 031] mariadb: remove cruft class for old wikitech setup [puppet] - 10https://gerrit.wikimedia.org/r/384710 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [14:45:35] (03CR) 10Rush: [C: 032] mariadb: remove cruft class for old wikitech setup [puppet] - 10https://gerrit.wikimedia.org/r/384710 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [14:47:18] !log wikitech: Cleaned up 'importers' user_group entries (T171682) [14:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:24] T171682: Remove 'importers' (note the ending 's') group from wikitech - https://phabricator.wikimedia.org/T171682 [14:53:40] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: frpm1001 is dead, looks like hardware failure - https://phabricator.wikimedia.org/T177710#3690841 (10Jgreen) 05Open>03Resolved a:03Jgreen this is done [14:53:45] (03PS1) 10Rush: openstack: move unused role/labs/dns [puppet] - 10https://gerrit.wikimedia.org/r/384712 (https://phabricator.wikimedia.org/T171494) [14:54:31] (03PS2) 10Rush: openstack: remove unused role/labs/dns [puppet] - 10https://gerrit.wikimedia.org/r/384712 (https://phabricator.wikimedia.org/T171494) [15:01:36] (03PS7) 10Ema: varnish: execution environment configuration [puppet] - 10https://gerrit.wikimedia.org/r/384520 [15:01:44] (03CR) 10Ema: [V: 032 C: 032] varnish: execution environment configuration [puppet] - 10https://gerrit.wikimedia.org/r/384520 (owner: 10Ema) [15:09:25] <_joe_> win 25 [15:10:10] (03PS1) 10Muehlenhoff: Fix setup of libapache2-mod-security2 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/384713 (https://phabricator.wikimedia.org/T174431) [15:10:40] (03CR) 10jerkins-bot: [V: 04-1] Fix setup of libapache2-mod-security2 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/384713 (https://phabricator.wikimedia.org/T174431) (owner: 
10Muehlenhoff) [15:11:38] (03PS2) 10Muehlenhoff: Fix setup of libapache2-mod-security2 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/384713 (https://phabricator.wikimedia.org/T174431) [15:12:18] (03CR) 10jerkins-bot: [V: 04-1] Fix setup of libapache2-mod-security2 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/384713 (https://phabricator.wikimedia.org/T174431) (owner: 10Muehlenhoff) [15:13:09] (03PS3) 10Muehlenhoff: Fix setup of libapache2-mod-security2 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/384713 (https://phabricator.wikimedia.org/T174431) [15:13:17] godog: thoughts on https://gerrit.wikimedia.org/r/#/c/384608/ ? [15:33:56] anyone else having a problem with pushing to gerrit? [15:34:19] and pulling too [15:34:19] <_joe_> moritzm: I don't think mod_security is installed and/or enabled at all on the mw hosts? [15:34:42] <_joe_> bmansurov: I have problems from time to time, but I cloned a repo 3 minutes ago [15:35:42] _joe_: thanks, i'll try again [15:36:00] <_joe_> moritzm: uh, it actually is, I completely forgot [15:36:16] no luck ;( [15:36:38] <_joe_> bmansurov: I'll check gerrit in ~ 2 minutes [15:36:49] thanks [15:37:43] <_joe_> bmansurov: what's the error you get if you run git with GIT_TRACE=1 ? [15:38:46] <_joe_> I can confirm I have the same issue now [15:38:53] the command halts at trace: run_command: 'ssh' '-p' '29418' 'bmansurov@gerrit.wikimedia.org' 'git-upload-pack '\''/mediawiki/services/chromium-render'\''' [15:38:53] <_joe_> so I can reproduce at least [15:39:20] <_joe_> I'll let you know, but I think it's just broken [15:39:35] gerrit or something on my end? [15:40:19] <_joe_> no no gerrit [15:40:26] <_joe_> nothing broken on your end :P [15:40:34] ok [15:43:13] do you want me to kill the stuck process? [15:45:45] <_joe_> bmansurov: can you try again? 
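Note on the GIT_TRACE=1 suggestion at 15:37:43: setting that environment variable makes git print each step it runs to stderr, which is how the stall was narrowed down to the `ssh ... git-upload-pack` step. A minimal Python sketch of the same diagnosis follows, assuming a local `git` binary on PATH. The helper names `traced_env` and `run_git_traced` and the 30-second timeout are illustrative assumptions, not anything used in the channel.

```python
import os
import subprocess


def traced_env(base=None):
    """Return a copy of the environment with git tracing switched on."""
    env = dict(os.environ if base is None else base)
    env["GIT_TRACE"] = "1"  # git writes trace lines to stderr when this is set
    return env


def run_git_traced(args, timeout=30):
    """Run `git <args>` with tracing; return (finished, stderr_text).

    If the command hangs (as the ssh clone did), it is killed after
    `timeout` seconds and whatever trace output was produced is returned,
    so the last trace line shows where it stalled.
    """
    try:
        proc = subprocess.run(
            ["git", *args],
            env=traced_env(),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return True, proc.stderr
    except subprocess.TimeoutExpired as exc:
        partial = exc.stderr or ""
        if isinstance(partial, bytes):  # partial output may arrive undecoded
            partial = partial.decode(errors="replace")
        return False, partial
```

Calling `run_git_traced(["ls-remote", "<ssh remote url>"])` with a placeholder URL would be one way to test reachability without a full clone. A `finished` value of False together with a final `run_command: 'ssh' ...` trace line would match what bmansurov reported.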
[15:46:00] <_joe_> !log killed a stuck gerrit process on cobalt [15:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:30] i am wondering what was the gerrit process that was stuck please? [15:51:13] _joe_: same thing, stuck [15:53:29] <_joe_> bmansurov: uhm it works for me now [15:53:45] <_joe_> paladox: git-receive-pack '/mediawiki/core.git' [15:53:50] thanks [15:53:58] <_joe_> since 12:16 of today [15:54:06] _joe_: can you clone https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/services/chromium-render ? [15:54:19] maybe the error is happening for this repo only? [15:54:27] i get this [15:54:27] fatal: unable to update url base from redirection: [15:55:09] clones over ssh with the correct url [15:55:11] <_joe_> bmansurov: so I tried https and it works [15:55:35] and http [15:55:43] try git clone https://gerrit.wikimedia.org/r/p/mediawiki/services/chromium-render instead [15:55:44] <_joe_> ssh too [15:55:54] the URL is different with /admin/projects/ [15:56:02] <_joe_> mutante: I don't think that was the URL bmansurov was using [15:56:05] <_joe_> he's using ssh [15:56:47] <_joe_> he was pointing y'all to the project page :) [15:56:52] _joe_: https works for me too, ssh isn't working [15:56:55] i see, ok [15:57:02] <_joe_> bmansurov: uhm lemme check something [15:57:30] <_joe_> bmansurov: can you try again? [15:57:39] ok [15:57:55] <_joe_> I just want to look at the logs while you try [15:58:28] _joe_: stuck again [15:59:05] <_joe_> bmansurov: uhm I'm not even sure you made it to gerrit [15:59:15] <_joe_> did you change your ssh config today by any chance? [15:59:22] the ssh logs for gerrit is under /var/lib/gerrit2/review_site/logs/ssh* [15:59:49] sshd_log [16:00:03] <_joe_> paladox: thanks, just found them [16:00:04] /var/lib/gerrit2/review_site/logs/sshd_log [16:00:06] godog, moritzm, and _joe_: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet SWAT(Max 8 patches) . 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171017T1600). [16:00:06] Amir1 and Addshore: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:07] _joe_: not today [16:00:08] your welcome :) [16:00:09] <_joe_> why that dam location? [16:00:11] o/ [16:00:35] _joe_: one of my team mates also had the same problem [16:00:50] _joe_: we don't share the same internet or ssh configs [16:00:52] <_joe_> bmansurov: you don't get past the login [16:00:55] _joe_ it's the path upstream choose [16:01:08] <_joe_> paladox: that's a very bad path :P [16:01:22] <_joe_> paladox: but I'm sure you agree [16:01:28] it's because gerrit uses an internal ssh lib. [16:01:32] yep, can be confusing :) [16:01:37] yea, i wish that was in /var/log , but we also have a pending patch to move those to logstash, right paladox [16:01:43] <_joe_> bmansurov: ok tbh there is nothing in the logs that really tells me what's wrong [16:01:58] yep, though i carn't seem to get logstash working with gerrit at the moment. [16:02:27] <_joe_> at this point it could be anything. I really don't know [16:03:03] hmm, yea, just LOGIN followed by LOGOUT though [16:03:31] <_joe_> bmansurov: not being able to reproduce your issue is a bit of a problem, can you paste for me on phabricator of a git clone with GIT_TRACE=1 ? [16:03:54] _joe_: [16:03:55] ok [16:03:55] <_joe_> bmansurov: when was the last time you were able to clone from gerrit via ssh? [16:04:16] _joe_: I think Friday, I didn't try yesterday [16:04:28] (03CR) 10Ayounsi: [C: 032] "Some related articles:" [puppet] - 10https://gerrit.wikimedia.org/r/384526 (owner: 10BBlack) [16:04:30] <_joe_> did you try to go via a vpn? [16:05:20] _joe_: no I haven't done that yet. Should I? 
[16:05:23] _joe_: https://phabricator.wikimedia.org/P6138 [16:06:15] <_joe_> bmansurov: just to rule out network issues [16:06:31] the only Gerrit puppet changes since then i know of were no-op on cobalt [16:07:00] <_joe_> so... lemme try one thing [16:07:13] oh, look, bmansurov - AUTH FAILURE FROM [16:07:18] "no-matching-key" [16:07:32] but _also_ LOGOUTs right before that.. odd [16:08:15] oh, I did ssh-add id_rsa today [16:08:19] maybe that's the reason? [16:08:23] i notice that the "auth failure from" is followed by a v6 IP [16:08:26] <_joe_> mutante: that was me [16:08:28] (03PS7) 10Rush: base/icinga: if mysql is in labtest never send pages [puppet] - 10https://gerrit.wikimedia.org/r/384183 (https://phabricator.wikimedia.org/T178008) (owner: 10Dzahn) [16:08:31] <_joe_> the error [16:08:31] oh.. [16:08:44] <_joe_> I tried to check if gerrit was stuck searching for his user :P [16:08:50] heh, nod [16:09:04] (03CR) 10Rush: [C: 031] "I actually didn't see this part before:" [puppet] - 10https://gerrit.wikimedia.org/r/384183 (https://phabricator.wikimedia.org/T178008) (owner: 10Dzahn) [16:09:24] _joe_: mutante my teammates are able to clone, so the issue is on my end it seems. [16:09:24] (03PS8) 10Rush: base/icinga: if mysql is in labtest never send pages [puppet] - 10https://gerrit.wikimedia.org/r/384183 (https://phabricator.wikimedia.org/T178008) (owner: 10Dzahn) [16:09:39] I'll see if I can fix it myself, I'll ping you guys if I cannot. [16:09:52] <_joe_> bmansurov: try with a vpn first [16:09:56] <_joe_> and let us know, yes [16:10:11] (03CR) 10Rush: [C: 032] base/icinga: if mysql is in labtest never send pages [puppet] - 10https://gerrit.wikimedia.org/r/384183 (https://phabricator.wikimedia.org/T178008) (owner: 10Dzahn) [16:10:12] _joe_: ok, I'm in a meeting right now, and I'd like to not lose my connection. [16:10:18] I'll try after the meeting. [16:10:33] <_joe_> bmansurov: yeah do it later, but ping mutante then, I'm going off in a few [16:10:45] ok I will. 
thanks for helping. [16:11:53] (03CR) 10Volans: "See comments inline, I've done only a very quick pass over the tests." (0326 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276) (owner: 10Giuseppe Lavagetto) [16:15:33] 10Operations, 10Graphite: Add cobalt to grafana dashboard - https://phabricator.wikimedia.org/T178401#3691048 (10Paladox) [16:15:55] 10Operations, 10Ops-Access-Requests, 10DBA, 10cloud-services-team (Kanban): Access to raw database tables on labsdb* for wmcs-admin users - https://phabricator.wikimedia.org/T178128#3691050 (10chasemp) p:05Triage>03Normal a:03madhuvishy @madhuvishy is going to take a tour here and document from our e... [16:15:58] 10Operations, 10monitoring, 10Graphite: Add cobalt to grafana dashboard - https://phabricator.wikimedia.org/T178401#3691053 (10Aklapper) [16:16:00] (03PS2) 10Rush: openstack: remove ganglia disk stats [puppet] - 10https://gerrit.wikimedia.org/r/382917 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [16:16:04] 10Operations, 10monitoring, 10Graphite: Add cobalt to grafana dashboard - https://phabricator.wikimedia.org/T178401#3691055 (10Paladox) [16:16:11] 10Operations, 10monitoring, 10Patch-For-Review: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225#3691054 (10Paladox) [16:16:34] PROBLEM - MariaDB disk space on db1102 is CRITICAL: DISK CRITICAL - free space: /srv 224920 MB (5% inode=99%) [16:16:48] checking that, could be the alters [16:17:15] (03CR) 10Rush: [C: 032] openstack: remove ganglia disk stats [puppet] - 10https://gerrit.wikimedia.org/r/382917 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [16:18:09] :) thanks [16:18:19] 10Operations, 10monitoring, 10Graphite: Add cobalt to grafana dashboard - https://phabricator.wikimedia.org/T178401#3691060 (10Paladox) [16:18:21] 10Operations, 10monitoring, 10Patch-For-Review: Uninstall ganglia from the fleet - 
https://phabricator.wikimedia.org/T177225#3691059 (10Paladox) [16:18:24] (03CR) 10Rush: [C: 032] openstack: remove unused role/labs/dns [puppet] - 10https://gerrit.wikimedia.org/r/384712 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [16:18:31] 10Operations, 10monitoring, 10Graphite: Add cobalt to grafana dashboard - https://phabricator.wikimedia.org/T178401#3691024 (10Paladox) [16:18:31] (03PS3) 10Rush: openstack: remove unused role/labs/dns [puppet] - 10https://gerrit.wikimedia.org/r/384712 (https://phabricator.wikimedia.org/T171494) [16:18:33] 10Operations, 10Patch-For-Review, 10Prometheus-metrics-monitoring: Port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#3691062 (10Paladox) [16:25:08] (03PS34) 10Paladox: Gerrit: Use systemd::service for systemd [puppet] - 10https://gerrit.wikimedia.org/r/378768 (https://phabricator.wikimedia.org/T157414) [16:26:35] PROBLEM - Router interfaces on cr1-eqdfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.198 for 1.3.6.1.2.1.2.2.1.3 with snmp version 2 [16:26:36] btw. No one is doing puppet SWAT? 
[16:27:25] RECOVERY - Router interfaces on cr1-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [16:30:58] jouncebot: next [16:30:58] In 0 hour(s) and 29 minute(s): Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171017T1700) [16:31:03] jouncebot: now [16:31:03] For the next 0 hour(s) and 28 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171017T1600) [16:32:25] !log gehel@tin Started deploy [tilerator/deploy@f3b26f3]: fix tileshell not exiting - T177389 [16:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:34] T177389: tileshell does not exit - https://phabricator.wikimedia.org/T177389 [16:32:45] (03PS5) 10Dzahn: ci: provide bare copy of puppet.git on Docker slaves [puppet] - 10https://gerrit.wikimedia.org/r/383843 (https://phabricator.wikimedia.org/T178076) (owner: 10Hashar) [16:33:04] I can merge the wikidata.pp one [16:33:09] but it is wrong [16:33:26] it should change to ensure => absent [16:33:44] (03CR) 10Dzahn: [C: 032] ci: provide bare copy of puppet.git on Docker slaves [puppet] - 10https://gerrit.wikimedia.org/r/383843 (https://phabricator.wikimedia.org/T178076) (owner: 10Hashar) [16:33:44] and later taken out, otherwise the cron will be there forever [16:34:07] !log gehel@tin Finished deploy [tilerator/deploy@f3b26f3]: fix tileshell not exiting - T177389 (duration: 01m 42s) [16:34:11] (03CR) 10Dzahn: "labs-only" [puppet] - 10https://gerrit.wikimedia.org/r/383843 (https://phabricator.wikimedia.org/T178076) (owner: 10Hashar) [16:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:15] Amir1: does it make sense? 
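Note on the "ensure => absent" remark at 16:33:26: the point is that simply deleting a cron resource from the manifest stops Puppet managing it but leaves the entry on the host, whereas declaring it absent removes it. A toy Python model of that behaviour follows. This is not Puppet code. It assumes unmanaged crons are not purged, and the names `apply_catalog` and `rebuild-job` are hypothetical.

```python
def apply_catalog(host_crons, catalog):
    """Return the cron names left on a host after one agent run.

    host_crons: set of cron names currently installed on the host.
    catalog:    dict mapping cron name -> "present" or "absent".
                Names missing from the catalog are unmanaged and untouched.
    """
    result = set(host_crons)
    for name, ensure in catalog.items():
        if ensure == "present":
            result.add(name)
        elif ensure == "absent":
            result.discard(name)
    return result


host = {"rebuild-job"}

# Resource simply deleted from the manifest: the cron lingers.
lingering = apply_catalog(host, {})

# Resource kept but declared absent: the cron is removed.
cleaned = apply_catalog(host, {"rebuild-job": "absent"})

# Second patch drops the resource entirely: nothing left to clean up.
final = apply_catalog(cleaned, {})
```

This mirrors the two-step plan in the log: merge the "absent" change, run puppet on the maintenance hosts, and only then remove the code in a follow-up patch.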
[16:34:29] I can merge that outside of swat, so no worries to do it now [16:34:44] (03PS2) 10Dzahn: ci: add mediawiki core and vendor to gitcache for docker slaves [puppet] - 10https://gerrit.wikimedia.org/r/384570 (https://phabricator.wikimedia.org/T178076) (owner: 10Addshore) [16:34:53] o/ [16:35:06] !log restart tilerator / tileratorui on all maps cluster after deployment of latest fix - T177389 [16:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:08] jynus: I will fix it right now [16:36:36] addshore: hey, so the directories really end in ".git "? [16:36:38] thanks for telling me, my knowledge about puppet is as good as a toddler's [16:36:44] addshore: are you on labs to confirm that change? [16:36:47] mutante: yup [16:36:47] Amir1: as I said, no need to do it now if you do not want [16:36:52] mutante: yup [16:37:00] addshore: cool, merging the second one now [16:37:00] we can do it on my morning, but as you wish [16:37:17] (03CR) 10Dzahn: [C: 032] ci: add mediawiki core and vendor to gitcache for docker slaves [puppet] - 10https://gerrit.wikimedia.org/r/384570 (https://phabricator.wikimedia.org/T178076) (owner: 10Addshore) [16:39:28] Amir1: I can also merge now maintain-views.yaml [16:39:32] no blockers [16:39:49] but it needs a heads up to cloud to run the upgrade script [16:40:09] (if such a thing exists for deleting views) [16:40:13] jynus: yeah, I'm planning to do it after it's merged if that's okay [16:40:25] you mean pinging them, right? [16:40:28] yeah [16:40:29] (03PS3) 10Ladsgroup: Do not rebuild wb_entity_per_page [puppet] - 10https://gerrit.wikimedia.org/r/352574 (https://phabricator.wikimedia.org/T140890) [16:40:45] I just updated the patch for cronjob [16:41:18] did addshore / mutante finished with theirs? [16:42:37] yes, finished [16:42:40] coll [16:42:42] cool [16:42:44] i did th 2 CI ones [16:42:46] will do these 2 [16:42:49] from Amir [16:42:51] leaving the others for you. 
and thanks :) [16:43:15] (03CR) 10Jcrespo: [C: 032] Do not rebuild wb_entity_per_page [puppet] - 10https://gerrit.wikimedia.org/r/352574 (https://phabricator.wikimedia.org/T140890) (owner: 10Ladsgroup) [16:43:44] Amir1: so we will deploy this, run puppet on the maintenance hosts (codfw one should be noop) [16:43:55] and then remove the code on a second patch [16:44:10] I will do that in the mean time [16:44:13] otherwise, the puppet will go away, but the cron would be still active [16:44:33] that is something nobody likes and it is confusing [16:46:24] Notice: /Stage[main]/Mediawiki::Maintenance::Wikidata/Cron[wikibase-rebuild-entityperpage]/ensure: removed [16:46:29] on terbium [16:46:37] should be noop on wasat [16:46:40] (03PS1) 10Ladsgroup: mediawiki: Remove the cronjob for rebuilding entity_per_page table [puppet] - 10https://gerrit.wikimedia.org/r/384723 (https://phabricator.wikimedia.org/T140890) [16:47:03] jynus: https://gerrit.wikimedia.org/r/#/c/384723/1 [16:47:13] Thanks [16:47:29] was there more maintenace hosts patches recently added? [16:47:41] I got a bunch of changes on wasat [16:47:55] leaving some dirs behind [16:48:03] mutante: those patches went fine btw, confirmed on one of the labs hosts they run on! Thanks! 
[16:49:06] addshore: cool:) [16:49:53] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.2 [keeping static files] (duration: 01m 24s) [16:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:12] Amir1: ok they are old things from when codfw was active [16:51:15] not an issue [16:51:24] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Zayo fibercut [16:51:24] ACKNOWLEDGEMENT - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Zayo fibercut [16:51:26] I just wanted to understand where they came from [16:51:48] (03CR) 10Jcrespo: [C: 032] mediawiki: Remove the cronjob for rebuilding entity_per_page table [puppet] - 10https://gerrit.wikimedia.org/r/384723 (https://phabricator.wikimedia.org/T140890) (owner: 10Ladsgroup) [16:52:32] Thanks :) [16:52:59] deploying now on terbium [16:53:50] (03PS3) 10Jcrespo: labs: do not replicate wb_entity_per_page table [puppet] - 10https://gerrit.wikimedia.org/r/382694 (https://phabricator.wikimedia.org/T95685) (owner: 10Ladsgroup) [16:53:53] noop [16:53:57] so everthing good [16:54:25] I am rebasing the last one [16:56:27] jouncebot: next [16:56:27] In 0 hour(s) and 3 minute(s): Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171017T1700) [16:56:43] hey, hey, do not put me pressure [16:57:03] haha, i was just checking for the next thing to see if we can squeeze in a change to gerrit itself :) [16:57:08] I lost it [16:57:09] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for cwdent - https://phabricator.wikimedia.org/T178406#3691188 (10cwdent) [16:57:22] too many tabs open on console and browser [16:57:28] I will search it up here [16:57:53] apparently I merged it already [16:58:22] ah no [16:58:59] 
(03CR) 10Jcrespo: [C: 032] labs: do not replicate wb_entity_per_page table [puppet] - 10https://gerrit.wikimedia.org/r/382694 (https://phabricator.wikimedia.org/T95685) (owner: 10Ladsgroup) [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Your horoscope predicts another unfortunate Services – Graphoid / Parsoid / OCG / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171017T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:37] Amir1: the yaml has been updated, but the execution is not automatic [17:00:45] at least, not for now [17:00:55] yeah, I will contact cloud team [17:00:56] it needs attention to deploy it to the wikireplicas [17:01:08] we also have to delete it [17:01:19] normally deletions do not have high priority for us [17:01:31] specially for public data [17:01:42] so they will come, eventually [17:01:49] I hope it frees up some space for wikidatawiki [17:01:56] is it large? 
[17:02:16] 40M rows, columns are [17:02:21] not big but still [17:02:34] 1GB, not too large [17:11:10] no parsoid deploy today [17:17:30] (03PS8) 10Rush: Add role::labs::libraryupgrader puppet configuration [puppet] - 10https://gerrit.wikimedia.org/r/372213 (https://phabricator.wikimedia.org/T173478) (owner: 10Legoktm) [17:17:39] (03CR) 10Muehlenhoff: [C: 031] admins: add missing wmde LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/384584 (owner: 10Dzahn) [17:17:56] (03CR) 10jerkins-bot: [V: 04-1] Add role::labs::libraryupgrader puppet configuration [puppet] - 10https://gerrit.wikimedia.org/r/372213 (https://phabricator.wikimedia.org/T173478) (owner: 10Legoktm) [17:18:45] 10Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3691322 (10faidon) [17:18:56] (03PS1) 10Chad: group0 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384727 [17:18:58] (03CR) 10Chad: [C: 04-2] group0 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384727 (owner: 10Chad) [17:23:25] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack and setup db1107 and db1108 - https://phabricator.wikimedia.org/T177405#3691345 (10jcrespo) One of you can go ahead and apply the same roles than its replacements (IMPORTANT: with notifications disabled on hiera), once they are ready, we DBAs can copy trans... [17:23:31] !log demon@tin Started scap: bootstrap wmf.4 [17:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:45] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, and 2 others: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable - https://phabricator.wikimedia.org/T176335#3691350 (10debt) This is fairly tricky and no obvious solution right now. We might want to wait for t... 
[17:26:04] (03PS7) 10Dzahn: admins: add missing wmde LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/384584 [17:26:55] (03CR) 10Dzahn: [C: 032] admins: add missing wmde LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/384584 (owner: 10Dzahn) [17:31:36] (03CR) 10Zoranzoki21: [C: 031] "Looks good. Too rebase this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383378 (https://phabricator.wikimedia.org/T176469) (owner: 10Jdlrobson) [17:32:21] (03CR) 10Zoranzoki21: [C: 031] group0 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384727 (owner: 10Chad) [17:36:23] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch: Use SSL certificates with discovery entry for elasticsearch - https://phabricator.wikimedia.org/T162037#3691375 (10debt) [17:36:32] (03CR) 10Chad: [C: 04-2] "Why would you put a +1 on this when I did a -2?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384727 (owner: 10Chad) [17:38:00] 10Operations, 10Discovery, 10Discovery-Analysis, 10Discovery-Search (Current work): Upload shiny-server .deb to our Jessie apt repository - https://phabricator.wikimedia.org/T168967#3691411 (10debt) p:05High>03Normal @mpopov - is this still needed? [17:38:14] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on naos is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging [17:38:49] ^ ran something with sudo ? [17:38:51] Probably transient while scap is running ^ [17:38:55] gotcha [17:39:05] rsync, hasn't finished? [17:39:18] (guessing, we'll know in a minute or so) [17:48:07] jouncebot refresh [17:48:11] I refreshed my knowledge about deployments. [17:48:12] jouncebot: now [17:48:12] For the next 0 hour(s) and 11 minute(s): Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171017T1700) [17:48:14] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on naos is OK: Files ownership is ok. 
[17:50:57] mutante: Yeah, so my guess is just bad timing on that check ^ [17:51:08] It caught me mid co-master-sync in a scap [17:51:24] no_justification: ok, makes sense if that happens during sync .. i guess [17:52:27] Is there a way to say "only alert IRC after the check fails N times"? [17:52:45] Like, one failure is easily transient. More than 2 or 3 is a problem [17:53:34] (03PS35) 10Paladox: Gerrit: Use systemd::service for systemd [puppet] - 10https://gerrit.wikimedia.org/r/378768 (https://phabricator.wikimedia.org/T157414) [17:55:00] no_justification: yes, there is a setting for this in icinga/nagios [17:55:18] That'd be ideal. "Only alert if failed >= 2?" [17:56:37] most of our notifications are set up like that [17:57:40] Yeah, let's swap the root-owned-files-on-deploy-master one to do the same [17:58:06] that is SOFT vs HARD state [17:58:09] "When a service or host check results in a non-OK or non-UP state and the service check has not yet been (re)checked the number of times specified by the max_check_attempts directive in the service or host definition. This is called a soft error. " [17:58:32] and then we have to say what to do in each state [17:59:18] the table at the bottom of https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/statetypes.html ... [17:59:49] so notifications are sent after the 3rd check, by default [18:00:13] and then there is another one for the interval between the checks [18:04:30] 10Operations, 10ops-ulsfo: Multiple systems in ulsfo 1.22 showing PSU failures - https://phabricator.wikimedia.org/T177622#3691573 (10RobH) [18:04:33] 10Operations, 10ops-ulsfo: Check cp4026 power supply redundancy - https://phabricator.wikimedia.org/T178085#3691569 (10RobH) 05Open>03Resolved p:05Triage>03Normal a:03RobH The power plug on the PDU tower side was slightly unseated. The pdu towers provided by unitedlayer lack any kind of anchoring cl... 
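Note on the SOFT vs HARD exchange at 17:58: the idea is that a failing check stays in a soft state, with no notification, until it has failed `max_check_attempts` times in a row, at which point it becomes a hard problem and a notification goes out. A simplified Python model follows. It is a sketch only: it ignores re-notification intervals and retry timing, and the class and function names are illustrative rather than anything from Icinga or Nagios.

```python
from dataclasses import dataclass


@dataclass
class CheckState:
    max_check_attempts: int = 3
    attempt: int = 0            # consecutive non-OK results seen so far
    hard_problem: bool = False  # True once the failure became a hard state

    def record(self, ok):
        """Feed one check result; return 'notify', 'recover' or 'quiet'."""
        if ok:
            was_hard = self.hard_problem
            self.attempt = 0
            self.hard_problem = False
            # Recovering from a soft error is silent; only hard recoveries alert.
            return "recover" if was_hard else "quiet"
        if self.hard_problem:
            return "quiet"  # already notified; re-notification not modelled
        self.attempt += 1
        if self.attempt >= self.max_check_attempts:
            self.hard_problem = True
            return "notify"
        return "quiet"  # still a soft error


def simulate(results, max_check_attempts=3):
    """Run a sequence of check results (True = OK) through the model."""
    state = CheckState(max_check_attempts=max_check_attempts)
    return [state.record(ok) for ok in results]
```

Under this model a single transient failure, like the ownership check catching scap mid-sync, yields no channel message, while three consecutive failures do.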
[18:08:40] no_justification: for the ownership alerts I wonder if this has to do with scap-master-sync's use of rsync --delay-updates creating that .~tmp~ directory. Maybe there's an option to move that somewhere else. [18:09:08] That's what I'm guessing is at fault [18:10:36] --partial-dir seems like it's the option we need, but the manpage has lots of caveats [18:10:54] yeah, I was just reading that. [18:11:33] It conflicts with lots of options :p [18:11:50] _joe_, mutante, fyi, I got my gerrit issue sorted out. It didn't work with VPN, but after changing some config related to gnome keyring, gerrit is working for me again. [18:12:13] !log demon@tin Finished scap: bootstrap wmf.4 (duration: 48m 41s) [18:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:34] RECOVERY - IPMI Sensor Status on cp4026 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [18:14:09] !log dist-upgrading sodium to stretch [18:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:36] warning: apt updates (and thus puppet runs) may fail for a bit [18:16:45] PROBLEM - DPKG on sodium is CRITICAL: Return code of 255 is out of bounds [18:16:46] bmansurov: wow, glad to hear. now i though it's related to the number of objects, but that's cool [18:17:05] PROBLEM - puppet last run on sodium is CRITICAL: Return code of 255 is out of bounds [18:17:23] ;) [18:19:45] RECOVERY - DPKG on sodium is OK: All packages OK [18:20:05] <_joe_> bmansurov: oh, cool! 
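Note on the ownership alert and the `.~tmp~` guess at 18:08:40: the alert fires when files owned by 0:0 appear under the staging directory, and the suspicion is that the holding directory created by `rsync --delay-updates` trips it during a sync. The Python sketch below shows what such an ownership scan might look like, with an option to skip `.~tmp~` directories. Skipping the directory in the check is a different mitigation from the `--partial-dir` idea raised in the channel and is shown only as an assumption for illustration. The function name and parameters are hypothetical and this is not the actual check script.

```python
import os


def improperly_owned(root, uid=0, gid=0, skip_dirs=(".~tmp~",)):
    """List files under `root` owned by uid:gid, skipping named directories.

    skip_dirs defaults to rsync's --delay-updates holding directory so that
    files staged mid-sync are not reported.
    """
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: report the link itself, not its target
            if st.st_uid == uid and st.st_gid == gid:
                hits.append(path)
    return sorted(hits)
```

Passing `skip_dirs=()` restores the stricter behaviour of reporting every matching file.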
[18:22:04] RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [18:22:55] !log rebooting sodium for kernel/distro upgrade [18:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:24] PROBLEM - HHVM rendering on mw2141 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:24:15] RECOVERY - HHVM rendering on mw2141 is OK: HTTP OK: HTTP/1.1 200 OK - 75014 bytes in 0.271 second response time [18:24:45] PROBLEM - Host sodium is DOWN: PING CRITICAL - Packet loss = 100% [18:25:24] RECOVERY - Host sodium is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [18:40:12] 10Operations, 10ops-ulsfo, 10Traffic: rack/setup/install cp40(29|3[123]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3691718 (10RobH) [18:40:33] 10Operations, 10ops-ulsfo, 10Traffic: rack/setup/install cp40(29|3[012]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3691739 (10RobH) [18:44:20] 10Operations, 10ops-ulsfo, 10Traffic, 10hardware-requests, 10Patch-For-Review: Decom cp4005-8,13-16 (8 nodes) - https://phabricator.wikimedia.org/T176366#3691743 (10RobH) All of these hosts have now been wiped and bios/drac reset. Stalling this until we're ready to sell off the entire batch of old ulsfo... [18:44:24] no_justification i wonder should we do that gerrit change for systemd now to allow it to sit please? 
:) [18:44:32] 10Operations, 10ops-ulsfo, 10Traffic, 10hardware-requests, 10Patch-For-Review: Decom cp4005-8,13-16 (8 nodes) - https://phabricator.wikimedia.org/T176366#3691744 (10RobH) [18:44:48] paladox: mutante and I wanna get it merged [18:45:13] :) [18:46:45] is here :) [18:47:45] :) [18:50:41] (03CR) 10Dzahn: [C: 032] Gerrit: Use systemd::service for systemd [puppet] - 10https://gerrit.wikimedia.org/r/378768 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [18:50:47] :) [18:51:01] you have to stop gerrit before applying ^^ [18:51:14] to allow gerrit to remove the pid [18:51:17] from logs/ [18:51:53] i am disabling puppet on cobalt [18:52:12] Did we want to start with gerrit2001? [18:52:13] ok thanks :) [18:52:18] Oh yeah, disable on cobalt first [18:52:19] Duh [18:52:23] !log cobalt: temp disable puppet | gerrit2001: stop puppet [18:52:23] * no_justification grabs 2nd coffee [18:52:28] eh, sorry [18:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:44] no_justification: yes, i am only touching gerrit2001 first [18:53:24] !log gerrit2001: stopping the gerrit service [18:53:24] Let's make sure to purge the init.d file explicitly too [18:53:28] So we don't accidentally use it [18:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:49] ok! yes [18:54:18] rm: cannot remove '/etc/init.d/gerrit': No such file or directory [18:54:19] :) [18:54:48] :) [18:55:14] Yep, it's still on cobalt but yeah gone from 2001 [18:55:38] and my connection decides to become flaky, of course. 
ok there we are again [18:56:05] so,i confirmed there is no more java process on gerrit2001 [18:56:11] where would the pid file be again [18:56:22] in logs/ [18:56:25] review_site/logs [18:56:30] lol :) [18:56:33] /var/lib/gerrit2/review_site/logs/ [18:56:44] gerrit_* [18:56:49] ok, it's empty [18:57:08] so it's stopped :) [18:57:10] i will enable puppet again and run it, still just gerrit2001 [18:57:18] ok :) [18:57:22] submits change on puppetmaster [18:57:30] :) [18:57:54] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] [18:58:09] .. [18:58:14] :( [18:59:19] changed applied on gerrit2001, nrpe command gets changed as expected.. letsencrypt fails [18:59:39] oh is letsencrypt failure expected? [19:00:04] no_justification: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171017T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:00:07] yes and no. it has nothing to do with the systemd change at all [19:00:19] we have the same issue on another host [19:00:24] yeh [19:00:27] but that doesnt mean i expected it _right now_ here as well [19:00:35] Fatal error: Stack overflow in /srv/mediawiki/php-1.31.0-wmf.3/vendor/wikimedia/remex-html/RemexHtml/Serializer/Serializer.php on line 274 [19:00:47] Not sure if related to any current stuff ^ [19:00:55] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [19:02:00] paladox: so.. 
i ran puppet and systemd status is Active (running) [19:02:06] oh [19:02:07] i did nothing manual besides puppet [19:02:11] yeh [19:02:19] reboot will start gerrit too heh :) [19:02:35] no_justification: should we try the reboot ?:) [19:02:53] * mutante tries starting/stopping with systemctl as well [19:03:12] Can't hurt [19:03:56] systemctl stop gerrit, systemctl start gerrit .. works for me [19:04:12] :) [19:04:47] (03CR) 10Aaron Schulz: [C: 031] Enable $wgAbuseFilterProfile on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383858 (https://phabricator.wikimedia.org/T177641) (owner: 10Dmaza) [19:05:31] schedules downtime on icinga to reboot it without spamming channel [19:06:42] !log gerrit2001 - rebooting to verify systemd change and service comes back properly [19:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:39] (03PS1) 10Chad: Do clones of MediaWiki + extensions + skins + vendor to /srv/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/384754 [19:09:06] it worked. service is running after reboot [19:09:23] no_justification: lgtm, we could move on to cobalt i guess [19:09:38] :) [19:10:00] (03PS1) 10Chad: Git::clone tidy up default gerrit URLs [puppet] - 10https://gerrit.wikimedia.org/r/384756 [19:10:04] Hot damn :) [19:10:39] Looks good to me [19:11:00] Guess we still gotta fix the "actually listen" bit [19:11:01] But yay [19:11:17] :) [19:11:21] should i do the restart? anything else we wanted before.. moving the init.d/gerrit to /root [19:11:29] Still gotta fix logging [19:11:39] Er, where do I look instead? [19:11:41] logstash doesnt work , i hear [19:11:50] Logstash stuff hasn't merged yet [19:11:55] And ./logs is empty [19:12:38] yea,. eh. confirmed, no logs.. uhm [19:12:42] paladox: ideas? [19:12:55] Um [19:12:56] journalctl only gives me the "started service" [19:12:57] logs emptiy on gerrit2001? 
[19:12:59] Yeah [19:13:03] ah [19:13:09] it's because gerrit carn't start [19:13:10] could be becuase it's slave ? [19:13:11] due to db issues [19:13:19] that , heh [19:13:22] but the service is running [19:13:27] then it cant connect [19:13:32] see https://phabricator.wikimedia.org/T176532 [19:13:54] syslog is just spamming about apache_status.py/gmond [19:14:06] Oh yeah, that again [19:14:14] Still, no log to tell us this.... [19:14:16] that one will be removed globally soon :) [19:14:16] We gotta fix /that/ [19:14:20] part of killin ganglia [19:14:46] yea, we should have something that gives us the DB connect error [19:14:56] and doesnt just look like active (running) and we dont notice [19:14:57] (03CR) 10Chad: [C: 032] group0 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384727 (owner: 10Chad) [19:15:45] paladox: where would we see the DB error [19:15:51] you saw it somewhere [19:15:52] in the logs [19:16:03] though it seems it is not logging again [19:16:03] in the empty log dir? [19:16:06] same issue as last time [19:16:09] yeh [19:16:13] it's kind of meta :) [19:16:36] we must have seen it somewhere last time, heh [19:16:38] i think it's because gerrit did not setup properly there. But it's a slave so i am not sure if it logs. [19:16:41] yeh [19:17:00] We had the same problems with logs before on gerrit2001 [19:17:09] yea, things are different "if slave", right [19:17:12] which led us to finding a db problem. [19:17:14] yeh [19:17:21] did you try on labs to switch "is slave" thing around [19:17:26] and see an effect on logging [19:17:29] oh [19:17:32] haven't try that [19:17:47] i can switch slaves on [19:17:50] drwxr-xr-x 2 root root 4096 Sep 26 18:18 . [19:17:53] That might explain it [19:17:54] .... [19:18:13] Swapped to gerrit2:gerrit2 [19:18:23] Anyway, start service again? 
[19:18:35] ah :) [19:18:52] issued restart [19:19:03] ah [19:19:12] you know what's also fun, the great error message when you forget sudo with systemctl :) [19:19:23] lol [19:19:32] it doesn't say "you are not allowed" or something.. no... it says " The name org.freedesktop.PolicyKit1 was not provided by any .service files" :) [19:19:53] lol [19:20:08] Ok, I see java entry, but still no logs (or PID even) [19:20:25] mutante: Kill the service, if you would [19:20:28] * no_justification wants to try something [19:20:35] it's feature that the PID file isnt used anymore, afacit [19:21:02] no_justification: done, killed [19:21:23] (03PS5) 10Madhuvishy: [WIP]ssh-key-ldap-lookup: Deny user authorization if /etc/ssh-nologin is present [puppet] - 10https://gerrit.wikimedia.org/r/384574 (https://phabricator.wikimedia.org/T171508) [19:21:34] I wonder if doing --console-log would spit logs back to systemd [19:21:54] (Also, we were missing --slave flag) [19:22:05] aha [19:22:10] ugh [19:22:13] will fix that [19:22:27] systemd is doing something with gerrit :/ [19:22:35] (03Merged) 10jenkins-bot: group0 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384727 (owner: 10Chad) [19:22:53] no_justification: try adding --console-log to the exact command that systemd uses for ExecStart now? [19:23:17] Maybe. 
But I'm getting same no response from running it by hand too [19:23:26] ExecStart=/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx20g -jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site [19:23:27] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to wmf.4 [19:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:34] (03PS6) 10Madhuvishy: [WIP] ssh-key-ldap-lookup: Deny user authorization if /etc/ssh-nologin is present [puppet] - 10https://gerrit.wikimedia.org/r/384574 (https://phabricator.wikimedia.org/T171508) [19:24:44] (03Draft1) 10Paladox: Gerrit: Insert --slave into systemd script if it is a slave [puppet] - 10https://gerrit.wikimedia.org/r/384758 [19:24:48] (03PS2) 10Paladox: Gerrit: Insert --slave into systemd script if it is a slave [puppet] - 10https://gerrit.wikimedia.org/r/384758 [19:24:55] mutante no_justification ^^ [19:24:55] :) [19:25:09] (03CR) 10jerkins-bot: [V: 04-1] [WIP] ssh-key-ldap-lookup: Deny user authorization if /etc/ssh-nologin is present [puppet] - 10https://gerrit.wikimedia.org/r/384574 (https://phabricator.wikimedia.org/T171508) (owner: 10Madhuvishy) [19:25:11] (03PS3) 10Paladox: Gerrit: Insert --slave into systemd script if it is a slave [puppet] - 10https://gerrit.wikimedia.org/r/384758 [19:26:09] no_justification mutante i think we had that problem before. 
[19:26:17] stuck on T176532 [19:26:18] T176532: Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 [19:26:44] 10Operations, 10ops-ulsfo, 10Traffic: rack/setup/install cp40(29|3[012]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3691925 (10RobH) [19:27:02] yea, it's not a great test since it's different from cobalt [19:27:13] due to codfw [19:27:19] yeh [19:27:25] (03PS7) 10Madhuvishy: [WIP]ssh-key-ldap-lookup: Deny user authorization if /etc/ssh-nologin exists [puppet] - 10https://gerrit.wikimedia.org/r/384574 (https://phabricator.wikimedia.org/T171508) [19:27:35] though it logs for me [19:27:36] :) [19:27:51] stuff like [19:27:51] Oct 17 19:20:57 gerrit-test3 java[3324]: [2017-10-17 19:20:57,453] [main] WARN com.google.gerrit.server.config.AdministrateServerGroupsProvider : Group "ldap/ops" not available, skipping. [19:27:52] Can you test the part where there is no DB on your end? [19:27:58] ok [19:27:58] yeh [19:27:59] paladox: what db are you using? [19:28:00] Like, shut off your DB [19:28:04] gerrit-mysql [19:28:07] i was about to say the same thing:) [19:28:09] shutting it down now [19:28:10] kill the db [19:28:19] (03PS8) 10Madhuvishy: [WIP]ssh-key-ldap-lookup: Deny user authorization if /etc/ssh-nologin exists [puppet] - 10https://gerrit.wikimedia.org/r/384574 (https://phabricator.wikimedia.org/T171508) [19:29:26] meanwhile i compiled your change to the template http://puppet-compiler.wmflabs.org/8357/gerrit2001.wikimedia.org/ [19:29:36] for some reason it touches that other line , diff on cobalt [19:29:59] looks like missing newline in the result.. odd [19:30:20] http://puppet-compiler.wmflabs.org/8357/cobalt.wikimedia.org/ [19:30:51] see how it appends the "KillSignal" line.. .. 
[19:31:46] blames the inline web editor :) [19:33:12] (03PS1) 10Chad: Also use scap-deployed version of gerrit.war for actual running of gerrit [puppet] - 10https://gerrit.wikimedia.org/r/384760 [19:33:30] (03CR) 10jenkins-bot: group0 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384727 (owner: 10Chad) [19:34:01] logging dosen't work if it carn't connect to the port, ie firewall [19:34:03] (03PS4) 10Dzahn: Gerrit: Insert --slave into systemd script if it is a slave [puppet] - 10https://gerrit.wikimedia.org/r/384758 (owner: 10Paladox) [19:34:09] but logs if it find that the port is not available [19:34:15] mutante no_justification ^^ [19:34:22] You mean the mysql firewall? [19:34:27] s/firewall/ [19:34:28] yeh, i tryed both ways [19:34:36] firewall, ie bin-address [19:34:39] and stopping it [19:34:59] I can't believe it can't even start logging without trying the damn DB ping [19:35:05] so like when packets are just dropped and there is no reject [19:35:07] That's batshit [19:35:25] i think it's using some kind of timeout. [19:35:27] maybe [19:35:35] yes @ carn't belive it [19:35:47] -r [19:35:50] it should log but seems it is not if it see the port is available but carn't connect [19:35:51] :) [19:36:38] hmm, it has started to sundely start even though i prevented it with bin-address [19:36:39] [2017-10-17 19:35:36,911] [main] INFO com.google.gerrit.pgm.Daemon : Gerrit Code Review 2.14.4-28-gd57189eb4f read [19:36:51] bin-address ? [19:37:45] bind [19:37:51] :) [19:37:55] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] [19:38:01] i will look up the command to try and block the firewall [19:38:34] paladox: do you have ferm ? 
[19:38:44] ferm seems to be failing [19:38:48] heh [19:39:04] Hah, gerrit has been running on cobalt since Sept26, so I have no logs from last startup [19:39:04] Heh [19:39:05] PROBLEM - puppet last run on sodium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[acme-setup-acme-mirrors] [19:39:32] paladox: go to /etc/ferm/conf.d/ are there files? [19:39:38] ports blocked now [19:39:41] tested with telnet [19:39:45] starting gerrit now [19:40:24] nope, systemd still shows working but now output [19:40:31] so same as gerrit2001 [19:40:49] no output or now output [19:40:55] no output [19:40:59] ok [19:41:05] sorry i made spelling mistake again :) [19:41:10] Ok, so clearly blocking the ports disables gerrit from logging anything. That's a general upstream bug [19:41:30] if the port is available but blocked from incomming traffic, i think gerrit will keep retrying the db though i have no idea about the internals of mysql [19:41:30] I think we can move forward with cobalt tbh [19:41:37] i can filled a bug report if you want? [19:41:45] Let's figure that out later [19:41:49] i have expanded powers to triage it now :) [19:41:58] But anyway, I *think* cobalt will be ok [19:42:03] ok [19:42:09] i think yea to all of that too [19:42:14] yep [19:42:14] make upstream bug, do cobalt [19:42:23] deffitly a bug upstream. 
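Note on the port test above: the log separates a database port that is stopped (connection rejected) from one that is firewalled so packets are silently dropped, and only the second case left gerrit running with no output. A minimal Python sketch of telling those cases apart follows; the function name `probe_port` and the result labels are illustrative assumptions, not something used in the channel.

```python
import socket


def probe_port(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP port as 'open', 'refused', 'filtered' or 'error'.

    'filtered' means nothing answered within `timeout` seconds, which is
    what a client sees when a firewall drops packets instead of rejecting.
    """
    try:
        # create_connection performs the TCP handshake with a deadline.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: nothing is listening on the port.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No answer at all before the deadline.
        return "filtered"
    except OSError:
        # Anything else, e.g. name resolution failure or no route to host.
        return "error"


# Example call (hypothetical host name):
# probe_port("db.example.org", 3306)
```

A client with no deadline of its own just keeps waiting in the 'filtered' case, which fits the "active (running) but no output" behaviour described above.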
[19:42:55] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [19:42:55] It should have some kind of max-retries or timeouts or something [19:43:13] Considering it can barf over bad command line arguments prior to just hanging, it clearly can boot into the basic gerrit framework [19:43:15] yeh [19:43:17] So logging /something/ should be possible [19:44:03] ?connectTimeout=5000&socketTimeout=30000 [19:44:09] https://stackoverflow.com/questions/21351002/how-to-set-a-connection-timeout-on-the-mysql-jdbc-driver [19:45:55] yeh, it should log even if it carn't connect or is waiting for a timeout to be kicked in [19:46:27] brb [19:46:47] I guess it just waits since we set no timeout [19:46:57] Gerrit doesn't know better [19:47:44] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - /var/lib/docker/overlay2/35ca40c8e8fcc59fd40848e1a0c40275d7f2db69a5a57323328ae88010578006/merged is not accessible: Permission denied [19:50:44] !log short gerrit downtime for systemd change coming up [19:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:34] !log cobalt - stopping gerrit service, confirming pid file gets removed, move /etc/init.d/gerrit to /root [19:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:17] now join the chan and yell OUT LOUT that gerrit is down... as usual... :D [19:53:47] Also puppet will start flipping out on hosts that have git::clone usage :) [19:54:18] so, Y E L L E V E N L O U D E R?? :D [19:54:30] !log cobalt - re-enabled puppet, ran puppet, systemd says gerrit is active(running) [19:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:44] RECOVERY - Disk space on contint1001 is OK: DISK OK [19:54:48] works, yup [19:54:51] loading at least [19:54:54] pheeew :) [19:55:00] gerrit looks full back [19:55:18] And we got logging too [19:55:22] yay! 
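Note on the `?connectTimeout=5000&socketTimeout=30000` suggestion above: both are MySQL Connector/J connection properties, given in milliseconds and appended to the JDBC URL as query parameters. The Python sketch below only assembles such a URL; the host and database names are hypothetical placeholders, and whether these values suit the production setup is not settled in the log.

```python
from urllib.parse import urlencode


def jdbc_mysql_url(host: str, database: str, port: int = 3306,
                   connect_timeout_ms: int = 5000,
                   socket_timeout_ms: int = 30000) -> str:
    """Build a MySQL JDBC URL that carries explicit timeouts.

    connectTimeout bounds the initial TCP connect; socketTimeout bounds
    reads on an established connection. Both are in milliseconds.
    """
    params = urlencode({
        "connectTimeout": connect_timeout_ms,
        "socketTimeout": socket_timeout_ms,
    })
    return f"jdbc:mysql://{host}:{port}/{database}?{params}"


# Hypothetical host and database names, for illustration only.
print(jdbc_mysql_url("db.example.org", "reviewdb"))
```

With a connect timeout in place, an unreachable database should surface as a connection error after a few seconds rather than an indefinite wait.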
[19:55:31] So yeah, firewall between gerrit & database makes gerrit go BOOM [19:55:37] especially nice since it wasnt the first attempt : [19:55:52] and now we don't use gerrit.sh at all [19:56:22] gerrit's `init` will keep installing it [19:56:24] But yeah, unused [19:57:07] We can dismantle the debian package now actually [19:57:17] Or maybe, let's let it run for a few days :p [19:57:45] PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [19:57:54] we can dismantle the repo for it [19:58:00] but not delete it from apt [19:58:54] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts] [19:59:04] we still need the --slave in template change for 2001 [19:59:34] i wish it wouldnt be change on cobalt though [20:00:00] If we adjust it we can avoid the trailing space [20:00:04] Put it inside the if [20:00:21] Then cobalt will be a no op [20:01:35] 10Operations, 10Puppet, 10User-Joe: Set up octocatalog-diff on host with access to puppetmasters and puppetdb - https://phabricator.wikimedia.org/T177843#3691966 (10herron) [20:02:30] ah, right, amending [20:03:03] (03PS5) 10Dzahn: Gerrit: Insert --slave into systemd script if it is a slave [puppet] - 10https://gerrit.wikimedia.org/r/384758 (owner: 10Paladox) [20:03:24] the contint1001 and kafka2003 issues are indirect due to the short downtime [20:03:33] expects recoveries in a minute [20:04:08] Yeah [20:04:23] (03PS1) 10Herron: puppetdb: allow puppetcompiler1001 to reach puppetdb nginx frontend [puppet] - 10https://gerrit.wikimedia.org/r/384762 (https://phabricator.wikimedia.org/T177843) [20:04:32] (03CR) 10Chad: [C: 031] Gerrit: Insert --slave into systemd script if it is a slave [puppet] - 10https://gerrit.wikimedia.org/r/384758 (owner: 
Paladox) [20:04:47] Operations, Puppet, Patch-For-Review, User-Joe: Set up octocatalog-diff on host with access to puppetmasters and puppetdb - https://phabricator.wikimedia.org/T177843#3691992 (herron) [20:05:23] it's still like http://puppet-compiler.wmflabs.org/8359/cobalt.wikimedia.org/ ..eh [20:05:42] Operations, Puppet, Patch-For-Review, User-Joe: Set up octocatalog-diff on host with access to puppetmasters and puppetdb - https://phabricator.wikimedia.org/T177843#3672181 (herron) [20:05:53] Oh, duh. Remove the - from the closing end [20:06:00] We don't want to strip that whitespace [20:07:15] back [20:08:23] (PS6) Dzahn: Gerrit: Insert --slave into systemd script if it is a slave [puppet] - https://gerrit.wikimedia.org/r/384758 (owner: Paladox) [20:08:37] the "-" to strip whitespace.. ah [20:08:48] wait, both of them? [20:08:55] thanks [20:08:58] should we do it to if [20:09:00] too [20:09:00] ? [20:09:25] no [20:09:30] that's within the line [20:09:53] (CR) Chad: [C: -1] "Don't these need to go into lfs.config? We'll also want to set storage.backend," (2 comments) [puppet] - https://gerrit.wikimedia.org/r/384596 (https://phabricator.wikimedia.org/T171758) (owner: Paladox) [20:10:15] (CR) Chad: [C: -1] "https://gerrit.googlesource.com/plugins/lfs/+/stable-2.13/src/main/resources/Documentation/config.md#global-plugin-settings" [puppet] - https://gerrit.wikimedia.org/r/384596 (https://phabricator.wikimedia.org/T171758) (owner: Paladox) [20:10:45] (CR) Chad: [C: -1] "The part in gerrit.config will be lfs.plugin" [puppet] - https://gerrit.wikimedia.org/r/384596 (https://phabricator.wikimedia.org/T171758) (owner: Paladox) [20:10:51] ah [20:11:21] I had to read closely too. Sorta like replication.config, right?
[20:11:21] thanks [20:11:23] will fix [20:11:24] http://puppet-compiler.wmflabs.org/8360/cobalt.wikimedia.org/ [20:11:27] yeh [20:11:36] mutante: There we go [20:11:42] i didnt look so close like you so i missed it. [20:11:45] except there is still a change :) [20:11:54] but different [20:12:05] well, it makes sense if the template is touched i supose [20:12:21] no_justification also we doint need to set the backend it defaults to filesystem [20:12:22] :) [20:12:36] we just need to set the directory for it [20:12:37] jynus ping [20:12:42] Let's be explicit [20:13:02] ok [20:13:21] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/8360/" [puppet] - 10https://gerrit.wikimedia.org/r/384758 (owner: 10Paladox) [20:14:13] thanks [20:14:13] :) [20:14:41] !log gerrit2001 - re-enabled puppet, attempting restart with --slave after gerrit:384758 [20:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:00] Since it all goes in an unused lfs.config for now, we can land that as soon as we want, then install the plugin, then enable in gerrit.config [20:18:24] done [20:18:26] (03PS4) 10Paladox: Gerrit: Set lfs configuation [puppet] - 10https://gerrit.wikimedia.org/r/384596 (https://phabricator.wikimedia.org/T171758) [20:18:37] yeh [20:19:23] paladox: so yea, i confirmed it is using --slave now [20:19:25] thanks [20:19:27] :) [20:19:29] and for the systemd change [20:19:29] your welcome :) [20:19:33] i think that is it for right now [20:19:37] (03CR) 10Chad: Gerrit: Set lfs configuation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384596 (https://phabricator.wikimedia.org/T171758) (owner: 10Paladox) [20:20:11] thanks :) [20:20:17] (03PS5) 10Paladox: Gerrit: Set lfs configuation [puppet] - 10https://gerrit.wikimedia.org/r/384596 (https://phabricator.wikimedia.org/T171758) [20:20:40] no_justification about the logging, should i file a report or will you? 
:) [20:20:49] (03CR) 10Chad: [C: 031] "lgtm, can land whenever since it's unused yet" [puppet] - 10https://gerrit.wikimedia.org/r/384596 (https://phabricator.wikimedia.org/T171758) (owner: 10Paladox) [20:20:54] RoanKattouw ping [20:20:55] :) [20:21:03] I've gotta run out for a bit, won't get to it today [20:21:29] ok [20:21:48] can we use /srv/gerrit/lfs or does it have to be separate (nitpicking) [20:21:55] i also need lunch [20:21:55] davidwbarratt: what's up? [20:22:20] But yeah basically what we know/assume is this: When trying to start the gerrit daemon, if you're connecting to a DB (and using a raw connection jdbc url and not letting gerrit construct it) when you don't specify a DB timeout you don't get any sort of logging feedback. [20:22:28] RoanKattouw can you adjust the permissions on the folder in your home direcotry? https://phabricator.wikimedia.org/T178313#3690799 [20:22:43] Yes: workaround/solution is to set a timeout. But some sort of output /before/ beginning that connection would be useful [20:22:55] So actually I think it's mostly an enhancement request, not a bug [20:23:01] oh [20:23:06] paladox: keep that stuff in a pastebin ^ we can do it later, heh [20:23:20] davidwbarratt: will do when I'm back at my computer, I'm in line to order lunch right now [20:23:29] RoanKattouw no problem! [20:23:32] https://phabricator.wikimedia.org/P6141 [20:23:42] Relatedly: we should set a timeout on that connection url ;-) [20:23:54] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:23:56] Ok, I'm really afk now [20:23:59] back later folks [20:24:56] me too, need to eat. 
if something breaks with gerrit i can be reached via SMS with number in office wiki [20:25:18] (PS6) Paladox: Gerrit: Set lfs configuation [puppet] - https://gerrit.wikimedia.org/r/384596 (https://phabricator.wikimedia.org/T171758) [20:25:30] (PS1) Faidon Liambotis: mirrors: stop shipping ftpsync, use package instead [puppet] - https://gerrit.wikimedia.org/r/384791 [20:27:22] (PS2) Faidon Liambotis: mirrors: stop shipping ftpsync, use package instead [puppet] - https://gerrit.wikimedia.org/r/384791 [20:27:34] Operations, Wikimedia-Fundraising-Banners, fundraising-tech-ops: alnitak disk space warning - https://phabricator.wikimedia.org/T177854#3692048 (Jgreen) Open>Resolved a:Jgreen We cleaned up and compressed files on the nfs mount, which dropped it back to 67% [20:27:45] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:27:47] (CR) Faidon Liambotis: [C: 2] mirrors: stop shipping ftpsync, use package instead [puppet] - https://gerrit.wikimedia.org/r/384791 (owner: Faidon Liambotis) [20:28:45] Operations, ops-ulsfo, Traffic: rack/setup/install cp40(29|3[012]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3692052 (RobH) I still need to flash firmware on drac/bios before OS install on all of these.
[20:35:07] Operations, Wikimedia-Fundraising-CiviCRM, fundraising-tech-ops: mintaka disk space warning - https://phabricator.wikimedia.org/T177852#3692079 (Jgreen) Open>Resolved [20:35:54] (PS1) Faidon Liambotis: mirrors: use correct path for the config file [puppet] - https://gerrit.wikimedia.org/r/384803 [20:36:39] (CR) Faidon Liambotis: [C: 2] mirrors: use correct path for the config file [puppet] - https://gerrit.wikimedia.org/r/384803 (owner: Faidon Liambotis) [20:36:51] (CR) jerkins-bot: [V: -1] mirrors: use correct path for the config file [puppet] - https://gerrit.wikimedia.org/r/384803 (owner: Faidon Liambotis) [20:37:10] (PS2) Faidon Liambotis: mirrors: use correct path for the config file [puppet] - https://gerrit.wikimedia.org/r/384803 [20:38:33] (CR) Faidon Liambotis: [C: 2] mirrors: use correct path for the config file [puppet] - https://gerrit.wikimedia.org/r/384803 (owner: Faidon Liambotis) [20:43:45] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.142 second response time [20:45:52] madhuvishy: ^ that + an alert for tools home page which seems fine makes me think the checker is having issues could you help me take a look?
[20:46:45] chasemp: yup, looking [20:47:07] well there is a node in notready state actually [20:47:07] tools-worker-1007.tools.eqiad.wmflabs NotReady 1y [20:48:22] andrewbogott: ^ the same k8s issue [20:48:24] - token: faketoken [20:48:24] + token: thoehie3OoD2Eiwoghuhien2 [20:48:27] that's really freaky [20:48:44] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.127 second response time [20:49:14] chasemp: yeah toolschecker itself looks fine, this is the only failing check [20:50:08] (03PS1) 10Herron: puppetmaster: add yaml fact directory to rsyncd on frontends [puppet] - 10https://gerrit.wikimedia.org/r/384834 (https://phabricator.wikimedia.org/T177843) [20:50:32] chasemp, that's not an actual auth token is it? [20:50:41] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: add yaml fact directory to rsyncd on frontends [puppet] - 10https://gerrit.wikimedia.org/r/384834 (https://phabricator.wikimedia.org/T177843) (owner: 10Herron) [20:50:49] shit [20:50:51] :) [20:50:55] (03PS3) 10Jforrester: Remove setting no longer in MediaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366869 [20:51:04] yes and now I'm changing it [20:51:06] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Set up octocatalog-diff on host with access to puppetmasters and puppetdb - https://phabricator.wikimedia.org/T177843#3692112 (10herron) [20:51:06] * Krenair facepalms [20:51:31] I thought it was a labs/private value and not a real private value [20:53:06] it's reverse though, it is sometimes filling in teh labs/private value dummy [20:53:09] (03PS1) 10RobH: production dns for cp40(29|3[012] [dns] - 10https://gerrit.wikimedia.org/r/384841 (https://phabricator.wikimedia.org/T178423) [20:53:22] (03CR) 10jerkins-bot: [V: 04-1] production dns for cp40(29|3[012] [dns] - 10https://gerrit.wikimedia.org/r/384841 (https://phabricator.wikimedia.org/T178423) (owner: 10RobH) [20:53:43] I have seen puppet flapping back 
and forth between values before [20:53:55] never figured out what it was up to [20:54:00] (03PS2) 10RobH: production dns for cp40(29|3[012] [dns] - 10https://gerrit.wikimedia.org/r/384841 (https://phabricator.wikimedia.org/T178423) [20:54:35] (03CR) 10RobH: [C: 032] production dns for cp40(29|3[012] [dns] - 10https://gerrit.wikimedia.org/r/384841 (https://phabricator.wikimedia.org/T178423) (owner: 10RobH) [20:58:09] 10Operations, 10ops-ulsfo, 10Traffic: rack/setup/install lvs400[567].ulsfo.wmnet - https://phabricator.wikimedia.org/T178436#3692125 (10RobH) [21:12:34] (03PS1) 10BBlack: acme-setup: make compatible with openssl 1.1 [puppet] - 10https://gerrit.wikimedia.org/r/384886 [21:15:36] 10Operations, 10ops-ulsfo, 10Traffic: rack/setup/install lvs400[567].ulsfo.wmnet - https://phabricator.wikimedia.org/T178436#3692212 (10RobH) [21:16:55] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 429 (expecting: 200) [21:17:44] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [21:18:31] (03PS1) 10BBlack: acme_tiny: sync with upstream [puppet] - 10https://gerrit.wikimedia.org/r/384888 [21:18:44] (03CR) 10BBlack: [C: 032] acme-setup: make compatible with openssl 1.1 [puppet] - 10https://gerrit.wikimedia.org/r/384886 (owner: 10BBlack) [21:18:55] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [21:19:13] (03CR) 10BBlack: [C: 032] acme_tiny: sync with upstream [puppet] - 10https://gerrit.wikimedia.org/r/384888 (owner: 10BBlack) [21:22:04] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 429 
(expecting: 200) [21:23:04] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [21:23:54] !log cp4026 coming down for memory swap, no point in maint moding the system since other cp checks will alert anyhow [21:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:05] RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:24:55] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:25:04] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 429 (expecting: 200): /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 429 (expecting: 2 [21:25:29] (03PS1) 10Addshore: Add filtertags to ci/slave/labs/docker [puppet] - 10https://gerrit.wikimedia.org/r/384889 [21:25:42] if anyone feels like merging a comment / filtertags update only ^^ [21:27:04] PROBLEM - Host cp4026 is DOWN: PING CRITICAL - Packet loss = 100% [21:28:17] (03CR) 10Andrew Bogott: [C: 032] Add filtertags to ci/slave/labs/docker [puppet] - 10https://gerrit.wikimedia.org/r/384889 (owner: 10Addshore) [21:28:23] (03PS2) 10Andrew Bogott: Add filtertags to ci/slave/labs/docker [puppet] - 10https://gerrit.wikimedia.org/r/384889 (owner: 10Addshore) [21:28:26] thanks andrewbogott :) [21:29:05] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy [21:29:14] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [21:31:14] PROBLEM - IPsec on cp1074 is 
CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4026_v4, cp4026_v6 [21:31:14] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4026_v4, cp4026_v6 [21:31:15] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4026_v4, cp4026_v6 [21:31:15] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4026_v4, cp4026_v6 [21:31:24] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4026_v4, cp4026_v6 [21:31:24] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4026_v4, cp4026_v6 [21:31:24] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4026_v4, cp4026_v6 [21:31:24] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4026_v4, cp4026_v6 [21:31:25] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4026_v4, cp4026_v6 [21:31:34] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4026_v4, cp4026_v6 [21:31:34] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4026_v4, cp4026_v6 [21:31:34] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4026_v4, cp4026_v6 [21:31:35] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4026_v4, cp4026_v6 [21:31:35] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4026_v4, cp4026_v6 [21:31:44] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4026_v4, cp4026_v6 [21:31:45] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4026_v4, cp4026_v6 [21:31:54] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4026_v4, cp4026_v6 [21:31:54] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4026_v4, cp4026_v6 [21:31:55] PROBLEM - IPsec on cp1064 is 
CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4026_v4, cp4026_v6 [21:31:55] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4026_v4, cp4026_v6 [21:31:55] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4026_v4, cp4026_v6 [21:31:55] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4026_v4, cp4026_v6 [21:31:55] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4026_v4, cp4026_v6 [21:32:04] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4026_v4, cp4026_v6 [21:32:04] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4026_v4, cp4026_v6 [21:32:04] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4026_v4, cp4026_v6 [21:32:04] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4026_v4, cp4026_v6 [21:32:34] yeah thats due to cp4026 mem swap [21:32:37] ie: expected. 
[21:32:54] (all the ipsec errors) [21:32:55] PROBLEM - Host cp4026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:34:14] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy [21:35:15] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK [21:35:15] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [21:35:24] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 114 ESP OK [21:35:24] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK [21:35:24] RECOVERY - Host cp4026 is UP: PING OK - Packet loss = 0%, RTA = 78.57 ms [21:35:24] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 114 ESP OK [21:35:24] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK [21:35:25] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK [21:35:34] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 70 ESP OK [21:35:34] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 70 ESP OK [21:35:34] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 70 ESP OK [21:35:34] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK [21:35:35] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 70 ESP OK [21:35:44] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 70 ESP OK [21:35:44] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 70 ESP OK [21:35:45] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK [21:35:54] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK [21:35:55] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 56 ESP OK [21:35:55] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 114 ESP OK [21:35:55] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 70 ESP OK [21:35:55] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 70 ESP OK [21:36:04] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [21:36:04] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 70 ESP OK [21:36:04] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 70 ESP OK [21:36:04] RECOVERY - IPsec on kafka1020 is OK: 
Strongswan OK - 114 ESP OK [21:36:05] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 114 ESP OK [21:36:05] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 114 ESP OK [21:36:14] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [21:38:05] RECOVERY - Host cp4026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.10 ms [21:39:46] !log cp4026 being returned to service post memory replacement. puppet runs fine and all icinga checks are green. repooled. [21:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:15] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 429 (expecting: 200) [21:41:14] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy [21:42:15] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 429 (expecting: 200) [21:42:45] the icinga check is getting ratelimited? [21:44:15] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy [21:44:53] Operations, ORES, Patch-For-Review, Scoring-platform-team (Current), User-Ladsgroup: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3692333 (Halfak) I just figured out how to upgrade to celery 4.1.0. See T178441. Once that pa... 
[21:45:10] Operations, ORES, Patch-For-Review, Scoring-platform-team (Current), User-Ladsgroup: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3692336 (Halfak) [21:53:14] (PS1) Dzahn: openstack2: no Icinga paging (SMS) if on labtest [puppet] - https://gerrit.wikimedia.org/r/384892 (https://phabricator.wikimedia.org/T178008) [21:55:27] !log Removed 2FA from User:Amakuru [21:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:46] (keep forgetting to log these, sorry) [21:56:48] (CR) Andrew Bogott: "This is good but... is it really openstack2-wide? The attached change is only for designate (but maybe that's the only thing that alerted" [puppet] - https://gerrit.wikimedia.org/r/384892 (https://phabricator.wikimedia.org/T178008) (owner: Dzahn) [21:58:34] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) timed out before a response was received [21:59:25] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy [22:01:09] (PS1) Dzahn: toollabs/icinga: no paging if on labtest [puppet] - https://gerrit.wikimedia.org/r/384893 (https://phabricator.wikimedia.org/T178008) [22:01:39] (CR) jerkins-bot: [V: -1] toollabs/icinga: no paging if on labtest [puppet] - https://gerrit.wikimedia.org/r/384893 (https://phabricator.wikimedia.org/T178008) (owner: Dzahn) [22:01:48] (CR) Dzahn: "you are right, there is another place in the openstack2 module that isnt designate and i should add, will amend" [puppet] - https://gerrit.wikimedia.org/r/384892 (https://phabricator.wikimedia.org/T178008) (owner: Dzahn) [22:02:35] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) timed out before a response was received 
[22:03:55] (PS2) Dzahn: toollabs/icinga: no paging if on labtest [puppet] - https://gerrit.wikimedia.org/r/384893 (https://phabricator.wikimedia.org/T178008) [22:04:19] (CR) Rush: "I think I want to do this through teh normal profile params rather than side door via regex. We have profiles and roles that are labtest " [puppet] - https://gerrit.wikimedia.org/r/384892 (https://phabricator.wikimedia.org/T178008) (owner: Dzahn) [22:04:34] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [22:07:53] (PS2) Dzahn: openstack2: no Icinga paging (SMS) if on labtest [puppet] - https://gerrit.wikimedia.org/r/384892 (https://phabricator.wikimedia.org/T178008) [22:08:35] (CR) Dzahn: "ok, well here is how it looks when it is really all of openstack2 module that is paging. 3 different classes" [puppet] - https://gerrit.wikimedia.org/r/384892 (https://phabricator.wikimedia.org/T178008) (owner: Dzahn) [22:11:34] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 429 (expecting: 200) [22:11:44] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 429 (expecting: 200) [22:12:34] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy [22:12:36] (PS4) BBlack: Various minor improvements/updates [software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/384165 [22:12:38] (PS5) BBlack: Remove multi-head support from strq, move into purger. 
[software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/382865 [22:12:40] (PS5) BBlack: Move all URL parsing and HTTP req generation to receiver [software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/382867 [22:12:42] (PS5) BBlack: Chain the purgers together and split their stats [software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/382868 [22:12:44] (PS5) BBlack: Bump http-parser upstream src to 2.7.1 + fixups [software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/382870 [22:12:46] (PS6) BBlack: Refactor (rewrite?!) purging code [software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/384167 [22:12:48] (PS5) BBlack: strq+purger: refactor, simplify, add queue delays [software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/384433 [22:12:50] (PS7) BBlack: Rework stats further [software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/384434 [22:12:52] (PS7) BBlack: link against jemalloc and tune it a bit [software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/384435 [22:12:54] (PS9) BBlack: Release 0.1.0 [software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/382873 [22:14:24] (PS1) Dzahn: mysql/icinga/labtest: no pages if on labtest, pt.2 [puppet] - https://gerrit.wikimedia.org/r/384895 (https://phabricator.wikimedia.org/T178008) [22:15:44] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy [22:16:37] (CR) Dzahn: "this doesn't require puppet changes, only Hiera, since "$is_critical" is already a parameter. 
it's just like the one we did for check_proc" [puppet] - https://gerrit.wikimedia.org/r/384895 (https://phabricator.wikimedia.org/T178008) (owner: Dzahn) [22:20:00] (CR) Dzahn: "@dba's the same way we could disable SMS for other groups of servers that use the mariadb::role but, for one reason or another, should not" [puppet] - https://gerrit.wikimedia.org/r/384895 (https://phabricator.wikimedia.org/T178008) (owner: Dzahn) [22:21:44] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 429 (expecting: 200) [22:22:44] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy [22:23:44] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 429 (expecting: 200): /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 429 (expecting: 2 [22:25:45] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy [22:28:43] (PS2) Dzahn: screen-monitor: raise WARN to 4 days, lower CRIT to 20 days [puppet] - https://gerrit.wikimedia.org/r/384637 (https://phabricator.wikimedia.org/T165348) [22:30:18] (CR) Dzahn: [C: 2] "as discussed in monitoring meeting, raise WARN limit to a couple days instead of just one, so be more lenient, but also lower the CRIT thr" [puppet] - https://gerrit.wikimedia.org/r/384637 (https://phabricator.wikimedia.org/T165348) (owner: Dzahn) [22:31:38] (CR) Dzahn: "at time of merge there were 2 WARNs and both just a little over a day" [puppet] - https://gerrit.wikimedia.org/r/384637 
(https://phabricator.wikimedia.org/T165348) (owner: Dzahn) [22:35:04] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) timed out before a response was received [22:35:54] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy [22:37:55] Operations, Traffic, Wikidata, wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3692384 (Dzahn) [22:38:14] Operations, Traffic, Wikidata, wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#1293753 (Dzahn) [22:39:47] Operations, Traffic, Wikidata, wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#1293753 (Dzahn) You might hate me for this question but have you considered using an org domain instead of the "se"? Also, i wonder if we... 
[22:42:03] (PS1) Chad: Scap prep: Clean up everything, fix up StartProfiler symlink mess [mediawiki-config] - https://gerrit.wikimedia.org/r/384898 (https://phabricator.wikimedia.org/T126306) [22:42:54] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 429 (expecting: 200) [22:43:55] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy [22:45:43] Operations, Phabricator, Traffic, procurement, HTTPS: wmfusercontent.org SSL cert expires 2017-11-22 - https://phabricator.wikimedia.org/T178443#3692404 (Dzahn) [22:50:05] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 429 (expecting: 200) [22:50:06] Operations, Traffic, Wikimedia-Planet, procurement: *.planet.wikimedia.org SSL cert expires 2017-11-22 - https://phabricator.wikimedia.org/T178444#3692421 (Dzahn) [22:51:04] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy [22:52:23] bblack: oh, so everything is a single certificate nowadays, *.wikipedia.org and *.planet.wikimedia.org and *.wmfusercontent.org that is all just a single one? [22:52:48] in that case.. 
it expires in 35 days and monitoring told us about the special cases planet and wmfusercontent but really means all [22:53:04] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 429 (expecting: 200) [22:54:04] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [22:55:26] Operations, Traffic, Wikimedia-Planet, procurement: *.planet.wikimedia.org SSL cert expires 2017-11-22 - https://phabricator.wikimedia.org/T178444#3692442 (Dzahn) Actually it looks like everything is a single cert now, so that makes this a duplicate of T178443 and it really just means that globa... [22:55:55] Operations, Phabricator, Traffic, procurement, HTTPS: wmfusercontent.org SSL cert expires 2017-11-22 - https://phabricator.wikimedia.org/T178443#3692404 (Dzahn) Actually it looks like everything is a single cert now, so that makes this a duplicate of T178444 and it really just means that glob... 
[22:56:04] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 429 (expecting: 200) [22:58:04] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy [22:58:40] Operations, Services, monitoring: flapping monitoring for cxserver on scb - https://phabricator.wikimedia.org/T178445#3692449 (Dzahn) [22:59:21] Operations, Services, monitoring: flapping monitoring for cxserver on scb - https://phabricator.wikimedia.org/T178445#3692462 (Dzahn) [23:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Evening SWAT (Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171017T2300). [23:00:05] MaxSem and James_F: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:43] * MaxSem waves [23:02:34] Operations, Services, monitoring: flapping monitoring for recommendation_api on scb - https://phabricator.wikimedia.org/T178445#3692478 (Dzahn) [23:02:58] Heya. 
[23:03:14] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 429 (expecting: 200) [23:04:14] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [23:04:20] (Draft1) Paladox: Gerrit: Replace certificates with tokens for its-phabricator [puppet] - https://gerrit.wikimedia.org/r/384901 (https://phabricator.wikimedia.org/T178385) [23:04:25] (PS2) Paladox: Gerrit: Replace certificates with tokens for its-phabricator [puppet] - https://gerrit.wikimedia.org/r/384901 (https://phabricator.wikimedia.org/T178385) [23:04:46] Operations, ops-ulsfo, Traffic: cp4026 memory error - https://phabricator.wikimedia.org/T178011#3692490 (RobH) Open>Resolved memory replacement complete and system returned to service [23:06:21] I can SWAT [23:06:22] (Draft1) Paladox: [labs/private] - https://gerrit.wikimedia.org/r/384902 (https://phabricator.wikimedia.org/T178385) [23:06:25] (PS2) Paladox: Gerrit: Replace certificates with tokens for its-phabricator [labs/private] - https://gerrit.wikimedia.org/r/384902 (https://phabricator.wikimedia.org/T178385) [23:06:46] (PS3) Paladox: Gerrit: Replace certificates with tokens for its-phabricator [puppet] - https://gerrit.wikimedia.org/r/384901 (https://phabricator.wikimedia.org/T178385) [23:06:51] !log disabling Icinga notifications for service recommendation_api on scb hosts - please remember to re-enable once ticket is resolved (T178445) (https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=recommendation_api%20endpoints) [23:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:59] T178445: flapping monitoring for recommendation_api on scb - https://phabricator.wikimedia.org/T178445 [23:09:47] (CR) Thcipriani: [C: 2] "SWAT" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/366869 (owner: Jforrester) [23:10:14] James_F: could you manually rebase: https://gerrit.wikimedia.org/r/#/c/374383/ gerrit is unhappy about it? [23:10:29] Wait. [23:10:30] I did? [23:10:35] Did I not push it? [23:11:01] thcipriani: Scrub it, let's do it later. [23:11:12] okie doke [23:11:53] (Merged) jenkins-bot: Remove setting no longer in MediaWiki [mediawiki-config] - https://gerrit.wikimedia.org/r/366869 (owner: Jforrester) [23:12:08] (CR) jenkins-bot: Remove setting no longer in MediaWiki [mediawiki-config] - https://gerrit.wikimedia.org/r/366869 (owner: Jforrester) [23:12:44] James_F: ^ is live on mwdebug1002, anything you want to test there before it goes out? [23:13:51] nothing appears to explode :) [23:14:19] thcipriani: No, should be good. MW doesn't read it any more. :-) [23:14:34] alrighty, going live [23:15:57] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:366869|Remove setting no longer in MediaWiki]] (duration: 00m 50s) [23:16:04] ^ live now [23:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:07] MaxSem: Shell\Command: Better walltime fallback is live on mwdebug1002, check please [23:18:30] thcipriani: the bug is on job runners only :( I already verified that it fixes the issue on wmf.4 though [23:19:11] MaxSem: ok, will push out [23:22:07] !log thcipriani@tin Synchronized php-1.31.0-wmf.3/includes/shell/Command.php: SWAT: [[gerrit:384790|Shell\Command: Better walltime fallback]] T178314 (duration: 00m 51s) [23:22:13] ^ MaxSem live now [23:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:15] T178314: Transcodes has a very small max walltime limit of 3 minutes - https://phabricator.wikimedia.org/T178314 [23:22:18] wee [23:23:05] will look a bit later when the runners reload [23:23:11] thanks thcipriani [23:23:17] yw :) [23:24:09] James_F: changes for Citoid and for 
VisualEditor for wmf.4 are live on mwdebug1002, check please [23:24:26] Kk. [23:26:21] thcipriani: Yeah, looks good. [23:26:35] ok, both going live, VE first [23:29:05] !log thcipriani@tin Synchronized php-1.31.0-wmf.4/extensions/VisualEditor/modules/ve-mw/ui/styles/widgets/ve.ui.MWMediaInfoFieldWidget.css: SWAT: [[gerrit:384748|ve.ui.MWMediaInfoFieldWidget: Fix positioning of icons]] T178415 (duration: 00m 50s) [23:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:12] T178415: [Regression pre-wmf.4] A solid black icon appearing over an image after selecting it inside the Media Settings dialog - https://phabricator.wikimedia.org/T178415 [23:31:15] !log thcipriani@tin Synchronized php-1.31.0-wmf.4/extensions/Citoid/modules/ve.ui.CiteFromIdInspector.css: SWAT: [[gerrit:384817|ve.ui.CiteFromIdInspector: Fix CSS for context menus after changes in OOjs UI]] T178324 (duration: 00m 50s) [23:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:22] T178324: VE's context menus are badly broken in beta - https://phabricator.wikimedia.org/T178324 [23:31:26] ^ James_F all live [23:31:45] Excellent, thanks! [23:32:03] yw :) [23:35:49] thcipriani: confirmed worked [23:37:02] MaxSem: cool, thanks for checking! [23:49:59] (PS1) Gergő Tisza: Deploy ReadingLists to the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/384908 (https://phabricator.wikimedia.org/T174651)