[00:00:41] (03PS1) 10Chad: Scap clean: ensure proper quotation of deletion commands in keep-static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346481 [00:02:16] 06Operations, 10Parsoid: Upload of Parsoid deb package 0.7.0 failed - https://phabricator.wikimedia.org/T162200#3155844 (10Dzahn) I moved the old files above out of "incoming", on tin i [tin:~] $ sudo rm /tmp/parsoid_0.7.0all_amd64.bromine.eqiad.wmnet.upload to be able to repeat the upload. I deleted the pack... [00:03:46] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [00:08:11] !log tstarling@tin Synchronized php-1.29.0-wmf.18/extensions/ParserMigration/includes/MigrationEditPage.php: for bug fix gerrit 346478 (duration: 00m 56s) [00:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:24] (03PS1) 10Cmjohnson: Adding mgmt dns entries for new hadoop nodes analytics1058-1069 [dns] - 10https://gerrit.wikimedia.org/r/346483 [00:19:36] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:20:09] (03PS4) 10Dzahn: releases: remove the precise suite [puppet] - 10https://gerrit.wikimedia.org/r/345838 (owner: 10Faidon Liambotis) [00:22:31] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns entries for new hadoop nodes analytics1058-1069 [dns] - 10https://gerrit.wikimedia.org/r/346483 (owner: 10Cmjohnson) [00:25:18] 06Operations, 10ops-eqiad: rack and set up analyics1058-1069 - https://phabricator.wikimedia.org/T162216#3155871 (10Cmjohnson) [00:25:36] PROBLEM - puppet last run on mc1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:29:59] !log tstarling@tin Synchronized php-1.29.0-wmf.18/extensions/ParserMigration/includes: scap test only, no code changes (duration: 01m 21s) [00:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:41] !log demon@tin Started scap: wmf.14 again, testing testing [00:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:04] (03PS2) 10Chad: Scap clean: ensure proper quotation of deletion commands in keep-static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346481 [00:31:09] (03CR) 10Chad: [C: 032] Scap clean: ensure proper quotation of deletion commands in keep-static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346481 (owner: 10Chad) [00:32:10] (03Merged) 10jenkins-bot: Scap clean: ensure proper quotation of deletion commands in keep-static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346481 (owner: 10Chad) [00:32:24] (03CR) 10jenkins-bot: Scap clean: ensure proper quotation of deletion commands in keep-static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346481 (owner: 10Chad) [00:36:56] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:42:46] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [00:47:36] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [00:47:46] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [00:52:36] PROBLEM - Nginx local proxy to apache on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:52:46] PROBLEM - Apache HTTP on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:52:47] PROBLEM - HHVM rendering on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:53:36] RECOVERY - puppet last run on mc1030 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [00:53:46] PROBLEM - HHVM jobrunner on mw1165 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:06] PROBLEM - HHVM rendering on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:16] PROBLEM - Apache HTTP on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:36] PROBLEM - Nginx local proxy to apache on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:56] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:57:30] !log demon@tin Finished scap: wmf.14 again, testing testing (duration: 26m 48s) [00:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:46] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [01:04:56] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [01:09:46] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [01:12:31] (03CR) 10Zppix: [C: 031] "Considering we depricated precise in prod and as well as releng i say we go ahead and merge this." 
[puppet] - 10https://gerrit.wikimedia.org/r/345838 (owner: 10Faidon Liambotis) [01:16:46] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [01:17:44] !log demon@tin Synchronized scap/plugins/clean.py: fixes (duration: 00m 41s) [01:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:18:26] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [01:21:46] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [01:23:26] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [01:24:56] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [01:26:37] !log tstarling@tin Synchronized php-1.29.0-wmf.18/extensions/ParserMigration/includes: scap test only, no code changes (duration: 00m 40s) [01:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:48] !log tstarling@tin Synchronized php-1.29.0-wmf.18/extensions/ParserMigration/includes: scap test only, no code changes (duration: 00m 39s) [01:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:29:46] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 81583.267689 Seconds [01:29:56] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 82325.387609 Seconds [01:32:26] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 82477.859562 Seconds [01:32:26] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 82479.834204 Seconds [01:32:36] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 81751.161079 Seconds [01:32:37] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 81754.352817 Seconds [01:35:26] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [01:38:26] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [01:41:26] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 83019.890733 Seconds [01:50:26] PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:50:36] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [01:50:36] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [01:53:36] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 83014.244867 Seconds [01:54:26] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 20.106499 Seconds [01:54:26] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 22.203747 Seconds [01:54:36] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 12.285725 Seconds [01:54:46] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 21.221803 Seconds [01:54:56] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 47.820263 Seconds [01:59:24] 06Operations, 10Parsoid: Upload of Parsoid deb package 0.7.0 failed - https://phabricator.wikimedia.org/T162200#3156004 (10ssastry) 05Open>03Resolved p:05Triage>03Normal a:03Dzahn [01:59:52] 06Operations, 10Parsoid: Upload of Parsoid deb package 0.7.0 failed - https://phabricator.wikimedia.org/T162200#3155315 (10ssastry) Confirmed. apt-get install parsoid installs the newer version now. [02:05:08] (03PS1) 10Andrew Bogott: Keystonehooks: Create and delete sudoer rules in ldap [puppet] - 10https://gerrit.wikimedia.org/r/346489 (https://phabricator.wikimedia.org/T150091) [02:10:06] RECOVERY - Keystone admin and observer projects exist on labtestnet2001 is OK: Keystone projects exist and have matching names and ids. [02:18:26] RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [02:31:05] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.18) (duration: 08m 47s) [02:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:12] 06Operations, 10Traffic, 06WMF-Design, 10Wikimedia-General-or-Unknown, 07Design: Better WMF error pages - https://phabricator.wikimedia.org/T76560#3156017 (10Krinkle) [02:32:24] 06Operations, 10Traffic, 06WMF-Design, 10Wikimedia-General-or-Unknown, 07Design: Better WMF error pages - https://phabricator.wikimedia.org/T76560#805779 (10Krinkle) [02:49:46] RECOVERY - Hadoop DataNode on analytics1054 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [02:50:26] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [02:57:25] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.19) (duration: 07m 22s) [02:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:03:18] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Apr 5 03:03:18 UTC 2017 (duration 5m 53s) [03:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:04:56] PROBLEM - puppet last run on wtp1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:18:10] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. 
djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3156034 (10Revent) https://commons.wikimedia.org/wiki/File:Walking_Keage_Incline.webm reappeared, and has been reset [03:31:56] RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [03:35:56] PROBLEM - HP RAID on db2037 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [03:55:46] RECOVERY - HP RAID on db2037 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Battery/Capacitor [04:35:56] PROBLEM - HP RAID on db2037 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [04:38:56] PROBLEM - puppet last run on mc1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:42:26] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [04:52:26] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [04:55:36] RECOVERY - HP RAID on db2037 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Battery/Capacitor [05:05:56] RECOVERY - puppet last run on mc1033 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [05:07:56] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:35:56] RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [05:38:48] PROBLEM - MariaDB disk space on db1047 is CRITICAL: DISK CRITICAL - free space: / 419 MB (5% inode=72%) [05:44:47] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3150362 (10Pokefan95) All files listed here works for me except https:/... [05:47:26] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3156223 (10Pokefan95) https://upload.wikimedia.org/wikipedia/commons/th... [05:57:48] RECOVERY - MariaDB disk space on db1047 is OK: DISK OK [05:59:20] 06Operations, 10DBA: dbstore1002 in bad shape - https://phabricator.wikimedia.org/T162212#3156234 (10Marostegui) Might be related to the work that has been done by some analysts with some SUPER heavy queries in the last few days... [05:59:26] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3205.70 Read Requests/Sec=5661.00 Write Requests/Sec=9.70 KBytes Read/Sec=22650.80 KBytes_Written/Sec=2910.00 [06:01:06] 06Operations, 10Traffic, 05MW-1.28-release (WMF-deploy-2016-08-09_(1.28.0-wmf.14)), 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#3156237 (10Tbayer) >>! In T107430#2886903, @Tbayer wrote: >>>! In T107430#2882009, @fgiunchedi wrote: >>>>! In T107430#288195... 
[06:03:11] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2061" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346497 [06:03:15] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2061" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346497 [06:05:11] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2061" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346497 (owner: 10Marostegui) [06:06:34] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2061" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346497 (owner: 10Marostegui) [06:06:44] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2061" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346497 (owner: 10Marostegui) [06:07:36] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2061 - T160390 (duration: 00m 40s) [06:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:44] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [06:09:26] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.40 Read Requests/Sec=0.20 Write Requests/Sec=0.90 KBytes Read/Sec=1.20 KBytes_Written/Sec=21.20 [06:15:36] (03PS1) 10Marostegui: db-codfw.php: Depool db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346498 (https://phabricator.wikimedia.org/T160390) [06:20:53] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346498 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [06:21:53] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346498 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [06:22:02] (03CR) 10jenkins-bot: db-codfw.php: Depool db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346498 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [06:22:47] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2054 - T160390 (duration: 00m 43s) [06:22:48] !log Deploy schema change db2054 (s7) - https://phabricator.wikimedia.org/T160390 [06:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:54] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [06:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:39] PROBLEM - MariaDB Slave Lag: s5 on db1070 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 645.98 seconds [06:24:49] came back from downtime [06:25:28] downtimed again [06:25:37] it is depooled anyways [06:32:00] checking mw1223 and mw1288 [06:33:36] PROBLEM - Host cr2-esams is DOWN: CRITICAL - Network Unreachable (91.198.174.244) [06:33:43] !log restart hhvm on mw1223 (hhvm-dump-debug in /tmp/hhvm.2164.bt.) [06:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:55] cr2-esams down?? 
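The window above shows the routine depool → schema change → repool cycle: each step is a Gerrit change against wmf-config (db-codfw.php here) that gets merged and then synced to the fleet, rather than any operation run on the database itself. A minimal sketch of what such a depool amounts to, assuming the sectionLoads layout of those files; weights are hypothetical and the real file carries many more sections, hosts and comments:

```php
<?php
// Hypothetical, simplified excerpt in the spirit of wmf-config/db-codfw.php.
// A replica is depooled by removing it (or zeroing its weight) in its section,
// then the single file is pushed out, e.g.
// `scap sync-file wmf-config/db-codfw.php "Depool db2047 - T160390"`.
$sectionLoads = [
    's7' => [
        // 'db2047' => 100,  // depooled while its schema change runs (T160390)
        'db2054' => 100,     // repooled once its schema change finished
        'db2061' => 100,     // repooled earlier in this window
    ],
];
```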
[06:34:02] paravoid: ---^ [06:34:06] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/1/3: down - Core: cr2-esams:xe-0/1/3 (Level3, BDFS2448, 84ms) {#2013} [10Gbps wave]BR [06:35:06] RECOVERY - Apache HTTP on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.024 second response time [06:35:06] RECOVERY - HHVM rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 74750 bytes in 0.183 second response time [06:35:26] RECOVERY - Nginx local proxy to apache on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.026 second response time [06:36:55] !log restart hhvm on mw1288 (hhvm-dump-debug in /tmp/hhvm.92520.bt.) [06:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:56] PROBLEM - Host cr2-esams IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:862:ffff::3) [06:38:36] RECOVERY - Apache HTTP on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.037 second response time [06:38:36] RECOVERY - Nginx local proxy to apache on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.040 second response time [06:39:06] RECOVERY - HHVM rendering on mw1288 is OK: HTTP OK: HTTP/1.1 200 OK - 74749 bytes in 0.110 second response time [06:40:36] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [06:40:36] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [06:43:19] 06Operations, 10ops-eqiad, 10DBA: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3156329 (10Marostegui) p:05Triage>03Normal [06:43:34] (03PS1) 10Muehlenhoff: Remove access for adavenport [puppet] - 10https://gerrit.wikimedia.org/r/346502 [06:44:36] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:45:36] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:48:06] (03PS1) 10Marostegui: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346503 (https://phabricator.wikimedia.org/T161088) [06:48:46] (03CR) 10Muehlenhoff: [C: 032] Remove access for adavenport [puppet] - 10https://gerrit.wikimedia.org/r/346502 (owner: 10Muehlenhoff) [06:49:55] (03PS1) 10Elukey: Depool esams due to networking failures [dns] - 10https://gerrit.wikimedia.org/r/346504 [06:50:39] (03CR) 10Elukey: [C: 04-1] "Not needed at the moment." 
[dns] - 10https://gerrit.wikimedia.org/r/346504 (owner: 10Elukey) [06:52:36] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346503 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [06:53:56] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346503 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [06:54:09] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346503 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [06:55:03] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db1081 - T161088 (duration: 00m 39s) [06:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:10] T161088: Defragment s4: db1091, db1084, db1081, d1059 and probably the rest - https://phabricator.wikimedia.org/T161088 [06:55:54] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1081 - T161088 (duration: 00m 39s) [06:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:41] !log Stop replication on db1081 for maintenance - T161088 [06:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:06] PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100% [07:21:16] RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 86.47 ms [07:37:09] 06Operations, 10DBA: dbstore1002 in bad shape - https://phabricator.wikimedia.org/T162212#3156441 (10Marostegui) As I mentioned here: T159430#3153285 I would like to convert a couple of enwiki tables to InnoDB+compression to see if it helps this: https://jira.mariadb.org/browse/MDEV-9027 which we are suffering... [07:42:11] 06Operations, 10DBA: dbstore1002 in bad shape - https://phabricator.wikimedia.org/T162212#3156444 (10jcrespo) 05Open>03Resolved a:03jcrespo Sure. For now I will close this as it seems healthy again. [07:44:26] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 33 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [07:44:42] !log Migrate dbstore1002 enwiki.page and enwiki.categorylinks from TokuDB to InnoDB+compression - T159430 [07:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:49] T159430: convert dbstore1001 to InnoDB compressed by importing db shards to it - https://phabricator.wikimedia.org/T159430 [07:47:56] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:48:46] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [07:49:26] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [07:50:49] _joe_: Hi, do you know about our hhvm settings regarding gc? [07:50:56] or anyone else? [07:54:54] 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3156449 (10faidon) [07:56:32] <_joe_> hoo: I don't think we have specialized settings for GC, but what do you refer to specifically? 
[07:58:37] _joe_: https://phabricator.wikimedia.org/T161695 [08:00:06] The saddest part is that HHVM still seems to leak memory if we force GC runs, just way slower [08:00:41] (03CR) 10Muehlenhoff: "While we have deprecated precise in production and labs by the end of March, support by Canonical extends until the 26th of April, so I th" [puppet] - 10https://gerrit.wikimedia.org/r/345838 (owner: 10Faidon Liambotis) [08:01:15] <_joe_> hoo: oh you mean actual GC within the execution of one script? that' not the way php is usually behaving [08:02:03] PHP is doing GC at some… I'm just not sure when it decides to do so [08:07:05] (03PS6) 10Ema: cache_upload: override CT updates on 304s [puppet] - 10https://gerrit.wikimedia.org/r/346304 (https://phabricator.wikimedia.org/T162035) [08:08:42] (03CR) 10Ema: cache_upload: override CT updates on 304s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346304 (https://phabricator.wikimedia.org/T162035) (owner: 10Ema) [08:10:07] (03CR) 10Ema: [V: 032 C: 032] cache_upload: override CT updates on 304s [puppet] - 10https://gerrit.wikimedia.org/r/346304 (https://phabricator.wikimedia.org/T162035) (owner: 10Ema) [08:10:12] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1034 temporarilly to run ANALYZE on revision" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346319 [08:12:10] _joe_: https://phabricator.wikimedia.org/T161695#3156472 :S [08:12:14] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1034 temporarilly to run ANALYZE on revision" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346319 (owner: 10Jcrespo) [08:13:04] <_joe_> hoo: I'll take a look, but I think it's the usual cli-vs-fcgi-best-settings [08:13:19] hm [08:13:25] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1034 temporarilly to run ANALYZE on revision" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346319 (owner: 10Jcrespo) [08:13:37] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1034 temporarilly to run ANALYZE on revision" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346319 (owner: 10Jcrespo) [08:14:29] hoo: oh !!! [08:15:07] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1034 after maintenance (duration: 00m 40s) [08:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:20] hoo: thanks for the garbage collection hint. The Wikidata php5 job that fails due to leak memory, maybe that could be more or less fixed by enabling gc again [08:15:30] hoo: we had it disabled when running phpunit due to segfaulting [08:15:51] I thought that was only a php 5.3 hack [08:15:55] and long gone [08:16:07] yeah [08:18:10] 06Operations, 07HHVM: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ) - https://phabricator.wikimedia.org/T161684#3156490 (10MoritzMuehlenhoff) This has been fixed upstream, I'm currently building a new package (and also rebasing to HHVM 3.18.2 while at it) to validate... [08:18:15] hashar: You can live hack it to occasionally collect and see how that goes, I guess [08:18:25] hoo: that segfaulted on trusty as well :(according to https://phabricator.wikimedia.org/T142158 [08:20:11] Oh, totally forgot about that one [08:21:04] (03PS1) 10Elukey: Increase Redis connection timeout for MediaWiki Jobrunners [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346508 (https://phabricator.wikimedia.org/T125735) [08:21:41] hashar: Hm, a lot has changed since… maybe be bold and just try it? 
Can hardly be worse than the current situation [08:21:52] it = enable gc again [08:25:44] (03PS1) 10Ema: cache_upload: lower keep from 3d to 1d on upload backends [puppet] - 10https://gerrit.wikimedia.org/r/346510 (https://phabricator.wikimedia.org/T162035) [08:25:59] <_joe_> hoo: what about we enable gc in hhvm cli just for dumps for now? [08:26:10] Sounds good to me [08:27:23] <_joe_> the point is GC is useful just for long-running scripts, not for web requests, as the whole memory for a request gets garbage collected at the end of the request itself [08:27:48] <_joe_> modulo some shared inner structures that won't be affected by that GC anyways [08:28:08] <_joe_> so it's pointless and potentially harmful to do gc on web requests [08:28:08] That makes sense [08:28:30] <_joe_> that's the beauty and ugliness of php for the web at the same time :P [08:29:14] Indeed [08:33:56] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:37:13] (03CR) 10Marostegui: "Just for the record: looks like this has been working fine and so far dbstore1002 hasn't complained about timeouts today." [puppet] - 10https://gerrit.wikimedia.org/r/345372 (owner: 10Marostegui) [08:39:22] contint1001 fails with Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Attempt to assign to a reserved variable name: 'trusted' on node contint1001.wikimedia.org [08:39:59] one time issue though [08:40:11] (03CR) 10Ema: [V: 032 C: 032] cache_upload: lower keep from 3d to 1d on upload backends [puppet] - 10https://gerrit.wikimedia.org/r/346510 (https://phabricator.wikimedia.org/T162035) (owner: 10Ema) [08:40:56] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [08:43:44] !log Ran scap pull on mwdebug1001 to revert local changes to Wikibase maintenance scripts [08:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:30] (03CR) 10Giuseppe Lavagetto: Add tasks for stage 0 (033 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/346305 (owner: 10Giuseppe Lavagetto) [08:47:29] <_joe_> hoo: I don't really have time to work on that now though [08:48:01] _joe_: Ok, I'll open a task then… maybe I can also look into that myself, let's ee [08:48:59] <_joe_> hoo: sorry, I really have my hands full :( [08:49:14] I can relate to that [08:51:31] 06Operations, 10Datasets-General-or-Unknown, 07HHVM, 10Wikidata: Enable GC for HHVM CLI (at least for dump runners) - https://phabricator.wikimedia.org/T162245#3156596 (10hoo) [08:53:41] (03PS1) 10Giuseppe Lavagetto: role::cluster::management: add profile::conftool::client [puppet] - 10https://gerrit.wikimedia.org/r/346511 [08:54:06] (03CR) 10Jcrespo: "> Just for the record: looks like this has been working fine and so" [puppet] - 10https://gerrit.wikimedia.org/r/345372 (owner: 10Marostegui) [08:54:50] (03CR) 10Marostegui: "> > Just for the record: looks like this has been working fine and so" [puppet] - 10https://gerrit.wikimedia.org/r/345372 (owner: 10Marostegui) [08:56:28] (03CR) 10Giuseppe Lavagetto: [C: 032] role::cluster::management: add profile::conftool::client [puppet] - 10https://gerrit.wikimedia.org/r/346511 (owner: 10Giuseppe Lavagetto) [08:58:05] (03CR) 10Jcrespo: "> > > Just for the record: looks like this has been working fine and" [puppet] - 10https://gerrit.wikimedia.org/r/345372 (owner: 10Marostegui) [09:03:13] (03PS3) 10Giuseppe Lavagetto: Add tasks for stage 0 
[switchdc] - 10https://gerrit.wikimedia.org/r/346305 [09:04:41] !log deleted the 2 swift thumbs that were making swiftrepl stuck in a loop: T162122 [09:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:48] T162122: Swiftrepl was stuck in an infinite loop since days - https://phabricator.wikimedia.org/T162122 [09:11:43] !log reimage analytics1057 to Debian Jessie [09:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:43] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3156645 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1057.eqiad.wmnet'] ``` The log can... [09:20:26] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:22:50] (03CR) 10Volans: [C: 031] "LGTM" [switchdc] - 10https://gerrit.wikimedia.org/r/346305 (owner: 10Giuseppe Lavagetto) [09:22:56] PROBLEM - puppet last run on wtp1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:25:48] (03CR) 10Giuseppe Lavagetto: [C: 032] Add tasks for stage 0 [switchdc] - 10https://gerrit.wikimedia.org/r/346305 (owner: 10Giuseppe Lavagetto) [09:32:31] (03CR) 10Giuseppe Lavagetto: Fix the stop-maintenance task (031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/346306 (owner: 10Giuseppe Lavagetto) [09:36:29] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3156720 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1057.eqiad.wmnet'] ``` and were **ALL** successful. [09:36:33] 06Operations, 10Mail, 10Wikimedia-Mailing-lists: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3156721 (10Nemo_bis) [09:48:23] !log deleted a third swift thumb that was making swiftrepl stuck in a loop: T162122 [09:48:26] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [09:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:31] T162122: Swiftrepl was stuck in an infinite loop since days - https://phabricator.wikimedia.org/T162122 [09:48:46] PROBLEM - puppet last run on mw1194 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:49:05] 06Operations, 10media-storage: Swiftrepl was stuck in an infinite loop since days - https://phabricator.wikimedia.org/T162122#3156746 (10Volans) The third one was: ``` wikipedia-commons-local-thumb.3b 3/3b/Hendrick_de_Keyser_-_gulden_cabinet.png/85px-Hendrick_de_Keyser_-_gulden_cabinet.png E-Tag m... 
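Circling back to the earlier HHVM garbage-collection thread (T162245, enabling GC for long-running CLI dump runners rather than for web requests): a minimal, illustrative sketch of the idea in plain PHP. How much gc_collect_cycles() actually reclaims under a given HHVM build varies, so this only demonstrates the pattern _joe_ describes, not the eventual production change.

```php
<?php
// Illustrative only. Per-request memory is discarded when a web request ends,
// so cycle collection mainly pays off in long-running CLI scripts such as
// dump runners, where reference cycles would otherwise accumulate for hours.
gc_enable(); // ensure the cycle collector is on for this CLI run

for ( $i = 1; $i <= 100000; $i++ ) {
    // Stand-in for real per-batch work that can create reference cycles.
    $a = new stdClass();
    $b = new stdClass();
    $a->other = $b;
    $b->other = $a;

    if ( $i % 10000 === 0 ) {
        // Periodically collect cycles so memory does not grow unbounded.
        $collected = gc_collect_cycles();
        fwrite( STDERR, "collected $collected cycles, " .
            memory_get_usage( true ) . " bytes in use\n" );
    }
}
```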
[09:49:56] RECOVERY - puppet last run on wtp1021 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [10:16:56] RECOVERY - puppet last run on mw1194 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [10:18:38] (03PS3) 10Giuseppe Lavagetto: Fix the stop-maintenance task [switchdc] - 10https://gerrit.wikimedia.org/r/346306 [10:23:19] (03CR) 10Volans: [C: 031] "LGTM" [switchdc] - 10https://gerrit.wikimedia.org/r/346306 (owner: 10Giuseppe Lavagetto) [10:24:04] (03CR) 10Giuseppe Lavagetto: Add task to restore the TTL of discovery entries to 5 minutes (032 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/346311 (owner: 10Giuseppe Lavagetto) [10:24:25] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix the stop-maintenance task [switchdc] - 10https://gerrit.wikimedia.org/r/346306 (owner: 10Giuseppe Lavagetto) [10:24:39] (03PS3) 10Giuseppe Lavagetto: Propery re-reference the redis task [switchdc] - 10https://gerrit.wikimedia.org/r/346307 [10:24:54] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Propery re-reference the redis task [switchdc] - 10https://gerrit.wikimedia.org/r/346307 (owner: 10Giuseppe Lavagetto) [10:25:10] (03PS3) 10Giuseppe Lavagetto: Update the varnish task to use the new puppet scripts [switchdc] - 10https://gerrit.wikimedia.org/r/346308 [10:25:30] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Update the varnish task to use the new puppet scripts [switchdc] - 10https://gerrit.wikimedia.org/r/346308 (owner: 10Giuseppe Lavagetto) [10:26:21] (03PS3) 10Giuseppe Lavagetto: Modify the start maintenance script [switchdc] - 10https://gerrit.wikimedia.org/r/346309 [10:26:52] (03CR) 10Giuseppe Lavagetto: [C: 032] Modify the start maintenance script [switchdc] - 10https://gerrit.wikimedia.org/r/346309 (owner: 10Giuseppe Lavagetto) [10:27:54] 06Operations, 10ops-eqiad: decommission ms1003 - https://phabricator.wikimedia.org/T157975#3156876 (10ArielGlenn) @Cmjohnson AFAIK there's only removing it from dhcp. I should go ahead and do that then? Anything I missed? [10:31:34] 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3151609 (10MoritzMuehlenhoff) @ayounsi I've added you to pwstore and re-encrypted the password files. Docs can be found at https://office.wikimedia.org/wiki/Pwsto... [10:31:45] 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3156881 (10MoritzMuehlenhoff) [10:53:56] PROBLEM - puppet last run on wdqs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:54:24] 06Operations, 07HHVM: HHVM 3.18 crashes when Cirrus tries to fetch another wiki config via maint script - https://phabricator.wikimedia.org/T161520#3156972 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff So this is a generic problem with depleting the HHVM byte code cache and would've happe... 
[10:54:26] 06Operations, 07HHVM, 13Patch-For-Review, 07Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3156977 (10MoritzMuehlenhoff) [10:54:59] (03PS1) 10Addshore: wmgUseInterwikiSorting true for wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346523 (https://phabricator.wikimedia.org/T162253) [10:56:17] (03CR) 10Addshore: [C: 04-2] "Waiting for the 24th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346523 (https://phabricator.wikimedia.org/T162253) (owner: 10Addshore) [10:56:49] (03PS2) 10Addshore: wmgUseInterwikiSorting true for wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346523 (https://phabricator.wikimedia.org/T162253) [10:57:01] 06Operations, 07HHVM: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ) - https://phabricator.wikimedia.org/T161684#3156988 (10MoritzMuehlenhoff) [10:57:04] 06Operations, 07HHVM, 13Patch-For-Review, 07Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3156987 (10MoritzMuehlenhoff) [10:59:02] (03PS1) 10Addshore: Deploy Cognate to production wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346524 (https://phabricator.wikimedia.org/T150182) [11:04:02] (03PS3) 10Addshore: Remove Wikibase vs Interwikisorting checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341127 (https://phabricator.wikimedia.org/T150183) [11:06:13] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2054" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346526 [11:06:19] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2054" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346526 [11:07:01] (03CR) 10Addshore: [C: 04-2] Deploy Cognate to production wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346524 (https://phabricator.wikimedia.org/T150182) (owner: 10Addshore) [11:08:47] (03PS1) 10Jcrespo: mariadb: Depool db1055 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346527 (https://phabricator.wikimedia.org/T159319) [11:09:19] (03PS7) 10Muehlenhoff: Provide a systemd override unit for memcached [puppet] - 10https://gerrit.wikimedia.org/r/319820 [11:10:47] <_joe_> moritzm: what is the upstream bug for the hhvm issue? [11:10:59] (03PS3) 10Giuseppe Lavagetto: Add phase-9 varnish puppet run to restore order to dc_from [switchdc] - 10https://gerrit.wikimedia.org/r/346310 [11:13:00] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "I'm not merging this because there are pending changes to the procedure." [switchdc] - 10https://gerrit.wikimedia.org/r/346310 (owner: 10Giuseppe Lavagetto) [11:13:09] _joe_: which one? the stat_cache crash or the stat_cache deadlock? [11:13:16] <_joe_> the latter [11:13:46] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [11:14:02] _joe_: https://github.com/facebook/hhvm/issues/7756 I'm currently building a new 3.18.2 package with the patch on top [11:14:54] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3157002 (10elukey) @aaron thanks a lot for the feedback, I created a code change that... 
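The comment just above refers to T125735 ("timed out after 0.2 seconds when connecting to rdb1001") and to a change raising the Redis connection timeout for the MediaWiki jobrunners. A rough sketch of the kind of setting involved, assuming MediaWiki's JobQueueRedis; the server name is taken from the alert, while the values and exact structure are hypothetical rather than the actual wmf-config change:

```php
<?php
// Hypothetical sketch; the real change lives in wmf-config and uses the
// production server lists and password handling.
$wgJobTypeConf['default'] = [
    'class'       => 'JobQueueRedis',
    'redisServer' => 'rdb1001.eqiad.wmnet', // Redis master named in the alert
    'redisConfig' => [
        // Raised from a very tight 0.2s so transient network latency no longer
        // surfaces as "Connection timed out" warnings on the jobrunners.
        'connectTimeout' => 1.0,
        'readTimeout'    => 1.0, // hypothetical companion setting
    ],
    'daemonized'  => true,
];
```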
[11:15:54] (03CR) 10Jcrespo: [C: 032] "Heads up" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346527 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo) [11:16:14] (03CR) 10Marostegui: "thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346527 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo) [11:16:26] <_joe_> moritzm: nice, thanks [11:17:00] (03CR) 10Jcrespo: [C: 032] "I will take 3 days more or less :-/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346527 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo) [11:17:12] (03Merged) 10jenkins-bot: mariadb: Depool db1055 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346527 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo) [11:17:21] (03CR) 10jenkins-bot: mariadb: Depool db1055 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346527 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo) [11:17:29] (03CR) 10Milimetric: [C: 032] "The dependent change was merged." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344209 (owner: 10Milimetric) [11:18:11] (03PS2) 10Milimetric: Revert "Restore Dashiki config in CommonSettings for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344209 [11:18:46] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [11:18:57] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1055 for maintenance (duration: 00m 40s) [11:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:47] (03CR) 10jenkins-bot: Revert "Restore Dashiki config in CommonSettings for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344209 (owner: 10Milimetric) [11:20:02] moritzm: ping me when you want to do the postgres update, sorry I was a bit sidetracked in the morning [11:20:20] * moritzm too, we can do it now if you want [11:21:07] (03PS3) 10Marostegui: Revert "db-codfw.php: Depool db2054" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346526 [11:21:12] sure, if the situation is stable [11:22:02] are you taking care of the upgrade of the postgres on the puppetdb master/slave hosts and I just have to take care of puppetdb service restart? [11:22:20] do we have a procedure (for the first part)? (apart doing the slave first ofc ) :D [11:22:56] RECOVERY - puppet last run on wdqs1002 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [11:24:10] (03PS2) 10Volans: Puppet: do not deactivate hosts in PuppetDB automatically [puppet] - 10https://gerrit.wikimedia.org/r/346110 (https://phabricator.wikimedia.org/T159163) [11:24:36] PROBLEM - puppet last run on db1063 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:24:41] we can doublecheck with akosiaris, but from what I can tell the procedure boils down to "shut up the icinga-bot and upgrade postgres" :-) [11:24:52] for that part yes [11:24:56] PROBLEM - puppet last run on pc1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:25:03] I was wondering for the postgres replication [11:25:06] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:06] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
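The replication question raised here is normally answered by querying pg_stat_replication on the Postgres primary, which is what the check a bit further down shows (the "backend_start | 2017-04-05 11:50:52" line). A small stand-alone sketch of such a check, assuming PHP's PDO pgsql driver and hypothetical connection details:

```php
<?php
// Hypothetical sketch of the replication check discussed above: on the primary
// (nitrogen in this case), pg_stat_replication lists one row per connected
// standby, including when its connection started (backend_start).
$pdo = new PDO( 'pgsql:host=localhost;dbname=postgres', 'postgres', 'secret' ); // hypothetical credentials
$pdo->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION );

$rows = $pdo->query(
    'SELECT client_addr, state, backend_start FROM pg_stat_replication'
)->fetchAll( PDO::FETCH_ASSOC );

if ( !$rows ) {
    fwrite( STDERR, "no standby connected - replication is down\n" );
    exit( 1 );
}
foreach ( $rows as $row ) {
    // After the upgrade/restart, backend_start should be recent.
    echo "{$row['client_addr']} {$row['state']} since {$row['backend_start']}\n";
}
```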
[11:25:56] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [11:25:56] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [11:27:39] it's just a minor version bump, there should be no difference in replication, but we can wait for akosiaris to comment [11:28:43] like in mysql/mariadb usually we depool + stop the replica before stopping mysql for a cleaner shutdown [11:29:04] not sure for the equivalent here, if any [11:30:39] (03PS2) 10Elukey: Apply role memcached to the new mc1* hosts [puppet] - 10https://gerrit.wikimedia.org/r/337010 (https://phabricator.wikimedia.org/T137345) [11:33:24] moritzm: anyway I can check the replication status and connections on postgres [11:33:37] ok [11:36:32] let me know if you want do it now or wait for a feedback ;) [11:37:54] I think we can do it now [11:38:09] agree [11:38:35] given that I need to run puppet after the merge either you upgrade before or after to avoid puppet doing stuff while upgrading [11:38:38] any preference? [11:39:18] I'm fine both ways [11:39:53] same here, merge first, run puppet and disable puppet so then you're free to upgrade? [11:40:10] _joe_ is https://gerrit.wikimedia.org/r/337010 good to go right? We'll flip it to mediawiki::memcached after the switchover [11:40:15] volans: sounds good [11:40:25] ok proceeding [11:40:39] <_joe_> elukey: +1 [11:40:45] super [11:40:52] elukey: let me merge this change please [11:40:54] one sec :D [11:40:56] (03CR) 10Volans: [C: 032] Puppet: do not deactivate hosts in PuppetDB automatically [puppet] - 10https://gerrit.wikimedia.org/r/346110 (https://phabricator.wikimedia.org/T159163) (owner: 10Volans) [11:41:03] (03CR) 10Giuseppe Lavagetto: [C: 031] Apply role memcached to the new mc1* hosts [puppet] - 10https://gerrit.wikimedia.org/r/337010 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [11:42:20] moritzm: yeah, no way that upgrade is gonna affect replication [11:42:33] !log disabling ircecho for the merge of gerrit/346110 ( T159163 ) and postgres upgrade [11:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:41] T159163: PuppetDB is auto-deactivating hosts - https://phabricator.wikimedia.org/T159163 [11:43:29] (03PS15) 10Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) [11:45:16] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3157056 (10Superzerocool) >>! In T161836#3146753, @Superzerocool wrote: > Hi, I'll add another one: https://commons.wikim... [11:45:41] moritzm: all yours [11:45:47] puppet disabled on nihal/nitrogen [11:46:31] ok, starting with the slave, then (nihal) [11:47:22] volans: nihal upgraded, could you briefly check the replication status? [11:47:26] disabled also on einsteinium to not start again ircecho in few minutes [11:47:35] * volans checking moritzm [11:48:01] mmmh no replication [11:48:41] wait [11:49:07] the replication is on nitrogen [11:49:43] yeah sorry my bad [11:49:50] postgres is the other way around [11:50:16] ok, proceeding with update on nitrogen, then [11:50:31] ok, looks good [11:50:45] done [11:51:10] you sure they got restarted? 
:D [11:51:39] too quick to be true :D [11:51:49] yeah, all the postgres procs are from 11:50 [11:52:07] backend_start | 2017-04-05 11:50:52.335469+00 [11:52:08] yep [11:52:13] looks good so far [11:52:23] great [11:52:31] I see clients connections [11:53:15] yeah, I think we can re-enable puppet [11:54:19] sure on those 2 [11:54:28] I'm gathering a list of failed puppet to force a run [11:54:47] and then I can re-enable it in einsteinium and start ircecho [11:58:01] sounds good! [11:58:07] already running cumin :D [11:59:35] (03PS16) 10Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) [12:00:01] ok, back to normal [12:00:20] !log re-enabled puppet on nitrogen/nihal/einsteinium, restarted ircecho [12:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:33] back to "puppet normal" :-) [12:02:42] ofc :D [12:02:46] nothing more [12:03:01] (03PS4) 10Muehlenhoff: Add explicit dependency on ghostscript [puppet] - 10https://gerrit.wikimedia.org/r/313963 [12:03:50] and I can confirm seeing updated catalogs in the slave, as expected [12:04:40] 06Operations, 07Puppet, 13Patch-For-Review: PuppetDB is auto-deactivating hosts - https://phabricator.wikimedia.org/T159163#3157065 (10Volans) 05Open>03Resolved [12:04:52] !log upgrade remaining ca-certificates from jessie point update [12:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:40] (03PS3) 10Giuseppe Lavagetto: Add task to restore the TTL of discovery entries to 5 minutes [switchdc] - 10https://gerrit.wikimedia.org/r/346311 [12:09:56] PROBLEM - puppet last run on restbase1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:10:01] (03PS17) 10Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) [12:11:35] 06Operations, 10media-storage: Swiftrepl was stuck in an infinite loop since days - https://phabricator.wikimedia.org/T162122#3157072 (10Volans) The first run of the `swiftrepl` has finally completed! It is now in the 2 hour sleep between runs, I'll check the next one completes without manual intevention. [12:19:22] (03CR) 10Hoo man: [C: 031] "I find the setting name rather weird, but that's ok for now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345387 (https://phabricator.wikimedia.org/T159828) (owner: 10Daniel Kinzler) [12:19:49] (03PS18) 10Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) [12:23:44] (03PS19) 10Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) [12:31:40] RECOVERY - MariaDB Slave Lag: s5 on db1070 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [12:32:55] marostegui, jynus, I guess was expected this one ^^^ [12:34:26] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:36:31] volans: yep :) [12:37:17] (03PS3) 10Elukey: Apply role memcached to the new mc1* hosts [puppet] - 10https://gerrit.wikimedia.org/r/337010 (https://phabricator.wikimedia.org/T137345) [12:37:22] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2054" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346526 (owner: 10Marostegui) [12:37:27] (03CR) 10Elukey: [C: 032] "No op checking https://puppet-compiler.wmflabs.org/6027/" [puppet] - 10https://gerrit.wikimedia.org/r/337010 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [12:37:57] RECOVERY - puppet last run on restbase1015 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [12:38:07] 06Operations, 10hardware-requests: EQIAD: 2 hardware access request for kubernetes-staging - https://phabricator.wikimedia.org/T162257#3157106 (10akosiaris) [12:38:26] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2054" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346526 (owner: 10Marostegui) [12:38:36] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2054" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346526 (owner: 10Marostegui) [12:38:38] 06Operations, 10hardware-requests: EQIAD: 2 hardware access request for kubernetes-staging - https://phabricator.wikimedia.org/T162257#3157106 (10akosiaris) [12:38:40] 06Operations, 05Goal, 07kubernetes: Design and implement a Kubernetes-based staging environment. (stretch) - https://phabricator.wikimedia.org/T162045#3157120 (10akosiaris) [12:40:10] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2054 - T160390 (duration: 00m 44s) [12:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:17] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [12:40:21] (03PS1) 10Marostegui: db-eqiad.php: Depool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346534 (https://phabricator.wikimedia.org/T160390) [12:41:38] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346534 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [12:42:40] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346534 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [12:42:53] (03CR) 10jenkins-bot: db-eqiad.php: Depool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346534 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [12:44:08] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2047 - T160390 (duration: 00m 41s) [12:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:22] !log Deploy schema change db2047 (s7) - T160390 [12:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:28] (03PS20) 10Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) [12:46:22] (03CR) 10Gehel: [C: 032] elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) (owner: 10Gehel) [12:49:56] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:49:56] PROBLEM - Check systemd state on relforge1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:50:00] (03PS1) 10Gehel: elasticsearch - ferm hosts need to be space separated, not coma separated [puppet] - 10https://gerrit.wikimedia.org/r/346537 [12:50:24] ^relforge alert is me, fix on the way [12:51:18] (03CR) 10Gehel: [C: 032] elasticsearch - ferm hosts need to be space separated, not coma separated [puppet] - 10https://gerrit.wikimedia.org/r/346537 (owner: 10Gehel) [12:54:57] PROBLEM - Check systemd state on relforge1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:57:37] !log reimage analytics1035 (journal node) to Debian Jessie [12:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:14] (03PS1) 10Gehel: elasticsearch - maintenance_hosts is actually already resolved to IPs [puppet] - 10https://gerrit.wikimedia.org/r/346538 [12:58:38] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3157141 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1035.eqiad.wmnet'] ``` The log can... [12:59:08] (03Abandoned) 10Giuseppe Lavagetto: cache::text: remove direct route to mediawiki from eqiad [puppet] - 10https://gerrit.wikimedia.org/r/346322 (owner: 10Giuseppe Lavagetto) [12:59:20] (03CR) 10Gehel: [C: 032] elasticsearch - maintenance_hosts is actually already resolved to IPs [puppet] - 10https://gerrit.wikimedia.org/r/346538 (owner: 10Gehel) [13:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170405T1300). [13:00:05] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:01:06] RECOVERY - Check systemd state on relforge1002 is OK: OK - running: The system is fully operational [13:01:06] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:01:23] o/ [13:01:37] (03PS2) 10Hashar: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346324 (https://phabricator.wikimedia.org/T162089) (owner: 10Urbanecm) [13:01:38] <_joe_> hashar: are you SWATTING? [13:01:49] _joe_: yes [13:01:49] 06Operations, 10ops-eqiad: rack and set up analyics1058-1069 - https://phabricator.wikimedia.org/T162216#3157143 (10Ottomata) [13:01:53] <_joe_> can I ask you to merge two patches of mine during this window? [13:01:54] unless there is something bad going on [13:01:56] RECOVERY - Check systemd state on relforge1001 is OK: OK - running: The system is fully operational [13:02:06] yeah totally [13:02:10] <_joe_> no, I just forgot to add myself to the calendar, sorry :P [13:02:15] let me push the simple throttle rule [13:02:22] no worries [13:02:25] monday is rather busy [13:02:26] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [13:02:40] but other days we usually have only 2-3 patches [13:02:56] PROBLEM - puppet last run on mw1287 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[ca-certificates] [13:03:02] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346324 (https://phabricator.wikimedia.org/T162089) (owner: 10Urbanecm) [13:03:40] _joe_: what are the patches? :} [13:04:14] (03Merged) 10jenkins-bot: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346324 (https://phabricator.wikimedia.org/T162089) (owner: 10Urbanecm) [13:04:27] (03CR) 10jenkins-bot: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346324 (https://phabricator.wikimedia.org/T162089) (owner: 10Urbanecm) [13:04:58] <_joe_> hashar: https://gerrit.wikimedia.org/r/#/c/316317/ and https://gerrit.wikimedia.org/r/#/c/345510/ [13:05:21] <_joe_> the first one is a bit more complex [13:05:53] !log hashar@tin Synchronized wmf-config/throttle.php: Add new throttle rule - T162089 (duration: 00m 40s) [13:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:00] T162089: Lift IP rate limit - Workshop - 2017-04-06 - https://phabricator.wikimedia.org/T162089 [13:06:03] 06Operations, 10ops-eqiad: rack and set up analyics1058-1069 - https://phabricator.wikimedia.org/T162216#3157157 (10Ottomata) RAID config should be identical to other nodes, e.g. analytics1057. I think /dev/sda is Hardware RAID 1 on the 2 2.5" flex bay drives. The rest 12 drives are JBOD, so you can leave th... [13:06:44] _joe_: guess we can do the switch of ores to discovery first [13:06:49] though I have no clue how to validate that one [13:06:51] <_joe_> hashar: nope [13:06:59] <_joe_> hashar: that basically includes the other one [13:07:03] ah [13:07:13] <_joe_> the dangerous part, that is [13:07:22] <_joe_> e.g. calling ores on its internal url [13:07:54] <_joe_> in fact, I just realized this is not needed atm [13:08:02] <_joe_> hashar: no need to merge this [13:08:10] <_joe_> I have to think it through a bit more [13:08:27] <_joe_> sorry, just hit me that this way we're bypassing the varnish cache [13:08:41] <_joe_> and that's not good [13:09:52] <_joe_> I'm clarifying that with brandon; also, this means that switching traffic == switching ores for mediawiki [13:09:56] <_joe_> so for now it's ok [13:10:12] <_joe_> hashar: so, thanks but I'll merge tomorrow in case [13:10:18] sure :-} [13:10:24] we can do it in our morning if you want [13:10:30] <_joe_> ok [13:10:38] <_joe_> that might be a good idea too [13:11:13] and if someone from traffic is needed, deploy anytime one of them show up [13:12:13] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3157170 (10Ottomata) > How's the process to decommission db1047 going? I guess ok! I think we should just dump all the user created databases to a file and archive it before... [13:14:32] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3157185 (10Marostegui) >>! In T156844#3157170, @Ottomata wrote: >> How's the process to decommission db1047 going? > > I guess ok! I think we should just dump all the user c... 
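The throttle.php sync above (T162089, lifting the account-creation IP rate limit for a 2017-04-06 workshop) adds an entry roughly of the following shape. The values below are hypothetical; the real rule carries the event's actual date range, IP range and target wiki:

```php
<?php
// Hypothetical example entry in the spirit of wmf-config/throttle.php:
// temporarily raise the account-creation limit for one IP range on one wiki.
$wmgThrottlingExceptions[] = [
    'from'   => '2017-04-06T08:00 +0:00',  // hypothetical start of the workshop
    'to'     => '2017-04-06T18:00 +0:00',  // hypothetical end
    'range'  => '198.51.100.0/24',         // documentation IP range, not the real one
    'dbname' => [ 'enwiki' ],              // hypothetical target wiki
    'value'  => 50,                        // allow up to 50 account creations from that range
];
```

Keeping each lift as a dated, self-expiring entry is what makes a quick SWAT deploy (and later cleanup) of these requests safe.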
[13:14:33] (03PS4) 10Ottomata: Improvements to eventlogging_sync.sh script [puppet] - 10https://gerrit.wikimedia.org/r/345646 (https://phabricator.wikimedia.org/T124307) [13:15:48] (03CR) 10Ottomata: [V: 032 C: 032] Improvements to eventlogging_sync.sh script [puppet] - 10https://gerrit.wikimedia.org/r/345646 (https://phabricator.wikimedia.org/T124307) (owner: 10Ottomata) [13:17:56] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [13:20:07] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:20:17] (03PS1) 10Hoo man: Temporarily enable change dispatch logging on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346540 (https://phabricator.wikimedia.org/T159828) [13:21:18] (03PS1) 10Ottomata: Properly default to master database name when slave database not given [puppet] - 10https://gerrit.wikimedia.org/r/346541 (https://phabricator.wikimedia.org/T124307) [13:21:45] (03CR) 10Ottomata: [V: 032 C: 032] Properly default to master database name when slave database not given [puppet] - 10https://gerrit.wikimedia.org/r/346541 (https://phabricator.wikimedia.org/T124307) (owner: 10Ottomata) [13:24:42] (03PS1) 10DCausse: [cirrus] Increase max field count for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346542 [13:24:55] (03PS1) 10Gehel: wdqs: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/346543 [13:25:18] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3157205 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1035.eqiad.wmnet'] ``` and were **ALL** successful. [13:25:28] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Increase max field count for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346542 (owner: 10DCausse) [13:25:31] (03CR) 10Gehel: [C: 04-1] "Waiting for full reimport of wdqs codfw cluster before merging this" [puppet] - 10https://gerrit.wikimedia.org/r/346543 (owner: 10Gehel) [13:26:24] (03PS2) 10Giuseppe Lavagetto: cache::text: switch all mediawiki to codfw [puppet] - 10https://gerrit.wikimedia.org/r/346320 [13:26:26] (03PS2) 10Giuseppe Lavagetto: discovery::app_routes: switch mediawiki to codfw [puppet] - 10https://gerrit.wikimedia.org/r/346321 [13:26:43] (03PS2) 10DCausse: [cirrus] Increase max field count for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346542 [13:28:04] ORES [13:28:59] any idea why the RC feed is doing this for en.wp ? ...performing the action "edit" on [[Madelaine Petsch]]. Actions taken: Interdire la modification ([[Special:AbuseLog/18211089|details]]) [13:29:43] the attempted edit has a little French, as it's a translation, would that explain the filter tripping with the summary "Interdire la modification" ? [13:30:16] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [13:31:14] (03PS1) 10Jcrespo: Make mediawiki-eqiad dc read-only before switchover to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346544 (https://phabricator.wikimedia.org/T154658) [13:31:16] RECOVERY - puppet last run on mw1287 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [13:31:56] (03CR) 10Jcrespo: [C: 04-2] "Do not deploy until April 19th." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/346544 (https://phabricator.wikimedia.org/T154658) (owner: 10Jcrespo) [13:32:07] (03PS1) 10Hoo man: Temporarily disable the change dispatch cron for testwikidata [puppet] - 10https://gerrit.wikimedia.org/r/346545 (https://phabricator.wikimedia.org/T159828) [13:32:11] tiddlywink: arent you using french as the interface language? [13:32:28] ah no [13:32:29] nope [13:32:31] https://en.wikipedia.org/wiki/Special:AbuseLog/18211089?uselang=fr vs https://en.wikipedia.org/wiki/Special:AbuseLog/18211089?uselang=en [13:32:41] at least those have the proper text [13:32:49] it's the IRC feed, not the on-wiki logs [13:33:32] also get Actions taken: Avertir l’utilisateur ([[Special:AbuseLog/18211001|details]]) for the prior attempt [13:34:16] (03CR) 10Volans: "Are we keeping the "3 minutes"?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346544 (https://phabricator.wikimedia.org/T154658) (owner: 10Jcrespo) [13:35:40] tiddlywink: maybe the irc log message got formatted based on the user language [13:35:43] instead of the project lnaugage [13:36:21] that would make sense [13:37:22] (03PS1) 10Jcrespo: Make mediawiki codfw dc read-write after switchover to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346547 (https://phabricator.wikimedia.org/T154658) [13:38:21] (03CR) 10Jcrespo: [C: 04-2] "Do not deploy until April 19th." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346547 (https://phabricator.wikimedia.org/T154658) (owner: 10Jcrespo) [13:39:21] <_joe_> jynus: I already have a cumulative patch for all that, given we can sync one file at a time [13:39:40] <_joe_> https://gerrit.wikimedia.org/r/#/c/346251/ [13:39:40] oh [13:39:47] an if you do distinct ones, please follow the order, so they can be merged without rebase [13:39:47] I didn't know that [13:39:49] <_joe_> but don't throw yours away [13:40:05] I thought you needed help with that [13:40:11] the commonsettings one goes in the middle of the other 2 [13:40:16] yes [13:40:21] I was about to that that ones now [13:40:27] CommonSettings.php [13:40:54] but has to be in the middle, while you've already stacked the other 2 [13:40:56] it is literally the same thing [13:41:12] it doesn't matter- all will be premerged in advance [13:41:24] just synced in order [13:41:27] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:41:48] _joe_ but don't throw yours away- no reason to not, why? 
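For context on the read-only patches above (r/346544 and r/346547, held until the April 19th switchover): the simplest way MediaWiki is put into read-only mode is via $wgReadOnly. A minimal illustrative sketch, not the actual patch content; wmf-config also carries a per-section variant ('readOnlyBySection' in the db-*.php files).

```php
// Illustrative only — not the content of r/346544.
// Any non-null $wgReadOnly value blocks writes and is shown to users who try to edit.
$wgReadOnly = 'Read-only during the eqiad to codfw datacenter switchover; editing will resume shortly.';
```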
[13:42:10] <_joe_> jynus: there was some debate over merging a single patch and multiple ones, AIUI [13:42:20] and to pre-merge or not ;) [13:42:30] I say premerge [13:42:39] <_joe_> I agree with you jaime [13:42:42] but have the individual [13:42:48] reverts [13:42:55] which is the only reason to have them separate [13:43:13] <_joe_> if we need to revert in a hurry during the switchover, the best way is [13:43:13] this is the largest pain in time, we want to only wait 2-3 minutes [13:43:21] <_joe_> git checkout HEAD~1 -- file [13:43:24] yep [13:43:25] also I need to know if we stay with "3" minutes or use the "15" because the check will look for it [13:43:30] <_joe_> scap sync-file file [13:43:39] I would go with 3 as that would be my aim [13:43:53] <_joe_> jynus: no way we can pull it off [13:44:08] <_joe_> just the codfw warmup will take us more than that [13:44:09] IF the scripts works well, yes [13:44:12] <_joe_> or around that time [13:44:16] lately scap takes 1 minutes with the check [13:44:23] less without it [13:44:25] jynus: we still have puppet commits and rns [13:44:31] *runs [13:44:31] <_joe_> then we have to manually merge a puppet patch, and some other things [13:44:38] <_joe_> my personal goal is 10 minutes [13:44:49] _joe_, that is actually the deparment's goal [13:44:50] <_joe_> if we can make it, I'll be impressed [13:45:24] can we do a proper production test of the script at some point- that would tell us? [13:45:29] <_joe_> jynus: well, if we had etcd in mediawiki and etcd-controlled traffic switchovers, that would be easy [13:45:36] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 629278 [13:45:42] <_joe_> jynus: we can test some of the steps, sure [13:45:46] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 675275 [13:45:48] you can get scap sync-file to skip the canaries check entirely : scap sync-file --force [13:45:57] would save a few more seconds [13:45:59] <_joe_> hashar: ok good to know [13:46:05] like testing the scap with a wrong commit [13:46:07] <_joe_> volans: ^^ can you add that? [13:46:11] yes I know, but this will skip the linting too right? [13:46:24] volans, it will skipp waiting for canary traffic [13:46:24] <_joe_> volans: linting? [13:46:29] linting is done my CI [13:46:32] <_joe_> volans: scap does linting? 
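Putting _joe_'s revert recipe together with hashar's --force tip, an emergency single-file rollback on the deployment host would look roughly like this (the file path and log message are placeholders):

```bash
# On the deployment host, from the staging copy of the config repo.
cd /srv/mediawiki-staging
# Restore the previous version of just this file from the last commit:
git checkout HEAD~1 -- wmf-config/CommonSettings.php
# Push it out, skipping the canary/logstash wait to save time:
scap sync-file --force wmf-config/CommonSettings.php 'Emergency revert during switchover'
```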
[13:46:40] the canary query logstash, sleep(20), query logstash [13:46:40] something like that [13:46:49] <_joe_> yeah let's not do that here [13:46:53] that is not needed here [13:46:56] <_joe_> logstash will also be a shitshow [13:46:59] <_joe_> :P [13:47:03] I told you that, volans about --force [13:47:17] yeah, it could fail due to read only mode [13:47:20] and be ok [13:47:44] what we need is produciton testing + thoroug manual review beforhand [13:47:57] and one can check with tyler/chad , but maybe there is a way to first deploy the files [13:48:01] which takes a while [13:48:16] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [13:48:17] then as a second step do the switch from a version to another which would be quite faster [13:48:39] that would be nice if there are scap tricks, as it will be automatized [13:48:48] not something we can pull off on an emergency [13:48:52] but good to know [13:49:45] <_joe_> yeah I wouldn't focus on that [13:49:59] I've seem on the steps there is a mw_primary change, but I said to do that async- only monitoring uses it to know if to page or not [13:50:26] <_joe_> jynus: yeah that change will be puppet-merged toghether with a varnish one [13:50:39] <_joe_> that's the only merge (and manual step) needed during the switchover [13:51:00] hashar: it's scap sync-file --force OR scap --force sync-file ? [13:51:08] no need to run it for me, do the varnish one if you need it [13:51:49] scap sync-file --force [13:52:00] you can give it a try on beta cluster on deployment-tin [13:53:02] and if you pass it "--beta-only-change" [13:53:14] it does not touch InitialiseSettings.php so the conf is stall [13:53:24] so in theory a hack would be: scap sync-file --beta-only-change foobar.php [13:53:44] hashar, zeljkof, was my patch deployed? Sorry for my repeating lateness... [13:53:46] but no that is terrible idea. Forget me [13:53:47] Urbanecm: yes [13:53:54] Thank you! [13:54:15] Urbanecm: the throttle changes I don't mind deploying them without you being around. They are super easy to check [13:54:32] (03PS1) 10Volans: Scap: use --force to skip canaries checks [switchdc] - 10https://gerrit.wikimedia.org/r/346549 (https://phabricator.wikimedia.org/T160178) [13:55:00] if you're worried about a temporary logstash explosion then --force will not run the logstash checks, and yeah beta-only-change will leave all the appservers alone so they won't re-read initialisesettings.php [13:55:01] hashar, does it mean if I has only throttle changes I can just schedule them? [13:56:04] (03CR) 10Hashar: [C: 031] ""scap sync-file --force" would skip the sequence of: logstash, sleep(20, logstash check. So that should speed up the overall runtime." [switchdc] - 10https://gerrit.wikimedia.org/r/346549 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:56:25] ^is that right? doesn't that need to change the pwd? [13:56:26] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:57:03] jynus: AFAIK yes, not needed anymore, and tested few days ago [13:57:10] and that code is probably all wrong [13:57:11] ok, cool [13:57:12] (03PS4) 10Muehlenhoff: Disable wireshark-common/install-setuid to avoid debconf prompt [puppet] - 10https://gerrit.wikimedia.org/r/346162 [13:57:17] I didn't know [13:57:29] hashar: thcipriani can you confirm? 
[13:57:52] what does : remote.select('R:Class = Deployment::Rsync and R:Class%cron_ensure = absent').sync( foo ) d does? [13:58:02] is that selecting a bunch of host then run "sync" on them ? [13:58:10] selects the deployment host [13:58:13] oh [13:58:23] (03CR) 10jerkins-bot: [V: 04-1] Disable wireshark-common/install-setuid to avoid debconf prompt [puppet] - 10https://gerrit.wikimedia.org/r/346162 (owner: 10Muehlenhoff) [13:58:29] is ugly but is the way that from how it's puppetized can be selected, according to joe ;) [13:58:54] 06Operations, 10Datasets-General-or-Unknown, 07HHVM, 10Wikidata: Enable GC for HHVM CLI (at least for dump runners) - https://phabricator.wikimedia.org/T162245#3157282 (10MoritzMuehlenhoff) p:05Triage>03Normal [13:58:58] 06Operations, 10hardware-requests: EQIAD: 2 hardware access request for kubernetes-staging - https://phabricator.wikimedia.org/T162257#3157284 (10MoritzMuehlenhoff) p:05Triage>03Normal [13:59:05] hiera is the canonical place for the deployment server : deployment_server: tin.eqiad.wmnet [13:59:06] why is cron absemt from the primary dc? [13:59:18] or is it the secondary on purpose? [13:59:33] the secondary has the rsync cron I think, need to recheck though to be sure [13:59:49] I din't do that part :D [14:00:01] and for scripts usually one can just use the DNS entry deployment.eqiad.wmnet [14:00:03] I was mostly asking joe :-) [14:00:10] which should point to the right primary deployment server [14:00:24] 06Operations, 10Datasets-General-or-Unknown, 07HHVM, 10Wikidata, 15User-Elukey: Enable GC for HHVM CLI (at least for dump runners) - https://phabricator.wikimedia.org/T162245#3157285 (10elukey) [14:00:52] I am going to abandon my patches [14:01:02] hashar: can you confirm that scap don't need to change directory to be run right? 
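For readers wondering what that remote.select() query resolves to: it is a PuppetDB-style host query that, per the discussion, picks the active deployment server (the host where the rsync cron is absent). A rough hand-run equivalent, assuming the cumin CLI with the PuppetDB backend is available on the cluster-management host; the uptime command is just a placeholder:

```bash
# Same query by hand (assumed cumin CLI; the command run is a placeholder):
sudo cumin 'R:Class = Deployment::Rsync and R:Class%cron_ensure = absent' 'uptime'

# For ad-hoc scripts, the active deployment server can also be resolved via DNS, as noted above:
dig +short deployment.eqiad.wmnet
```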
[14:01:09] I think joe is better, and 100% clone of what I was going to do [14:01:50] volans: I see a patch by Chad that mentionned an issue when being run out of /srv/mediawiki-staging [14:02:01] so probably safer to change the cwd [14:02:20] (03CR) 10Thcipriani: [C: 031] Scap: use --force to skip canaries checks [switchdc] - 10https://gerrit.wikimedia.org/r/346549 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:02:48] and I dont think you need to sudo -u os.getlogin() [14:03:00] (03CR) 10Jcrespo: [C: 031] Switch master DC from eqiad to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346251 (owner: 10Giuseppe Lavagetto) [14:03:00] anyway, the scap sync-file --force looks good [14:03:18] (as long as filename does not have a space in it which it should not) [14:04:19] (03Abandoned) 10Jcrespo: Make mediawiki codfw dc read-write after switchover to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346547 (https://phabricator.wikimedia.org/T154658) (owner: 10Jcrespo) [14:04:21] yes, we already have 3 layers of quotes, would like to avoid the 4th ;) [14:04:52] (03Abandoned) 10Jcrespo: Make mediawiki-eqiad dc read-only before switchover to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346544 (https://phabricator.wikimedia.org/T154658) (owner: 10Jcrespo) [14:05:19] (03CR) 10Ema: [C: 031] Disable wireshark-common/install-setuid to avoid debconf prompt [puppet] - 10https://gerrit.wikimedia.org/r/346162 (owner: 10Muehlenhoff) [14:08:22] (03PS1) 10Jgreen: temporarily move fundraisingdbread.wmnet to db1025 for db maintenance [dns] - 10https://gerrit.wikimedia.org/r/346550 [14:09:36] (03CR) 10Volans: [C: 032] Scap: use --force to skip canaries checks [switchdc] - 10https://gerrit.wikimedia.org/r/346549 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:09:48] (03CR) 10Jgreen: [C: 032] temporarily move fundraisingdbread.wmnet to db1025 for db maintenance [dns] - 10https://gerrit.wikimedia.org/r/346550 (owner: 10Jgreen) [14:11:56] PROBLEM - puppet last run on wtp1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:14:07] hoo: o/ - I can try to help for T162245, so things will speed up a bit [14:14:07] T162245: Enable GC for HHVM CLI (at least for dump runners) - https://phabricator.wikimedia.org/T162245 [14:15:31] hoo: if I got it correctly, it should be a matter of adding hhvm.enable_gc=true to the /etc/hhvm/php.ini of snapshot100* right? [14:18:26] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [14:22:36] PROBLEM - puppet last run on wtp1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:23:26] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [14:27:30] https://upload.wikimedia.org/wikipedia/commons/thumb/e/e3/Incubator-logo.svg/13px-Incubator-logo.svg.png returns "Content Encoding Error" [14:27:44] (03PS2) 10Andrew Bogott: Keystonehooks: Create and delete sudoer rules in ldap [puppet] - 10https://gerrit.wikimedia.org/r/346489 (https://phabricator.wikimedia.org/T150091) [14:28:16] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:29:17] ema ^^ [14:30:38] paladox: when the message is this one and there aren't a ton of them, it's not an issue, transient known failure (although a bit noisy) [14:30:44] (03PS1) 10Elukey: Enable hhvm GC for CLI on snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/346554 (https://phabricator.wikimedia.org/T162245) [14:31:53] oh [14:31:54] sorry [14:32:14] nw, just FYI ;) [14:32:22] ok [14:33:56] PROBLEM - puppet last run on wtp1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:34:13] (03PS4) 10Jcrespo: [WIP]Remove $::mw_primary variable from puppet [puppet] - 10https://gerrit.wikimedia.org/r/345346 (https://phabricator.wikimedia.org/T156924) [14:37:24] 06Operations, 10ops-codfw, 10DBA: codfw rack/setup first 10 DB servers - https://phabricator.wikimedia.org/T162159#3157431 (10Papaul) [14:38:32] (03CR) 10Hoo man: "Why not use role/common/snapshot/dumper.yaml?" [puppet] - 10https://gerrit.wikimedia.org/r/346554 (https://phabricator.wikimedia.org/T162245) (owner: 10Elukey) [14:38:45] thanks for looking into they, elukey! [14:38:56] RECOVERY - puppet last run on wtp1018 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [14:39:36] hoo: thanks for the review! Will look into that :) [14:39:46] hoo: I was looking for that file, much better [14:40:46] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:41:24] (03PS5) 10Jcrespo: [WIP]Remove $::mw_primary variable from puppet [puppet] - 10https://gerrit.wikimedia.org/r/345346 (https://phabricator.wikimedia.org/T156924) [14:42:25] :) [14:42:52] (03PS1) 10Jgreen: fix reversed frdb1002/frdev1001 IPs, re-pool frdb1001 for now [dns] - 10https://gerrit.wikimedia.org/r/346556 [14:44:08] (03CR) 10Jgreen: [C: 032] fix reversed frdb1002/frdev1001 IPs, re-pool frdb1001 for now [dns] - 10https://gerrit.wikimedia.org/r/346556 (owner: 10Jgreen) [14:45:26] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [14:45:46] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:49:27] (03CR) 10Muehlenhoff: "Did you test these options on a 3.12 installation? 
According to http://hhvm.com/blog/2017/02/15/hhvm-3-18.html they were introduced with 3" [puppet] - 10https://gerrit.wikimedia.org/r/346554 (https://phabricator.wikimedia.org/T162245) (owner: 10Elukey) [14:49:56] (03PS2) 10Elukey: Enable hhvm GC for CLI on snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/346554 (https://phabricator.wikimedia.org/T162245) [14:50:26] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [14:50:36] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [14:53:22] (03PS2) 10Matthias Mullie: Enable 3d extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur) [14:53:31] (03CR) 10Matthias Mullie: Enable 3d extension on beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur) [14:53:56] (03PS3) 10Matthias Mullie: Enable 3d extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur) [14:55:22] (03CR) 10jerkins-bot: [V: 04-1] Enable 3d extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur) [14:56:03] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3157520 (10dr0ptp4kt) @Dzahn does Google Search Console for noc@ show that the site verification code matches what's already in DNS like you communicated here? I... [14:56:22] hoo: Moritz is right, there is no compatibility section in the hhvm docs so I got fooled, not sure if the GC options that I put are available for 3.12 [14:56:44] :/ [14:57:06] hoo: maybe we could test them in deploymnent-prep? [14:57:16] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [14:57:34] I suppose we could [14:57:41] Trusty is not going to get 3.18, is it? [14:58:07] hoo: probably not.. and we are still testing 3.18 because if shows some issues [14:58:12] hoo: maybe http://php.net/manual/en/info.configuration.php#ini.zend.enable-gc could wokr? [14:58:15] *work [14:58:34] hoo: no, let's better migrate the snapshot* hosts to jessie [14:58:35] There's even a user space function [14:59:31] elukey: HHVM upstream fixed the deadlock, currently building the package, then it's hopefully ready [14:59:48] I guess we could also hack this via mediawiki-config [15:00:00] (03CR) 10Reedy: [C: 04-1] Enable 3d extension on beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur) [15:00:01] but I would still like for the settings to be set via hiera [15:00:34] hoo: what would be the best way to test the hhvm settings for you? 
I am a bit ignorant about snapshot* [15:01:56] RECOVERY - puppet last run on wtp1011 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [15:03:28] (03CR) 10Reedy: [C: 04-1] Enable 3d extension on beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur) [15:03:54] elukey: Well, whatever works for me [15:03:58] for example mwdebug1001 [15:05:30] hoo: no no I mean where I can tweak hhvm settings (possibly not production) and let you check the GC settings [15:05:51] 06Operations, 10ops-codfw, 10DBA: codfw rack/setup first 10 DB servers - https://phabricator.wikimedia.org/T162159#3157547 (10Marostegui) ` install_server module update (mac address and partitioning info,) Please provide partition schema` Please create a RAID10 with the following options (https://wikitech.w... [15:06:12] Not sure we have enough data in beta to even see the memory leak, but it could be [15:06:37] that's why I suggested changing the settings on one of the mwdebugs and then test a dumper [15:06:43] well "memory leak" [15:06:59] ahhh we can do that in there too? If so mwdebug1001 should be good [15:07:04] Like do an actual partial dump of wikidata to /dev/null [15:07:30] can I ping you in ~1 hour after my meetings? [15:07:41] mwdebug are jessie, though [15:07:44] sure [15:15:13] hoo: in the meantime, can you test mwdebug1001 and see if you can repro the leak? [15:15:53] Is it live therE? [15:16:14] I mean, did you change the settings there? [15:16:23] nono [15:16:27] still not [15:16:39] but I want to make sure that we can repro before changing the settings [15:16:48] I tested this on mwdebug1002 earlier today and it indeed started to blow up [15:16:59] super [15:17:02] (03PS1) 10Jcrespo: Kill long running queries longer with shorter terms: [software] - 10https://gerrit.wikimedia.org/r/346559 (https://phabricator.wikimedia.org/T160984) [15:18:54] (03PS2) 10Jcrespo: Kill long running queries with stricter conditions [software] - 10https://gerrit.wikimedia.org/r/346559 (https://phabricator.wikimedia.org/T160984) [15:19:36] (03CR) 10Marostegui: [C: 031] Kill long running queries with stricter conditions [software] - 10https://gerrit.wikimedia.org/r/346559 (https://phabricator.wikimedia.org/T160984) (owner: 10Jcrespo) [15:20:04] !log playing with hhvm settings on mwdebug1002 [15:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:15] (03PS4) 10Matthias Mullie: Enable 3d extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur) [15:20:19] (03CR) 10Jcrespo: "We are not going to just deploy this, it will need a very slow and progressive deployment." [software] - 10https://gerrit.wikimedia.org/r/346559 (https://phabricator.wikimedia.org/T160984) (owner: 10Jcrespo) [15:23:09] hoo: I added the enable_gc options, but those probably will fail.. can we make a test? [15:23:30] Sure, I'll start a dumper [15:24:02] running on mwdebug1001, I'm monitoring the memory use [15:25:03] hoo: also do you have a quick way to check var_dump(gc_enabled()) ? 
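For reference, the setting elukey mentions testing above is a one-line ini change on the snapshot hosts. Whether it has any effect depends on the HHVM version (per the later discussion, the GC only became reliable around 3.18):

```ini
; /etc/hhvm/php.ini — the option under discussion (effectiveness depends on the HHVM version)
hhvm.enable_gc = true
```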
[15:25:31] (as you may have guessed I am a total newbie with php) [15:25:34] > var_dump(gc_enabled()); [15:25:35] bool(false) [15:26:07] dumper memory usage was also growing, killed it now [15:28:50] hoo: mmm I tried and then hhvm filename.php, I get bool true [15:29:14] $ sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php eval.php --wiki wikidatawiki [15:29:40] I used that [15:29:51] (03CR) 10Jcrespo: "This may need some extra refactoring- servers are never going to be runing more than 10 queries at the same time due to the queuing system" [software] - 10https://gerrit.wikimedia.org/r/346559 (https://phabricator.wikimedia.org/T160984) (owner: 10Jcrespo) [15:30:01] elukey@mwdebug1002:~$ sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php eval.php --wiki wikidatawiki [15:30:02] (can also be used with php5, in that case it has GC) [15:30:04] > var_dump(gc_enabled()); [15:30:07] bool(true) [15:30:26] argh you tried mwdebug1001 probably [15:30:30] yeah [15:30:47] sorry I saw "I tested this on mwdebug1002" and my brain did ssh mwdebug1002 [15:32:12] 06Operations, 10DBA, 10Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#3157631 (10matmarex) a:05matmarex>03None There is nothing else I can do myself to resolve this. I do not have the access to run the two queries I pos... [15:33:31] hoo: let's try on mwdebug1002 [15:33:37] ok [15:34:08] there it's true [15:36:46] (03PS3) 10Jcrespo: Kill long running queries with stricter conditions [software] - 10https://gerrit.wikimedia.org/r/346559 (https://phabricator.wikimedia.org/T160984) [15:38:52] 06Operations, 10DBA, 10Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#2953992 (10jcrespo) This was classified as a low priority task. It will be eventually done, do not worry, it is not forgotten, but at the cost of other,... [15:41:10] 06Operations, 10DBA, 10Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#3157662 (10Marostegui) For the record, I checked the "consistency" of that row across s4 (commons) and s1 (enwiki), and to make sure at least it is prese... [15:41:56] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 263.89 seconds [15:42:31] (03PS1) 10Marostegui: s1.hosts: Remove db1057 [software] - 10https://gerrit.wikimedia.org/r/346563 (https://phabricator.wikimedia.org/T162135) [15:42:48] (03CR) 10Jcrespo: [C: 031] s1.hosts: Remove db1057 [software] - 10https://gerrit.wikimedia.org/r/346563 (https://phabricator.wikimedia.org/T162135) (owner: 10Marostegui) [15:43:48] (03CR) 10Marostegui: [C: 032] s1.hosts: Remove db1057 [software] - 10https://gerrit.wikimedia.org/r/346563 (https://phabricator.wikimedia.org/T162135) (owner: 10Marostegui) [15:44:50] (03Merged) 10jenkins-bot: s1.hosts: Remove db1057 [software] - 10https://gerrit.wikimedia.org/r/346563 (https://phabricator.wikimedia.org/T162135) (owner: 10Marostegui) [15:51:51] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3157734 (10Dzahn) @dr0ptp4kt I don't really see the code but it has a green check box next to "DNS". I gave full access to abaso@wikimedia.org for https://media... 
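For context on T161343: Search Console site verification via DNS is done with a TXT record on the zone apex. A sketch of what such an entry typically looks like in a zone file; the token is a placeholder, the real string comes from Search Console:

```
; zone-file sketch (placeholder token)
mediawiki.org.   3600  IN  TXT  "google-site-verification=EXAMPLE_TOKEN_FROM_SEARCH_CONSOLE"
```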
[15:52:28] 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3157735 (10ayounsi) pwstore works fine! We should be good to close this task. [15:53:00] 06Operations: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401#3157737 (10Halfak) [16:02:26] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [16:03:36] PROBLEM - puppet last run on mw1186 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:04:35] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3157769 (10dr0ptp4kt) Bummer, it looks like that didn't do it. Would you please check the noc@ email to see if there's a site verification request that you can a... [16:07:26] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [16:07:41] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3157775 (10Dzahn) I don't really know how to check noc@wikimedia.org email. If i try to login with the credentials i have on mail, i get the " Add Gmail to your... [16:08:15] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3157776 (10Deskana) >>! In T161343#3157734, @Dzahn wrote: > Also note an existing owner you share full access with is "searchteam+gwt@wikimedia.org". Do you know... [16:08:46] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 612730 [16:09:50] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3157778 (10Dzahn) checking the "messages" in Search Console itself, the last one is from 4/1/17. [16:10:46] PROBLEM - jenkins_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [16:10:52] 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3157779 (10ayounsi) Juniper case 2017-0405-0571 opened. [16:13:15] ACKNOWLEDGEMENT - jenkins_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war amusso Jenkins is not active on contint2001 yet. [16:14:26] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [16:14:58] hoo|away: did you manage to run the job on mwdebug1002? [16:15:00] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3157854 (10Dzahn) @dr0ptp4kt I also added you http://mediawiki.org, https://www.mediawiki.org and https://www.mediawiki.org with Full access.. any difference... [16:18:06] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. 
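For reference, the keyholder alert above is cleared by re-arming the shared SSH agent on the deployment host, roughly as follows (the status subcommand is an assumption; arming prompts for the key passphrases):

```bash
# On mira, after the reboot:
sudo keyholder status   # assumed subcommand: shows whether keys are loaded into the shared agent
sudo keyholder arm      # re-adds the keys, prompting for their passphrases (as the alert suggests)
```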
[16:18:46] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 0 [16:19:26] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [16:20:08] (03PS2) 10Subramanya Sastry: Delink new parsoid-vd test runs from updates to parsoid git repo [puppet] - 10https://gerrit.wikimedia.org/r/346196 [16:20:40] (03PS2) 10Dzahn: Remove Apache across the tree [puppet] - 10https://gerrit.wikimedia.org/r/346128 (owner: 10Faidon Liambotis) [16:21:30] _joe_ mutante volans can one of you review and +2 https://gerrit.wikimedia.org/r/#/c/346196/ ... thanks. [16:23:20] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3157874 (10Tbayer) >>! In T156844#3157170, @Ottomata wrote: >> How's the process to decommission db1047 going? > > I guess ok! I think we should just dump all the user creat... [16:26:19] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3157875 (10Ottomata) > I thought the plan was to import them (in particular the "staging" database) to dbstore1002, so that they can be queried there as before? Ah sure we can... [16:27:58] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3157878 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1028.eqiad.wmnet'] ``` The log can b... [16:32:36] RECOVERY - puppet last run on mw1186 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [16:36:26] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [16:38:21] elukey: Didn't yet try, will do immediately [16:40:39] super thanks! [16:41:26] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [16:41:31] Not working memory usage is growing steadily [16:42:16] well let's wait a bit [16:44:24] (03CR) 10Faidon Liambotis: "I think we can silence stderr though, it's getting a little bit too spammy." [puppet] - 10https://gerrit.wikimedia.org/r/346116 (owner: 10Alexandros Kosiaris) [16:44:58] I'm at 12% of the 4gb now… with GC enabled I was at maybe 4 [16:45:06] and still growing [16:45:33] hoo: but you were actively calling gc_collect_cycles right? [16:45:58] Yeah, I'm not sure I tried calling the function to enable it [16:47:25] I am pretty sure that hhvm.enable_gc is the zend circular ref collector, so 3.18 might carry a better GC alg [16:47:29] I'm not sure how hhvm's GC works, but php's kicks in rather early [16:47:48] (per default at least) [16:51:00] hoo: so you're saying that with php5 the GC kicked in earlier? [16:51:24] Yeah, definitely [16:51:52] but with or without gc_collect_cycles ? [16:52:56] PROBLEM - puppet last run on analytics1032 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:53:26] elukey: Without [16:53:26] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [16:53:34] I only tried collect cycles on hhvm [16:53:51] ah snap [16:53:52] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3157937 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1028.eqiad.wmnet'] ``` and were **ALL** successful. [16:53:55] this is weird [16:54:49] hoo: let's try zend.enable_gc, maybe it is different [16:54:52] changing the config [16:55:46] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 2207 [16:56:50] Ok [16:57:00] the script is still accumulating memory [16:57:07] hoo: can you stop the jobs? [16:57:19] (on mwdebug) [16:57:50] !log rearmed keyholder on mira after reboot [16:57:53] done [16:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:17] 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3157942 (10MoritzMuehlenhoff) 05Open>03Resolved [16:58:44] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. [17:05:34] hoo, moritzm - I am chatting with people in #hhvm and they said that the options mentioned in the docs were already present before 3.12 but not really reliable/working [17:05:40] until 3.18 [17:05:50] (03PS1) 10BBlack: cache_misc: noc.wm.o active/active [puppet] - 10https://gerrit.wikimedia.org/r/346572 [17:06:03] so to solve this problem we'd probably need to wait until the 3.18 package is ready to be used [17:06:24] yeah, sounds like it. otherwise they wouldn't have mentioned it in the release notes I guess [17:07:03] I see [17:07:05] BUT it is weird that returns true on hhvm [17:07:13] not sure what it does in the background [17:07:37] (03PS3) 10Dzahn: Remove Apache across the tree [puppet] - 10https://gerrit.wikimedia.org/r/346128 (owner: 10Faidon Liambotis) [17:07:38] hoo: you said though that calling collect cycles was working [17:07:52] but only on 3.18, didn't try it on 3.12 [17:07:56] so either we do some hack with 3.12 to call collect cycles periodically or we wait for 3.18 [17:08:00] ahhhh okok [17:08:01] though it might just work there [17:08:24] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [17:08:52] hoo: from what the hhvm people are saying probably not, only from 3.15 onwards [17:09:25] Shoot [17:09:31] How did this work a couple of weeks ago [17:09:39] How did this ever work? [17:09:45] (03Abandoned) 10Elukey: Enable hhvm GC for CLI on snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/346554 (https://phabricator.wikimedia.org/T162245) (owner: 10Elukey) [17:10:08] (03PS1) 10BBlack: cache_misc: config-master.wm.o active/active [puppet] - 10https://gerrit.wikimedia.org/r/346573 [17:10:54] PROBLEM - puppet last run on dbproxy1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:10:59] hoo: with "this" you mean the wikidata script? 
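A minimal sketch of the "call the collector periodically on 3.12" workaround mentioned above — not the actual dump maintenance script; getNextBatch() and processBatch() are placeholders:

```php
<?php
// Hypothetical long-running loop that forces a cycle-collection pass every N items.
gc_enable(); // harmless if the runtime ignores it

$processed = 0;
while ( $batch = getNextBatch() ) {   // placeholder: iterator over dump work
	processBatch( $batch );           // placeholder: one unit of dump work
	$processed += count( $batch );
	if ( $processed % 10000 === 0 ) {
		gc_collect_cycles();          // explicit cycle-collection pass
	}
}
```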
[17:11:06] yes, the dumpers [17:11:12] they are on hhvm for quite a while now [17:11:17] and only suddenly it blew up [17:12:19] no idea :( [17:12:32] (03CR) 10BBlack: wdqs: active/active public interface (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346543 (owner: 10Gehel) [17:12:34] Same here :/ [17:12:50] Nothing changed [17:14:51] (03CR) 10Gehel: [C: 04-1] wdqs: active/active public interface (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346543 (owner: 10Gehel) [17:14:54] 06Operations, 10Datasets-General-or-Unknown, 07HHVM, 10Wikidata, and 2 others: Enable GC for HHVM CLI (at least for dump runners) - https://phabricator.wikimedia.org/T162245#3156596 (10elukey) Tested with @hoo the settings outlined in https://docs.hhvm.com/hhvm/configuration/INI-settings on mwdebug1002. A... [17:15:08] updated the task [17:15:54] elukey: Did you undo your changes on mwdebug1002, yet? [17:16:04] > var_dump(gc_enabled()); -> bool(false) [17:16:06] (03PS2) 10Gehel: wdqs: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/346543 (https://phabricator.wikimedia.org/T162111) [17:16:09] (03CR) 10Dzahn: [C: 032] "double-checked one more time, doing it now" [puppet] - 10https://gerrit.wikimedia.org/r/346128 (owner: 10Faidon Liambotis) [17:17:08] 06Operations, 10Datasets-General-or-Unknown, 07HHVM, 10Wikidata, and 2 others: Enable GC for HHVM CLI (at least for dump runners) - https://phabricator.wikimedia.org/T162245#3156596 (10MoritzMuehlenhoff) Note that also requires a migration of the snapshot hosts to jessie (which was blocked so far by a bug... [17:18:48] (03CR) 10BBlack: "It's tricky to quantify. For most users, most of the time, they'll consistently be routed to one side or the other. For lots of users, r" [puppet] - 10https://gerrit.wikimedia.org/r/346543 (https://phabricator.wikimedia.org/T162111) (owner: 10Gehel) [17:18:54] (03PS1) 10Andrew Bogott: labs_bootstrapvz: Don't include mlocate or ecryptfs-utils [puppet] - 10https://gerrit.wikimedia.org/r/346575 [17:19:05] hoo: yep [17:19:21] a lot of misc Apaches are reloading, removing the <2.4 config snippets.. i am watching it [17:19:43] elukey: Ok, so that explains that [17:20:01] hoo: I tried zend.enable_gc but of course it doesn't work :) [17:20:07] the change is for gerrit.wikimedia.org which i just tested and is still working. So gerrit should be unaffected :) [17:20:46] elukey: :S [17:20:55] One last thing, I can try the user space function [17:21:01] (03CR) 10Andrew Bogott: [C: 032] labs_bootstrapvz: Don't include mlocate or ecryptfs-utils [puppet] - 10https://gerrit.wikimedia.org/r/346575 (owner: 10Andrew Bogott) [17:21:22] yes, it's all fine, no problems [17:21:38] 06Operations, 10ops-codfw, 10DBA: setup tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3158019 (10RobH) [17:21:55] RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [17:22:51] (03PS4) 10Paladox: Gerrit: Disable md5 in ssh [puppet] - 10https://gerrit.wikimedia.org/r/346180 [17:23:04] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3140543 (10RobH) 05Open>03stalled p:05Triage>03Normal I'm setting this to stalled and normal priority, as this task will also serv... [17:24:54] PROBLEM - puppet last run on elastic1023 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:27:10] (03CR) 10Andrew Bogott: [C: 032] Keystonehooks: Create and delete sudoer rules in ldap [puppet] - 10https://gerrit.wikimedia.org/r/346489 (https://phabricator.wikimedia.org/T150091) (owner: 10Andrew Bogott) [17:27:16] (03PS3) 10Andrew Bogott: Keystonehooks: Create and delete sudoer rules in ldap [puppet] - 10https://gerrit.wikimedia.org/r/346489 (https://phabricator.wikimedia.org/T150091) [17:27:32] (03PS1) 10RobH: setting dns for tempdb2001 [dns] - 10https://gerrit.wikimedia.org/r/346576 [17:28:02] (03CR) 10RobH: [C: 032] setting dns for tempdb2001 [dns] - 10https://gerrit.wikimedia.org/r/346576 (owner: 10RobH) [17:32:18] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: setup tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3158082 (10jcrespo) To clarify the state of this, we still need this ASAP for service implementation ahead of the switchover (that can take quite some time, it is more than just runn... [17:33:04] elukey: Why did you abandon the change for setting the config? [17:33:10] Shouldn't we at least set it? [17:33:27] (03CR) 10EBernhardson: [C: 031] "Not sure if its there, but certainly the wikidata documentation probably also needs to include mention of this configuration setting." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346542 (owner: 10DCausse) [17:34:04] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [17:37:24] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/0/0: down - Core: asw-esams:xe-3/0/42 (GBLX leg 2) {#14007} [10Gbps DF CWDM C49]BR [17:39:54] RECOVERY - puppet last run on dbproxy1004 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [17:40:24] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [17:43:21] (03PS1) 10RobH: tempdb2001 install parameters [puppet] - 10https://gerrit.wikimedia.org/r/346577 [17:43:32] (03Abandoned) 10EBernhardson: Prevent wikidata dumps from taking all memory on snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/345170 (https://phabricator.wikimedia.org/T161577) (owner: 10EBernhardson) [17:43:41] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: setup tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3158106 (10RobH) I'm getting the OS installed today and handed off. 
[17:44:12] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: setup tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3158108 (10RobH) [17:44:40] (03PS2) 10RobH: tempdb2001 install parameters [puppet] - 10https://gerrit.wikimedia.org/r/346577 [17:44:53] (03CR) 10RobH: [C: 032] tempdb2001 install parameters [puppet] - 10https://gerrit.wikimedia.org/r/346577 (owner: 10RobH) [17:49:27] (03PS5) 10Madhuvishy: tools: job to copytruncate logs in place [puppet] - 10https://gerrit.wikimedia.org/r/326153 (https://phabricator.wikimedia.org/T152235) (owner: 10Rush) [17:52:54] RECOVERY - puppet last run on elastic1023 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:55:00] (03PS1) 10Thcipriani: Scap: update version to 3.5.4-1 [puppet] - 10https://gerrit.wikimedia.org/r/346579 (https://phabricator.wikimedia.org/T127762) [17:55:24] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [17:56:06] (03PS1) 10Jcrespo: Indicate install recipes for newest db1* and db2* DB servers [puppet] - 10https://gerrit.wikimedia.org/r/346580 (https://phabricator.wikimedia.org/T162159) [17:58:54] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:59:25] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup first 10 DB servers - https://phabricator.wikimedia.org/T162159#3158194 (10jcrespo) ^the above should be enough for the recipe. In addition to what Manuel stated, given problems we had in the past, we need to check: * IPMI calls work a... [17:59:44] PROBLEM - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% [18:00:04] PROBLEM - puppet last run on mw2177 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170405T1800). [18:00:19] ema: ^ and there it goes again.. lvs2002 is just broken [18:00:24] PROBLEM - puppet last run on ms-fe2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:01:04] PROBLEM - puppet last run on suhail is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:01:11] 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3158202 (10Dzahn) it went down again: 11:02 < icinga-wm> PROBLEM - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% [18:01:29] (03PS2) 10Jcrespo: Indicate install recipes for newest db1* and db2* DB servers [puppet] - 10https://gerrit.wikimedia.org/r/346580 (https://phabricator.wikimedia.org/T162159) [18:02:31] (03CR) 10Marostegui: [C: 031] "Thanks for taking care of this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/346580 (https://phabricator.wikimedia.org/T162159) (owner: 10Jcrespo) [18:04:04] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [18:04:27] !log lvs2002 - power off via mgmt (it was down but still showed power as on) [18:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:50] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3158209 (10jcrespo) The guidance is the same as T162159#3157547 (documented for databases on https://wikitech.wikimedia.org/wiki/Raid_and_MegaCli#Raid_setup_at_Wikimedia ).... [18:05:52] 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3158211 (10Dzahn) < bblack> perhaps power it off to make sure it doesn't blip back on, for now ``` Server Power: On hpiLO-> power off status=0 status_tag=COMMAND COMPLETED Wed Apr 5 18:03:48 2017... [18:06:01] 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3158212 (10Dzahn) p:05Normal>03High [18:07:24] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [18:08:09] 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3158216 (10Dzahn) @papaul Could you take a look at this. It seems we might have to call HP. We should make this a priority since we'll soon be moving all our traffic to codfw temporarily. [18:11:51] (03CR) 10Chad: [C: 031] Scap: update version to 3.5.4-1 [puppet] - 10https://gerrit.wikimedia.org/r/346579 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [18:12:24] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [18:13:44] (03PS1) 10Jdlrobson: Update Russian Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346581 (https://phabricator.wikimedia.org/T162036) [18:17:00] (03CR) 10Rush: tools: job to copytruncate logs in place (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/326153 (https://phabricator.wikimedia.org/T152235) (owner: 10Rush) [18:17:28] (03PS1) 10Jdlrobson: Deploy Page previews to stable on Hungrian and Hebrew Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346584 (https://phabricator.wikimedia.org/T162162) [18:24:24] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [18:25:54] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [18:28:04] RECOVERY - puppet last run on mw2177 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [18:28:26] (03PS1) 10RobH: update tempdb2001 partitioning [puppet] - 10https://gerrit.wikimedia.org/r/346587 [18:28:40] (03PS2) 10RobH: update tempdb2001 partitioning [puppet] - 10https://gerrit.wikimedia.org/r/346587 [18:28:54] (03CR) 10RobH: [C: 032] update tempdb2001 partitioning [puppet] - 10https://gerrit.wikimedia.org/r/346587 (owner: 10RobH) [18:29:04] RECOVERY - puppet last run on suhail is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [18:29:24] RECOVERY - IPv6 ping to 
codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [18:29:24] RECOVERY - puppet last run on ms-fe2005 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [18:38:22] 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3158350 (10ayounsi) Juniper is ready to proceed with an RMA. We need to sync up with the DC's remote hands for that. [18:41:24] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [18:41:52] (03PS5) 10Andrew Bogott: wmfkeystonehooks: Maintain project page on wikitech [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) [18:43:25] (03PS6) 10Andrew Bogott: wmfkeystonehooks: Maintain project page on wikitech [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) [18:45:48] (03CR) 10Subramanya Sastry: "Clarification comment for benefit of reviewers:" [puppet] - 10https://gerrit.wikimedia.org/r/346196 (owner: 10Subramanya Sastry) [18:49:31] (03PS3) 10Dzahn: installserver: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345549 (owner: 10Faidon Liambotis) [18:50:04] PROBLEM - puppet last run on db1068 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:50:37] mutante lol you just rebased and it is showing as merge conflicts ^^ [18:50:37] (03PS7) 10Andrew Bogott: wmfkeystonehooks: Maintain project page on wikitech [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) [18:50:39] (03PS1) 10Andrew Bogott: Add wikitechstatusconfig for labtest [puppet] - 10https://gerrit.wikimedia.org/r/346590 [18:51:14] (03PS4) 10Dzahn: installserver: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345549 (owner: 10Faidon Liambotis) [18:51:26] paladox: yea, it happens, the first was to remove dependencies between patches [18:51:35] yep [18:51:54] then i have to click "rebase" again in web ui, but that time it doesnt have to be manual... [18:52:59] (03CR) 10Andrew Bogott: [C: 032] Add wikitechstatusconfig for labtest [puppet] - 10https://gerrit.wikimedia.org/r/346590 (owner: 10Andrew Bogott) [18:53:09] (03PS2) 10Andrew Bogott: Add wikitechstatusconfig for labtest [puppet] - 10https://gerrit.wikimedia.org/r/346590 [18:57:12] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:57:32] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:57:43] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:57:43] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:57:52] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:57:52] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:57:52] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:57:52] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:57:53] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:58:02] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:58:02] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:58:02] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:58:02] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:58:02] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:58:03] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:58:03] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:58:04] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:58:05] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:58:11] (03PS1) 10Legoktm: Deploy Linter to medium wikis too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346591 (https://phabricator.wikimedia.org/T148609) [18:58:18] not expected, but do not worry too much [18:58:22] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:58:30] thanks jynus, I was about to start yelling [18:58:42] jouncebot: next [18:58:42] In 0 hour(s) and 1 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170405T1900) [18:58:52] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:59:22] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [18:59:42] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:59:42] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [18:59:42] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [18:59:42] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [18:59:42] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [18:59:42] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [18:59:52] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [18:59:52] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:59:52] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:59:52] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [18:59:52] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [18:59:53] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:59:54] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:59:54] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [18:59:54] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [19:00:02] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: 
Yes [19:00:03] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [19:00:04] RainbowSprinkles: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170405T1900). Please do the needful. [19:00:12] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [19:00:26] (03CR) 10Jcrespo: "Sorry:" [puppet] - 10https://gerrit.wikimedia.org/r/345372 (owner: 10Marostegui) [19:00:51] (03PS1) 10Chad: group1 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346592 [19:04:42] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [19:05:20] (03CR) 10Chad: [C: 032] group1 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346592 (owner: 10Chad) [19:06:34] 06Operations, 10ops-codfw, 06Performance-Team: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3158445 (10Krinkle) [19:08:35] (03Merged) 10jenkins-bot: group1 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346592 (owner: 10Chad) [19:08:46] (03CR) 10jenkins-bot: group1 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346592 (owner: 10Chad) [19:11:42] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [19:12:21] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.19 [19:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:50] (03PS8) 10Andrew Bogott: wmfkeystonehooks: Create project page on wikitech on project creation [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) [19:16:42] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [19:18:02] RECOVERY - puppet last run on db1068 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [19:18:04] (03CR) 10Andrew Bogott: "I've tested this as best I can, and it works fine on labtest. The liberty/mitaka changes are duplicates." [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) (owner: 10Andrew Bogott) [19:20:52] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [19:23:35] mutante: can you do something about this? https://phabricator.wikimedia.org/T161082 [19:26:34] 06Operations, 10ops-eqiad: rack and setup boron replacement frpm1001 - https://phabricator.wikimedia.org/T162298#3158473 (10Cmjohnson) [19:30:01] Nemo_bis: indirectly, i can replace the admin (since i see philippe@ and assume he is not it anymore) if you have one and then they can handle filters [19:30:43] mutante: I can offer to be admin of wikipedia-l, but I can't admin all the mailing lists :) [19:30:47] what i'm not willing to do is handle filters for individual lists myself, just doesn't scale [19:30:49] So it would be nice to have sane defaults [19:31:01] i'll help if the admin needs to be replaced or password reset [19:31:06] which needs the master password [19:33:06] Nemo_bis: a ticket about transferring adminship would be ideal (maybe other things philippe used to do).. but i have to step outside..
being picked up right now [19:33:42] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [19:34:19] (03PS1) 10Jgreen: flip fundraisingdb-read back to db1025 to clone frdb1001 [dns] - 10https://gerrit.wikimedia.org/r/346595 [19:35:12] (03CR) 10Jgreen: [C: 032] flip fundraisingdb-read back to db1025 to clone frdb1001 [dns] - 10https://gerrit.wikimedia.org/r/346595 (owner: 10Jgreen) [19:37:20] RainbowSprinkles: uh, revert train: https://phabricator.wikimedia.org/T162300 [19:37:41] Whole train or just donatewiki? [19:37:48] just donatewiki [19:38:09] (I think) [19:38:27] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: roll back donatewiki to wmf.18 [19:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:42] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [19:38:44] hmm https://fog.ccsf.edu/~msapiro/scripts/set_attributes [19:38:50] (03PS1) 10Chad: Rolling back donatewiki to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346596 (https://phabricator.wikimedia.org/T162300) [19:39:02] (03CR) 10Chad: [C: 032] Rolling back donatewiki to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346596 (https://phabricator.wikimedia.org/T162300) (owner: 10Chad) [19:39:47] might also want to check if foundationwiki used rawhtml in its system messages anywhere [19:40:06] (03Merged) 10jenkins-bot: Rolling back donatewiki to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346596 (https://phabricator.wikimedia.org/T162300) (owner: 10Chad) [19:40:16] (03CR) 10jenkins-bot: Rolling back donatewiki to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346596 (https://phabricator.wikimedia.org/T162300) (owner: 10Chad) [19:40:45] 06Operations, 10Wikimedia-General-or-Unknown: Job queue is growing and growing - https://phabricator.wikimedia.org/T124194#3158574 (10aaron) [19:41:39] p858snake: I hit special:random a few times and didn't see anything wrong [19:41:53] I can roll back foundationwiki too [19:41:56] Just to be safe for now [19:42:44] !log ppchelko@tin Started deploy [trending-edits/deploy@d8ca758]: Providing a debug endpoint [19:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:08] 06Operations, 10ops-codfw, 06Performance-Team: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3158583 (10RobH) a:05RobH>03fgiunchedi So this system is booted to OS with NO networking. The usb stick is mounted as /mnt/sde/ All the data can be copied over, b... [19:43:40] 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3152568 (10ayounsi) Pushed the following to cr1/2.codfw. When lvs2002 comes back online for troubleshooting it should not receive any traffic. ``` [edit routing-options rib inet6.0 static route 2620:0... 
[19:43:54] !log pushing https://www.irccloud.com/pastebin/Kecy61aZ/ to cr1/2.codfw for T162099 [19:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:08] T162099: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099 [19:44:27] (03CR) 10BryanDavis: wmfkeystonehooks: Create project page on wikitech on project creation (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) (owner: 10Andrew Bogott) [19:44:55] (which is slightly faster to use than http://manpages.ubuntu.com/manpages/precise/man8/config_list.8.html ) [19:47:22] PROBLEM - puppet last run on mwdebug1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:48:54] (03CR) 10Andrew Bogott: wmfkeystonehooks: Create project page on wikitech on project creation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) (owner: 10Andrew Bogott) [19:50:40] !log ppchelko@tin Finished deploy [trending-edits/deploy@d8ca758]: Providing a debug endpoint (duration: 07m 56s) [19:50:43] !log ppchelko@tin Started deploy [trending-edits/deploy@d8ca758]: Providing a debug endpoint [19:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:45] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: setup tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3158631 (10RobH) a:05RobH>03jcrespo [19:55:02] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: setup tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3158633 (10RobH) a:05jcrespo>03None So this is now ready for puppet key/salt key and service implementation by the #DBA team. This already has their tag for #DBA on the task, I... [19:55:42] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 0 [19:55:43] !log ppchelko@tin Finished deploy [trending-edits/deploy@d8ca758]: Providing a debug endpoint (duration: 04m 59s) [19:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170405T2000). Please do the needful. [20:00:14] Nothing for ORES today [20:03:34] sss [20:07:05] 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3156449 (10RobH) I've emailed evoswitch to open an inbound shipment ticket. Once I have that reference, I'll update this task so @ayounsi can have Juniper dispatch the replacement part. 
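For the mailing-list admin discussion above: the two links being compared (the set_attributes script and the config_list man page) are both command-line ways of reading and writing Mailman 2 list settings, which is what makes "sane defaults" scriptable across many lists instead of clicking through each list's web admin. A rough sketch of the config_list route, assuming shell access on the list server and using wikipedia-l purely as an example; on Debian-style installs the tool usually lives under the Mailman bin directory (e.g. /usr/lib/mailman/bin/config_list):

```bash
# Dump the current configuration of a list to an editable, Python-syntax file.
config_list -o wikipedia-l.cfg wikipedia-l

# Edit wikipedia-l.cfg (moderation/filter settings, defaults, etc.),
# then load the edited file back into the list.
config_list -i wikipedia-l.cfg wikipedia-l
```

Looping something like that over every list is presumably what the linked set_attributes script streamlines, hence the "slightly faster to use" remark.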
[20:07:12] 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3158709 (10RobH) a:03RobH [20:07:20] !log arlolra@tin Started deploy [parsoid/deploy@f2d4eee]: Updating Parsoid to 32b7c677 [20:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:42] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [20:11:43] 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3158717 (10ayounsi) Step by step instructions for the remote hands: # Locate the chassis: http://www.juniper.net/techpubs/en_US/release-independent/junos/topics/concept/mpc-mx480-description.html # L... [20:14:30] 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3158719 (10RobH) Inbound ticket # is 7326745, please go ahead and have them dispatch the part. Update this task with the tracking # and assign to me, and I'll get the inbound ticket updated. [20:15:42] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [20:15:47] (03PS3) 10Zppix: Adding a few more typos that could break things if they aren't tested for. [puppet] - 10https://gerrit.wikimedia.org/r/346282 [20:16:25] can an ops member tell me if i need to have that above change swatted since it's so minor? [20:17:06] Zppix if you don't find someone to merge it, you can add it to puppet swat [20:17:07] which is separate from mediawiki swat. [20:17:23] though that looks so minor [20:17:29] paladox: it's so minor i don't really want to waste a spot on a swat [20:18:46] !log arlolra@tin Finished deploy [parsoid/deploy@f2d4eee]: Updating Parsoid to 32b7c677 (duration: 11m 26s) [20:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:21] yep [20:19:44] addshore: do you have a moment? [20:19:47] (03PS3) 10Mobrovac: RESTBase: Migrate to Scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/346248 (https://phabricator.wikimedia.org/T116335)
[20:41:53] !log ppchelko@tin Started deploy [trending-edits/deploy@475a5c0]: Fix edit scorer [20:41:53] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:41:53] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:02] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:42:02] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:42:03] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:42:03] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:42:03] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:42:03] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:42:03] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:42:04] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:42:04] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:42:12] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:42:13] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:42:22] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:43:12] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [20:43:12] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [20:43:12] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [20:43:22] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [20:43:42] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [20:43:42] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [20:43:42] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [20:43:42] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [20:43:42] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [20:43:42] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [20:43:43] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [20:43:52] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [20:43:52] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [20:43:52] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [20:43:52] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [20:43:52] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [20:43:53] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [20:43:53] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [20:43:54] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [20:43:54] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [20:44:44] !log ppchelko@tin Finished deploy [trending-edits/deploy@475a5c0]: Fix edit scorer (duration: 02m 51s) [20:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:31] !log ppchelko@tin Started deploy [trending-edits/deploy@475a5c0]: Fix edit scorer [20:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:38] is it me or did you just deploy the same thing twice Pchelolo [20:49:12] Zppix: no, that's indeed me [20:49:27] Pchelolo: thats not what i meant but ok [20:49:54] Pchelolo: i was confused cause the two deployments looked the same [20:50:14] 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3158836 (10ayounsi) From Juniper: > Thank you for the information on provided, and the RMA request has been processed for the FPC: > - RMA number: R200119594 > - Product ID: MPC5E-40G10... [20:51:25] Zppix: they're the same, it's an experimental service so the deploy fails sometimes (it executes a very involved calculation on startup and the checks can fail). 
We're gonna fix it [20:52:18] Pchelolo: i figured, i was just making sure, i've seen accidents happen and didn't want anything bad to happen so i thought it would be better to say something than not [20:52:42] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [20:53:06] !log ppchelko@tin Finished deploy [trending-edits/deploy@475a5c0]: Fix edit scorer (duration: 05m 34s) [20:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:42] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [21:05:12] RECOVERY - puppet last run on restbase-dev1001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [21:09:59] when an operations member gets time can they review https://gerrit.wikimedia.org/r/346282 it's a very minor change [21:10:45] Minor puppet changes go on puppet swat [21:12:22] RainbowSprinkles: it's just to the typo file; is wasting a puppet swat slot really necessary? [21:12:36] That's what puppet swat is *for* [21:12:38] Minor changes [21:13:00] (03PS1) 10Mobrovac: [Beta] RESTBase: Set the Cassandra seeds correctly [puppet] - 10https://gerrit.wikimedia.org/r/346639 [21:13:06] RainbowSprinkles: do i have to be around for it cause i can't promise i will be able to [21:13:19] Yes, you do [21:13:40] let me look at the deployment calendar and i'll see if i can try [21:14:48] RainbowSprinkles: the time that puppet swat is at isn't a good time for me, is there any other way? [21:15:38] Getting puppet changes merged takes one of three things: puppet swat, bugging someone, or becoming a root [21:18:54] RainbowSprinkles: i mean if eu swat is not doing anything relevant (or they are done but there's still time) could i have it done then or no? [21:19:06] correction: not relevant, i meant at the time [21:19:16] I mean, you gotta just find someone willing to merge [21:19:24] alright thanks [21:19:32] (03CR) 10Reedy: "Are these plausible typos?" [puppet] - 10https://gerrit.wikimedia.org/r/346282 (owner: 10Zppix) [21:20:26] (03CR) 10Zppix: "> Are these plausible typos?" [puppet] - 10https://gerrit.wikimedia.org/r/346282 (owner: 10Zppix) [21:20:40] (03CR) 10Chad: "The puppet ones seem viable to me, but I haven't seen those scap typos before." [puppet] - 10https://gerrit.wikimedia.org/r/346282 (owner: 10Zppix) [21:20:43] (03CR) 10Mobrovac: [C: 031] "Cherry-picked in beta, works." [puppet] - 10https://gerrit.wikimedia.org/r/346639 (owner: 10Mobrovac) [21:21:59] (03CR) 10Reedy: "How would it take a lot of work?" [puppet] - 10https://gerrit.wikimedia.org/r/346282 (owner: 10Zppix) [21:22:45] (03CR) 10Chad: [C: 031] [Beta] RESTBase: Set the Cassandra seeds correctly [puppet] - 10https://gerrit.wikimedia.org/r/346639 (owner: 10Mobrovac) [21:23:03] (03CR) 10Zppix: "say if for some reason we needed to scap something automatically if you mis spell scap you'll have patches that werent scapped automatical" [puppet] - 10https://gerrit.wikimedia.org/r/346282 (owner: 10Zppix) [21:24:19] (03CR) 10Chad: "That doesn't even make sense. We don't automatically scap things--nor does such a thing have anything to do with rebasing or conflicts. Fi" [puppet] - 10https://gerrit.wikimedia.org/r/346282 (owner: 10Zppix) [21:25:27] (03PS4) 10Zppix: Adding a few more typos that could break things if they aren't tested for.
[puppet] - 10https://gerrit.wikimedia.org/r/346282 [21:25:41] (03CR) 10Zppix: "ps4 removes the scap typos and adds other puppet typos" [puppet] - 10https://gerrit.wikimedia.org/r/346282 (owner: 10Zppix) [21:27:11] (03CR) 10Bmansurov: [C: 031] Update Russian Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346581 (https://phabricator.wikimedia.org/T162036) (owner: 10Jdlrobson) [21:34:38] (03CR) 10Chad: "Is this still desired? Seems trivial enough" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280003 (owner: 10Jforrester) [21:35:00] 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3158919 (10RobH) I'm also being CC'd on those emails from Juniper. Once they reply back with the tracking #, I'll update EvoSwitch for the open shipment ticket and open the ticket for the smart hands req... [21:39:54] (03PS2) 10Chad: Use directly wgGalleryOptions without wmg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331819 (owner: 10Dereckson) [21:41:49] (03CR) 10Chad: "Want this to land? Easy enough" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334401 (https://phabricator.wikimedia.org/T156185) (owner: 10Brion VIBBER) [21:43:03] RainbowSprinkles: hah i forgot about that one [21:43:21] Just scanning the backlog :) [21:43:35] anything i need to poke on it? [21:44:52] PROBLEM - parsoid on wtp2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:44:54] Mainly just wondering if you want it live. I can do the sync easy enough [21:45:42] RECOVERY - parsoid on wtp2004 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.137 second response time [21:46:17] RainbowSprinkles: yeah go for it [21:46:20] some files need the extra time [21:46:32] thanks :D [21:46:51] Aww, merge conflict [21:46:59] RainbowSprinkles: i was just about to point that out [21:48:16] poopers [21:49:44] Fixing [21:49:50] Needed a manual rebase locally [21:50:00] (03PS2) 10Chad: Increase video transcode max time from 8 to 16 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334401 (https://phabricator.wikimedia.org/T156185) (owner: 10Brion VIBBER) [21:50:32] i really want a 'git rebase' emoji [21:50:42] it would roughly resemble that painting "The Scream" [21:50:46] I think that's 😖 [21:50:52] hehe [21:52:18] lol [21:52:20] brion: just look up "hell" on google images [21:52:22] PROBLEM - parsoid on wtp2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:52:26] :) [21:53:12] RECOVERY - parsoid on wtp2009 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 1.448 second response time [21:54:05] (03CR) 10Zppix: "Is PS4 okay or is there anymore changes needed?" 
[puppet] - 10https://gerrit.wikimedia.org/r/346282 (owner: 10Zppix) [21:54:16] (03CR) 10Chad: [C: 032] Increase video transcode max time from 8 to 16 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334401 (https://phabricator.wikimedia.org/T156185) (owner: 10Brion VIBBER) [21:54:23] \o/ woo [21:56:40] (03Merged) 10jenkins-bot: Increase video transcode max time from 8 to 16 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334401 (https://phabricator.wikimedia.org/T156185) (owner: 10Brion VIBBER) [21:56:49] (03CR) 10jenkins-bot: Increase video transcode max time from 8 to 16 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334401 (https://phabricator.wikimedia.org/T156185) (owner: 10Brion VIBBER) [21:58:20] !log demon@tin Synchronized wmf-config/CommonSettings.php: bump video transcode timeouts, brion made me do it (duration: 00m 40s) [21:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:32] hehe [21:59:12] PROBLEM - parsoid on wtp2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:00:02] RECOVERY - parsoid on wtp2007 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.088 second response time [22:02:04] !log ppchelko@tin Started deploy [trending-edits/deploy@46544de]: Correctly calculate since parameter and allow to change decay for debugging [22:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:21] ok i gotta run some errands, i'll be back in the evening to work on schema bits [22:02:35] going to try a split schedule since i often end up poking at the computer in the evening anyway ;) [22:04:34] !log ppchelko@tin Finished deploy [trending-edits/deploy@46544de]: Correctly calculate since parameter and allow to change decay for debugging (duration: 02m 29s) [22:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:52] PROBLEM - parsoid on wtp2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:05:15] !log ppchelko@tin Started deploy [trending-edits/deploy@46544de]: Correctly calculate since parameter and allow to change decay for debugging, attempt 2 [22:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:42] RECOVERY - parsoid on wtp2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.116 second response time [22:12:21] !log ppchelko@tin Finished deploy [trending-edits/deploy@46544de]: Correctly calculate since parameter and allow to change decay for debugging, attempt 2 (duration: 07m 06s) [22:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:51] (03CR) 10Jforrester: "> Is this still desired? Seems trivial enough" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280003 (owner: 10Jforrester) [22:30:18] (03CR) 10Jforrester: [C: 031] Deploy Linter to medium wikis too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346591 (https://phabricator.wikimedia.org/T148609) (owner: 10Legoktm) [22:37:36] !log restbase deploying a8d4d027 [22:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:55] 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3159129 (10Papaul) a:03Papaul [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170405T2300). Please do the needful. 
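For the 346282 discussion above: the patch simply extends the repository's typo list, which CI greps the tree against so that a known misspelling fails a change before it can merge. A minimal sketch of that kind of check, assuming a plain-text `typos` file with one pattern per line; the actual operations/puppet tooling may wire this up differently:

```bash
#!/bin/bash
# Fail the build if any pattern listed in the typos file appears in tracked
# files, excluding the typos file itself (git grep exits 0 when it matches).
if git grep -n -I -f typos -- . ':(exclude)typos'; then
    echo "Known typo(s) found, see matches above." >&2
    exit 1
fi
echo "No known typos found."
```

That is also what keeps such patches so small: each new misspelling is a one-line addition to the list, which is why the review focuses on whether the added patterns are plausible mistakes worth grepping for.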
[23:00:04] Jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:41] \o [23:01:57] I can SWAT [23:02:11] I'm also self-swatting something [23:02:14] Last minute [23:02:48] thcipriani: when your done with jdlrobson mind going ahead and swatting a puppet patch for me as i wont be able to be around for puppet swat? [23:02:59] RainbowSprinkles: ok :) [23:03:15] Zppix: I don't have +2 on operations/puppet, so I can't sorry :( [23:03:24] thcipriani: no worries [23:04:02] PROBLEM - puppet last run on mc1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:04:24] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346584 (https://phabricator.wikimedia.org/T162162) (owner: 10Jdlrobson) [23:06:12] (03Merged) 10jenkins-bot: Deploy Page previews to stable on Hungrian and Hebrew Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346584 (https://phabricator.wikimedia.org/T162162) (owner: 10Jdlrobson) [23:06:29] (03CR) 10jenkins-bot: Deploy Page previews to stable on Hungrian and Hebrew Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346584 (https://phabricator.wikimedia.org/T162162) (owner: 10Jdlrobson) [23:07:42] RainbowSprinkles: are you in the middle of SWATting? or can I go ahead with ^ [23:07:53] Go ahead, I'm still waiting on jenkins [23:08:00] okie doke [23:09:08] jdlrobson: page previews on hewiki and huwiki is on mwdebug1002, check please [23:09:16] thcipriani: on it [23:11:43] thcipriani: you can proceed! [23:11:52] jdlrobson: ok, going live [23:12:22] Ah crud, thought you were done [23:12:25] Yay mid-sync [23:12:31] gimme 10 seconds [23:12:52] !log demon@tin Synchronized php-1.29.0-wmf.19/extensions/Dashiki/: swattttttt (duration: 00m 41s) [23:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:04] Crap. [23:13:21] thcipriani: Continue. 
[23:13:24] I'm out of your way [23:13:47] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:346584|Deploy Page previews to stable on Hungrian and Hebrew Wikipedias]] T162162 (duration: 00m 40s) [23:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:54] T162162: Deploy page previews to Hungarian and Hebrew wikipedias - https://phabricator.wikimedia.org/T162162 [23:13:55] ^ jdlrobson live now [23:13:59] yay [23:14:43] RainbowSprinkles: somehow I must've missed your sync :) [23:15:04] (03PS2) 10Thcipriani: Update Russian Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346581 (https://phabricator.wikimedia.org/T162036) (owner: 10Jdlrobson) [23:15:13] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346581 (https://phabricator.wikimedia.org/T162036) (owner: 10Jdlrobson) [23:16:21] (03PS1) 10Milimetric: Revert "Revert "Restore Dashiki config in CommonSettings for now"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346661 [23:16:33] (03PS2) 10Milimetric: Revert "Revert "Restore Dashiki config in CommonSettings for now"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346661 [23:17:19] (03Merged) 10jenkins-bot: Update Russian Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346581 (https://phabricator.wikimedia.org/T162036) (owner: 10Jdlrobson) [23:17:22] (03CR) 10Chad: [V: 032 C: 032] Revert "Revert "Restore Dashiki config in CommonSettings for now"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346661 (owner: 10Milimetric) [23:17:29] (03CR) 10jenkins-bot: Update Russian Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346581 (https://phabricator.wikimedia.org/T162036) (owner: 10Jdlrobson) [23:18:36] !log demon@tin Synchronized wmf-config/CommonSettings.php: unbreak dashiki again (duration: 00m 40s) [23:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:00] (03CR) 10jenkins-bot: Revert "Revert "Restore Dashiki config in CommonSettings for now"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346661 (owner: 10Milimetric) [23:20:12] PROBLEM - puppet last run on mc1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:21:25] jdlrobson: I fetched the image down on mwdebug1002, dunno if I'm seeing it or not :) [23:23:04] thcipriani: lemme see [23:23:49] im seeing something different [23:23:56] MaxSem: you around? [23:24:01] need a Russian speaker :) [23:24:42] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [23:25:08] jdlrobson, ? [23:25:35] MaxSem: could you check the logo on Russian Wikipedia on mwdebug1002 and tell me if it looks normal to you? https://ru.m.wikipedia.org/ [23:26:18] jdlrobson, logo in the footer? [23:26:21] header [23:26:31] next to the hamburger [23:26:33] ah [23:26:45] that's wordmar or something :P [23:26:56] it should be fine, but it would be nice to hear from a true Russian that it's an improvement :) [23:27:10] compared to what we show without mwdebug1002 [23:27:10] lgtm [23:27:23] thanks MaxSem on behalf of all russians everywhere! go for it thcipriani [23:27:37] :) [23:27:39] going live [23:27:51] what's the difference? 
[23:29:42] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [23:29:49] !log thcipriani@tin Synchronized static/images/mobile/copyright/wikipedia-wordmark-ru.svg: SWAT: [[gerrit:346581|Update Russian Wikipedia logo]] T162036 (duration: 00m 40s) [23:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:57] T162036: Rendering issues with logo in Russian Wikipedia on mobile - https://phabricator.wikimedia.org/T162036 [23:31:11] jdlrobson: ok, sync'd and purged [23:32:02] RECOVERY - puppet last run on mc1034 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [23:33:46] (03CR) 10Dzahn: [C: 032] installserver: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345549 (owner: 10Faidon Liambotis) [23:33:50] (03PS5) 10Dzahn: installserver: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345549 (owner: 10Faidon Liambotis) [23:47:15] (03PS6) 10Dzahn: DHCP: remove backup4001 [puppet] - 10https://gerrit.wikimedia.org/r/345356 (https://phabricator.wikimedia.org/T158220) [23:47:28] (03CR) 10jerkins-bot: [V: 04-1] DHCP: remove backup4001 [puppet] - 10https://gerrit.wikimedia.org/r/345356 (https://phabricator.wikimedia.org/T158220) (owner: 10Dzahn) [23:48:12] RECOVERY - puppet last run on mc1036 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [23:51:18] (03PS7) 10Dzahn: DHCP: remove backup4001 [puppet] - 10https://gerrit.wikimedia.org/r/345356 (https://phabricator.wikimedia.org/T158220) [23:54:21] 06Operations, 10Monitoring: certspotter on einsteinium has issues talking to external - https://phabricator.wikimedia.org/T162327#3159322 (10Dzahn) [23:56:36] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3159330 (10Dzahn) re: mail to noc@ I was stupid of course i can check that, it's just an alias for root@ and all ops get that. but .. i can still not see one fr... [23:57:08] (03PS8) 10Dzahn: DHCP: remove backup4001 [puppet] - 10https://gerrit.wikimedia.org/r/345356 (https://phabricator.wikimedia.org/T161904) [23:57:44] (03CR) 10Dzahn: [C: 032] "this is now per the new decom task" [puppet] - 10https://gerrit.wikimedia.org/r/345356 (https://phabricator.wikimedia.org/T161904) (owner: 10Dzahn)