[00:00:41] <wikibugs>	 (03PS1) 10Chad: Scap clean: ensure proper quotation of deletion commands in keep-static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346481
[00:02:16] <wikibugs_>	 06Operations, 10Parsoid: Upload of Parsoid deb package 0.7.0 failed - https://phabricator.wikimedia.org/T162200#3155844 (10Dzahn) I moved the old files above out of "incoming", on tin i [tin:~] $ sudo rm /tmp/parsoid_0.7.0all_amd64.bromine.eqiad.wmnet.upload to be able to repeat the upload.  I deleted the pack...
[00:03:46] <icinga-wm>	 RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[00:08:11] <logmsgbot>	 !log tstarling@tin Synchronized php-1.29.0-wmf.18/extensions/ParserMigration/includes/MigrationEditPage.php: for bug fix gerrit 346478 (duration: 00m 56s)
[00:08:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:24] <wikibugs_>	 (03PS1) 10Cmjohnson: Adding mgmt dns entries for new hadoop nodes analytics1058-1069 [dns] - 10https://gerrit.wikimedia.org/r/346483
[00:19:36] <icinga-wm>	 PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:20:09] <wikibugs_>	 (03PS4) 10Dzahn: releases: remove the precise suite [puppet] - 10https://gerrit.wikimedia.org/r/345838 (owner: 10Faidon Liambotis)
[00:22:31] <wikibugs>	 (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns entries for new hadoop nodes analytics1058-1069 [dns] - 10https://gerrit.wikimedia.org/r/346483 (owner: 10Cmjohnson)
[00:25:18] <wikibugs>	 06Operations, 10ops-eqiad: rack and set up analyics1058-1069 - https://phabricator.wikimedia.org/T162216#3155871 (10Cmjohnson)
[00:25:36] <icinga-wm>	 PROBLEM - puppet last run on mc1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:29:59] <logmsgbot>	 !log tstarling@tin Synchronized php-1.29.0-wmf.18/extensions/ParserMigration/includes: scap test only, no code changes (duration: 01m 21s)
[00:30:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:41] <logmsgbot>	 !log demon@tin Started scap: wmf.14 again, testing testing
[00:30:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:04] <wikibugs_>	 (03PS2) 10Chad: Scap clean: ensure proper quotation of deletion commands in keep-static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346481
[00:31:09] <wikibugs>	 (03CR) 10Chad: [C: 032] Scap clean: ensure proper quotation of deletion commands in keep-static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346481 (owner: 10Chad)
[00:32:10] <wikibugs>	 (03Merged) 10jenkins-bot: Scap clean: ensure proper quotation of deletion commands in keep-static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346481 (owner: 10Chad)
[00:32:24] <wikibugs_>	 (03CR) 10jenkins-bot: Scap clean: ensure proper quotation of deletion commands in keep-static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346481 (owner: 10Chad)
[00:36:56] <icinga-wm>	 PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:42:46] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[00:47:36] <icinga-wm>	 RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[00:47:46] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[00:52:36] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:52:46] <icinga-wm>	 PROBLEM - Apache HTTP on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:52:47] <icinga-wm>	 PROBLEM - HHVM rendering on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:53:36] <icinga-wm>	 RECOVERY - puppet last run on mc1030 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[00:53:46] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1165 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:55:06] <icinga-wm>	 PROBLEM - HHVM rendering on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:55:16] <icinga-wm>	 PROBLEM - Apache HTTP on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:55:36] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:55:56] <icinga-wm>	 PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:57:30] <logmsgbot>	 !log demon@tin Finished scap: wmf.14 again, testing testing (duration: 26m 48s)
[00:57:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:46] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[01:04:56] <icinga-wm>	 RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[01:09:46] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[01:12:31] <wikibugs_>	 (03CR) 10Zppix: [C: 031] "Considering we depricated precise in prod and as well as releng i say we go ahead and merge this." [puppet] - 10https://gerrit.wikimedia.org/r/345838 (owner: 10Faidon Liambotis)
[01:16:46] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[01:17:44] <logmsgbot>	 !log demon@tin Synchronized scap/plugins/clean.py: fixes (duration: 00m 41s)
[01:17:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:26] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[01:21:46] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[01:23:26] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[01:24:56] <icinga-wm>	 RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[01:26:37] <logmsgbot>	 !log tstarling@tin Synchronized php-1.29.0-wmf.18/extensions/ParserMigration/includes: scap test only, no code changes (duration: 00m 40s)
[01:26:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:27:48] <logmsgbot>	 !log tstarling@tin Synchronized php-1.29.0-wmf.18/extensions/ParserMigration/includes: scap test only, no code changes (duration: 00m 39s)
[01:27:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:29:46] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 81583.267689 Seconds
[01:29:56] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 82325.387609 Seconds
[01:32:26] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 82477.859562 Seconds
[01:32:26] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 82479.834204 Seconds
[01:32:36] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 81751.161079 Seconds
[01:32:37] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 81754.352817 Seconds
[01:35:26] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[01:38:26] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds
[01:41:26] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 83019.890733 Seconds
[01:50:26] <icinga-wm>	 PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:50:36] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[01:50:36] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[01:53:36] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 83014.244867 Seconds
[01:54:26] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 20.106499 Seconds
[01:54:26] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 22.203747 Seconds
[01:54:36] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 12.285725 Seconds
[01:54:46] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 21.221803 Seconds
[01:54:56] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 47.820263 Seconds
[01:59:24] <wikibugs_>	 06Operations, 10Parsoid: Upload of Parsoid deb package 0.7.0 failed - https://phabricator.wikimedia.org/T162200#3156004 (10ssastry) 05Open>03Resolved p:05Triage>03Normal a:03Dzahn
[01:59:52] <wikibugs>	 06Operations, 10Parsoid: Upload of Parsoid deb package 0.7.0 failed - https://phabricator.wikimedia.org/T162200#3155315 (10ssastry) Confirmed. apt-get install parsoid installs the newer version now.
[02:05:08] <wikibugs>	 (03PS1) 10Andrew Bogott: Keystonehooks:  Create and delete sudoer rules in ldap [puppet] - 10https://gerrit.wikimedia.org/r/346489 (https://phabricator.wikimedia.org/T150091)
[02:10:06] <icinga-wm>	 RECOVERY - Keystone admin and observer projects exist on labtestnet2001 is OK: Keystone projects exist and have matching names and ids.
[02:18:26] <icinga-wm>	 RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[02:31:05] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.18) (duration: 08m 47s)
[02:31:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:12] <wikibugs_>	 06Operations, 10Traffic, 06WMF-Design, 10Wikimedia-General-or-Unknown, 07Design: Better WMF error pages - https://phabricator.wikimedia.org/T76560#3156017 (10Krinkle)
[02:32:24] <wikibugs>	 06Operations, 10Traffic, 06WMF-Design, 10Wikimedia-General-or-Unknown, 07Design: Better WMF error pages - https://phabricator.wikimedia.org/T76560#805779 (10Krinkle)
[02:49:46] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1054 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[02:50:26] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[02:57:25] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.19) (duration: 07m 22s)
[02:57:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:03:18] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Apr  5 03:03:18 UTC 2017 (duration 5m 53s)
[03:03:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:04:56] <icinga-wm>	 PROBLEM - puppet last run on wtp1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:18:10] <wikibugs_>	 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3156034 (10Revent) https://commons.wikimedia.org/wiki/File:Walking_Keage_Incline.webm reappeared, and has been reset
[03:31:56] <icinga-wm>	 RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[03:35:56] <icinga-wm>	 PROBLEM - HP RAID on db2037 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[03:55:46] <icinga-wm>	 RECOVERY - HP RAID on db2037 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Battery/Capacitor
[04:35:56] <icinga-wm>	 PROBLEM - HP RAID on db2037 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[04:38:56] <icinga-wm>	 PROBLEM - puppet last run on mc1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:42:26] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[04:52:26] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[04:55:36] <icinga-wm>	 RECOVERY - HP RAID on db2037 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Battery/Capacitor
[05:05:56] <icinga-wm>	 RECOVERY - puppet last run on mc1033 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[05:07:56] <icinga-wm>	 PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:35:56] <icinga-wm>	 RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[05:38:48] <icinga-wm>	 PROBLEM - MariaDB disk space on db1047 is CRITICAL: DISK CRITICAL - free space: / 419 MB (5% inode=72%)
[05:44:47] <wikibugs_>	 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3150362 (10Pokefan95) All files listed here works for me except https:/...
[05:47:26] <wikibugs>	 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3156223 (10Pokefan95) https://upload.wikimedia.org/wikipedia/commons/th...
[05:57:48] <icinga-wm>	 RECOVERY - MariaDB disk space on db1047 is OK: DISK OK
[05:59:20] <wikibugs_>	 06Operations, 10DBA: dbstore1002 in bad shape - https://phabricator.wikimedia.org/T162212#3156234 (10Marostegui) Might be related to the work that has been done by some analysts with some SUPER heavy queries in the last few days...
[05:59:26] <icinga-wm>	 PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3205.70 Read Requests/Sec=5661.00 Write Requests/Sec=9.70 KBytes Read/Sec=22650.80 KBytes_Written/Sec=2910.00
[06:01:06] <wikibugs_>	 06Operations, 10Traffic, 05MW-1.28-release (WMF-deploy-2016-08-09_(1.28.0-wmf.14)), 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#3156237 (10Tbayer) >>! In T107430#2886903, @Tbayer wrote: >>>! In T107430#2882009, @fgiunchedi wrote: >>>>! In T107430#288195...
[06:03:11] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2061" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346497
[06:03:15] <wikibugs_>	 (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2061" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346497
[06:05:11] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2061" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346497 (owner: 10Marostegui)
[06:06:34] <wikibugs_>	 (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2061" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346497 (owner: 10Marostegui)
[06:06:44] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2061" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346497 (owner: 10Marostegui)
[06:07:36] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2061 - T160390 (duration: 00m 40s)
[06:07:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:44] <stashbot>	 T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390
[06:09:26] <icinga-wm>	 RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.40 Read Requests/Sec=0.20 Write Requests/Sec=0.90 KBytes Read/Sec=1.20 KBytes_Written/Sec=21.20
[06:15:36] <wikibugs_>	 (03PS1) 10Marostegui: db-codfw.php: Depool db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346498 (https://phabricator.wikimedia.org/T160390)
[06:20:53] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346498 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui)
[06:21:53] <wikibugs>	 (03Merged) 10jenkins-bot: db-codfw.php: Depool db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346498 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui)
[06:22:02] <wikibugs>	 (03CR) 10jenkins-bot: db-codfw.php: Depool db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346498 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui)
[06:22:47] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2054 - T160390 (duration: 00m 43s)
[06:22:48] <marostegui>	 !log Deploy schema change db2054 (s7) - https://phabricator.wikimedia.org/T160390
[06:22:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:54] <stashbot>	 T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390
[06:23:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:39] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db1070 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 645.98 seconds
[06:24:49] <marostegui>	 came back from downtime
[06:25:28] <marostegui>	 downtimed again
[06:25:37] <marostegui>	 it is depooled anyways
[06:32:00] <elukey>	 checking mw1223 and mw1288
[06:33:36] <icinga-wm>	 PROBLEM - Host cr2-esams is DOWN: CRITICAL - Network Unreachable (91.198.174.244)
[06:33:43] <elukey>	 !log restart hhvm on mw1223 (hhvm-dump-debug in /tmp/hhvm.2164.bt.)
[06:33:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:55] <elukey>	 cr2-esams down??
[06:34:02] <elukey>	 paravoid: ---^
[06:34:06] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/1/3: down - Core: cr2-esams:xe-0/1/3 (Level3, BDFS2448, 84ms) {#2013} [10Gbps wave]BR
[06:35:06] <icinga-wm>	 RECOVERY - Apache HTTP on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.024 second response time
[06:35:06] <icinga-wm>	 RECOVERY - HHVM rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 74750 bytes in 0.183 second response time
[06:35:26] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.026 second response time
[06:36:55] <elukey>	 !log restart hhvm on mw1288 (hhvm-dump-debug in /tmp/hhvm.92520.bt.)
[06:37:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:56] <icinga-wm>	 PROBLEM - Host cr2-esams IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:862:ffff::3)
[06:38:36] <icinga-wm>	 RECOVERY - Apache HTTP on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.037 second response time
[06:38:36] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.040 second response time
[06:39:06] <icinga-wm>	 RECOVERY - HHVM rendering on mw1288 is OK: HTTP OK: HTTP/1.1 200 OK - 74749 bytes in 0.110 second response time
[06:40:36] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[06:40:36] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[06:43:19] <wikibugs>	 06Operations, 10ops-eqiad, 10DBA: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3156329 (10Marostegui) p:05Triage>03Normal
[06:43:34] <wikibugs_>	 (03PS1) 10Muehlenhoff: Remove access for adavenport [puppet] - 10https://gerrit.wikimedia.org/r/346502
[06:44:36] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:45:36] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:48:06] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346503 (https://phabricator.wikimedia.org/T161088)
[06:48:46] <wikibugs_>	 (03CR) 10Muehlenhoff: [C: 032] Remove access for adavenport [puppet] - 10https://gerrit.wikimedia.org/r/346502 (owner: 10Muehlenhoff)
[06:49:55] <wikibugs>	 (03PS1) 10Elukey: Depool esams due to networking failures [dns] - 10https://gerrit.wikimedia.org/r/346504
[06:50:39] <wikibugs_>	 (03CR) 10Elukey: [C: 04-1] "Not needed at the moment." [dns] - 10https://gerrit.wikimedia.org/r/346504 (owner: 10Elukey)
[06:52:36] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346503 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui)
[06:53:56] <wikibugs_>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346503 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui)
[06:54:09] <wikibugs_>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346503 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui)
[06:55:03] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db1081 - T161088 (duration: 00m 39s)
[06:55:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:10] <stashbot>	 T161088: Defragment s4: db1091, db1084, db1081, d1059 and probably the rest - https://phabricator.wikimedia.org/T161088
[06:55:54] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1081 - T161088 (duration: 00m 39s)
[06:56:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:41] <marostegui>	 !log Stop replication on db1081 for maintenance - T161088
[06:56:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:06] <icinga-wm>	 PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100%
[07:21:16] <icinga-wm>	 RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 86.47 ms
[07:37:09] <wikibugs_>	 06Operations, 10DBA: dbstore1002 in bad shape - https://phabricator.wikimedia.org/T162212#3156441 (10Marostegui) As I mentioned here: T159430#3153285 I would like to convert a couple of enwiki tables to InnoDB+compression to see if it helps this: https://jira.mariadb.org/browse/MDEV-9027 which we are suffering...
[07:42:11] <wikibugs>	 06Operations, 10DBA: dbstore1002 in bad shape - https://phabricator.wikimedia.org/T162212#3156444 (10jcrespo) 05Open>03Resolved a:03jcrespo Sure. For now I will close this as it seems healthy again.
[07:44:26] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 33 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[07:44:42] <marostegui>	 !log Migrate dbstore1002 enwiki.page and enwiki.categorylinks from TokuDB to InnoDB+compression - T159430
[07:44:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:49] <stashbot>	 T159430: convert dbstore1001 to InnoDB compressed by importing db shards to it - https://phabricator.wikimedia.org/T159430
[07:47:56] <icinga-wm>	 PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:48:46] <icinga-wm>	 RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient
[07:49:26] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[07:50:49] <hoo>	 _joe_: Hi, do you know about our hhvm settings regarding gc?
[07:50:56] <hoo>	 or anyone else?
[07:54:54] <wikibugs>	 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3156449 (10faidon)
[07:56:32] <_joe_>	 hoo: I don't think we have specialized settings for GC, but what do you refer to specifically?
[07:58:37] <hoo>	 _joe_: https://phabricator.wikimedia.org/T161695
[08:00:06] <hoo>	 The saddest part is that HHVM still seems to leak memory if we force GC runs, just way slower
[08:00:41] <wikibugs_>	 (03CR) 10Muehlenhoff: "While we have deprecated precise in production and labs by the end of March, support by Canonical extends until the 26th of April, so I th" [puppet] - 10https://gerrit.wikimedia.org/r/345838 (owner: 10Faidon Liambotis)
[08:01:15] <_joe_>	 hoo: oh you mean actual GC within the execution of one script? that' not the way php is usually behaving
[08:02:03] <hoo>	 PHP is doing GC at some… I'm just not sure when it decides to do so
[08:07:05] <wikibugs>	 (03PS6) 10Ema: cache_upload: override CT updates on 304s [puppet] - 10https://gerrit.wikimedia.org/r/346304 (https://phabricator.wikimedia.org/T162035)
[08:08:42] <wikibugs>	 (03CR) 10Ema: cache_upload: override CT updates on 304s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346304 (https://phabricator.wikimedia.org/T162035) (owner: 10Ema)
[08:10:07] <wikibugs_>	 (03CR) 10Ema: [V: 032 C: 032] cache_upload: override CT updates on 304s [puppet] - 10https://gerrit.wikimedia.org/r/346304 (https://phabricator.wikimedia.org/T162035) (owner: 10Ema)
[08:10:12] <wikibugs>	 (03PS2) 10Jcrespo: Revert "mariadb: Depool db1034 temporarilly to run ANALYZE on revision" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346319
[08:12:10] <hoo>	 _joe_: https://phabricator.wikimedia.org/T161695#3156472 :S
[08:12:14] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1034 temporarilly to run ANALYZE on revision" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346319 (owner: 10Jcrespo)
[08:13:04] <_joe_>	 hoo: I'll take a look, but I think it's the usual cli-vs-fcgi-best-settings
[08:13:19] <hoo>	 hm
[08:13:25] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1034 temporarilly to run ANALYZE on revision" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346319 (owner: 10Jcrespo)
[08:13:37] <wikibugs_>	 (03CR) 10jenkins-bot: Revert "mariadb: Depool db1034 temporarilly to run ANALYZE on revision" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346319 (owner: 10Jcrespo)
[08:14:29] <hashar>	 hoo: oh !!!
[08:15:07] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1034 after maintenance (duration: 00m 40s)
[08:15:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:20] <hashar>	 hoo: thanks for the garbage collection hint.  The Wikidata php5 job that fails due to leak memory, maybe that could be more or less fixed by enabling gc again
[08:15:30] <hashar>	 hoo: we had it disabled when running phpunit due to segfaulting
[08:15:51] <hoo>	 I thought that was only a php 5.3 hack
[08:15:55] <hoo>	 and long gone
[08:16:07] <hashar>	 yeah
[08:18:10] <wikibugs>	 06Operations, 07HHVM: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ) - https://phabricator.wikimedia.org/T161684#3156490 (10MoritzMuehlenhoff) This has been fixed upstream, I'm currently building a new package (and also rebasing to HHVM 3.18.2 while at it) to validate...
[08:18:15] <hoo>	 hashar: You can live hack it to occasionally collect and see how that goes, I guess
[08:18:25] <hashar>	 hoo: that segfaulted on trusty as well :(according to https://phabricator.wikimedia.org/T142158
[08:20:11] <hoo>	 Oh, totally forgot about that one
[08:21:04] <wikibugs>	 (03PS1) 10Elukey: Increase Redis connection timeout for MediaWiki Jobrunners [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346508 (https://phabricator.wikimedia.org/T125735)
[08:21:41] <hoo>	 hashar: Hm, a lot has changed since… maybe be bold and just try it? Can hardly be worse than the current situation
[08:21:52] <hoo>	 it = enable gc again
[08:25:44] <wikibugs>	 (03PS1) 10Ema: cache_upload: lower keep from 3d to 1d on upload backends [puppet] - 10https://gerrit.wikimedia.org/r/346510 (https://phabricator.wikimedia.org/T162035)
[08:25:59] <_joe_>	 hoo: what about we enable gc in hhvm cli just for dumps for now?
[08:26:10] <hoo>	 Sounds good to me
[08:27:23] <_joe_>	 the point is GC is useful just for long-running scripts, not for web requests, as the whole memory for a request gets garbage collected at the end of the request itself
[08:27:48] <_joe_>	 modulo some shared inner structures that won't be affected by that GC anyways
[08:28:08] <_joe_>	 so it's pointless and potentially harmful to do gc on web requests
[08:28:08] <hoo>	 That makes sense
[08:28:30] <_joe_>	 that's the beauty and ugliness of php for the web at the same time :P
[08:29:14] <hoo>	 Indeed
[08:33:56] <icinga-wm>	 PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:37:13] <wikibugs>	 (03CR) 10Marostegui: "Just for the record: looks like this has been working fine and so far dbstore1002 hasn't complained about timeouts today." [puppet] - 10https://gerrit.wikimedia.org/r/345372 (owner: 10Marostegui)
[08:39:22] <hashar>	 contint1001 fails with Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Attempt to assign to a reserved variable name: 'trusted' on node contint1001.wikimedia.org
[08:39:59] <hashar>	 one time issue though
[08:40:11] <wikibugs>	 (03CR) 10Ema: [V: 032 C: 032] cache_upload: lower keep from 3d to 1d on upload backends [puppet] - 10https://gerrit.wikimedia.org/r/346510 (https://phabricator.wikimedia.org/T162035) (owner: 10Ema)
[08:40:56] <icinga-wm>	 RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[08:43:44] <hoo>	 !log Ran scap pull on mwdebug1001 to revert local changes to Wikibase maintenance scripts
[08:43:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:30] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Add tasks for stage 0 (033 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/346305 (owner: 10Giuseppe Lavagetto)
[08:47:29] <_joe_>	 hoo: I don't really have time to work on that now though
[08:48:01] <hoo>	 _joe_: Ok, I'll open a task then… maybe I can also look into that myself, let's ee
[08:48:59] <_joe_>	 hoo: sorry, I really have my hands full :(
[08:49:14] <hoo>	 I can relate to that
[08:51:31] <wikibugs>	 06Operations, 10Datasets-General-or-Unknown, 07HHVM, 10Wikidata: Enable GC for HHVM CLI (at least for dump runners) - https://phabricator.wikimedia.org/T162245#3156596 (10hoo)
[08:53:41] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: role::cluster::management: add profile::conftool::client [puppet] - 10https://gerrit.wikimedia.org/r/346511
[08:54:06] <wikibugs_>	 (03CR) 10Jcrespo: "> Just for the record: looks like this has been working fine and so" [puppet] - 10https://gerrit.wikimedia.org/r/345372 (owner: 10Marostegui)
[08:54:50] <wikibugs>	 (03CR) 10Marostegui: "> > Just for the record: looks like this has been working fine and so" [puppet] - 10https://gerrit.wikimedia.org/r/345372 (owner: 10Marostegui)
[08:56:28] <wikibugs_>	 (03CR) 10Giuseppe Lavagetto: [C: 032] role::cluster::management: add profile::conftool::client [puppet] - 10https://gerrit.wikimedia.org/r/346511 (owner: 10Giuseppe Lavagetto)
[08:58:05] <wikibugs_>	 (03CR) 10Jcrespo: "> > > Just for the record: looks like this has been working fine and" [puppet] - 10https://gerrit.wikimedia.org/r/345372 (owner: 10Marostegui)
[09:03:13] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: Add tasks for stage 0 [switchdc] - 10https://gerrit.wikimedia.org/r/346305
[09:04:41] <volans>	 !log deleted the 2 swift thumbs that were making swiftrepl stuck in a loop: T162122
[09:04:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:48] <stashbot>	 T162122: Swiftrepl was stuck in an infinite loop since days - https://phabricator.wikimedia.org/T162122
[09:11:43] <elukey>	 !log reimage analytics1057 to Debian Jessie
[09:11:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:43] <wikibugs>	 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3156645 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1057.eqiad.wmnet'] ``` The log can...
[09:20:26] <icinga-wm>	 PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:22:50] <wikibugs>	 (03CR) 10Volans: [C: 031] "LGTM" [switchdc] - 10https://gerrit.wikimedia.org/r/346305 (owner: 10Giuseppe Lavagetto)
[09:22:56] <icinga-wm>	 PROBLEM - puppet last run on wtp1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:25:48] <wikibugs_>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Add tasks for stage 0 [switchdc] - 10https://gerrit.wikimedia.org/r/346305 (owner: 10Giuseppe Lavagetto)
[09:32:31] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Fix the stop-maintenance task (031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/346306 (owner: 10Giuseppe Lavagetto)
[09:36:29] <wikibugs>	 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3156720 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1057.eqiad.wmnet'] ```  and were **ALL** successful.
[09:36:33] <wikibugs_>	 06Operations, 10Mail, 10Wikimedia-Mailing-lists: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3156721 (10Nemo_bis)
[09:48:23] <volans>	 !log deleted a third swift thumb that was making swiftrepl stuck in a loop: T162122
[09:48:26] <icinga-wm>	 RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[09:48:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:31] <stashbot>	 T162122: Swiftrepl was stuck in an infinite loop since days - https://phabricator.wikimedia.org/T162122
[09:48:46] <icinga-wm>	 PROBLEM - puppet last run on mw1194 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:49:05] <wikibugs_>	 06Operations, 10media-storage: Swiftrepl was stuck in an infinite loop since days - https://phabricator.wikimedia.org/T162122#3156746 (10Volans) The third one was: ``` wikipedia-commons-local-thumb.3b        3/3b/Hendrick_de_Keyser_-_gulden_cabinet.png/85px-Hendrick_de_Keyser_-_gulden_cabinet.png       E-Tag m...
[09:49:56] <icinga-wm>	 RECOVERY - puppet last run on wtp1021 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[10:16:56] <icinga-wm>	 RECOVERY - puppet last run on mw1194 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[10:18:38] <wikibugs_>	 (03PS3) 10Giuseppe Lavagetto: Fix the stop-maintenance task [switchdc] - 10https://gerrit.wikimedia.org/r/346306
[10:23:19] <wikibugs_>	 (03CR) 10Volans: [C: 031] "LGTM" [switchdc] - 10https://gerrit.wikimedia.org/r/346306 (owner: 10Giuseppe Lavagetto)
[10:24:04] <wikibugs_>	 (03CR) 10Giuseppe Lavagetto: Add task to restore the TTL of discovery entries to 5 minutes (032 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/346311 (owner: 10Giuseppe Lavagetto)
[10:24:25] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Fix the stop-maintenance task [switchdc] - 10https://gerrit.wikimedia.org/r/346306 (owner: 10Giuseppe Lavagetto)
[10:24:39] <wikibugs_>	 (03PS3) 10Giuseppe Lavagetto: Propery re-reference the redis task [switchdc] - 10https://gerrit.wikimedia.org/r/346307
[10:24:54] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Propery re-reference the redis task [switchdc] - 10https://gerrit.wikimedia.org/r/346307 (owner: 10Giuseppe Lavagetto)
[10:25:10] <wikibugs_>	 (03PS3) 10Giuseppe Lavagetto: Update the varnish task to use the new puppet scripts [switchdc] - 10https://gerrit.wikimedia.org/r/346308
[10:25:30] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Update the varnish task to use the new puppet scripts [switchdc] - 10https://gerrit.wikimedia.org/r/346308 (owner: 10Giuseppe Lavagetto)
[10:26:21] <wikibugs_>	 (03PS3) 10Giuseppe Lavagetto: Modify the start maintenance script [switchdc] - 10https://gerrit.wikimedia.org/r/346309
[10:26:52] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Modify the start maintenance script [switchdc] - 10https://gerrit.wikimedia.org/r/346309 (owner: 10Giuseppe Lavagetto)
[10:27:54] <wikibugs>	 06Operations, 10ops-eqiad: decommission ms1003 - https://phabricator.wikimedia.org/T157975#3156876 (10ArielGlenn) @Cmjohnson AFAIK there's only removing it from dhcp.  I should go ahead and do that then?  Anything I missed?
[10:31:34] <wikibugs_>	 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3151609 (10MoritzMuehlenhoff) @ayounsi I've added you to pwstore and re-encrypted the password files. Docs can be found at https://office.wikimedia.org/wiki/Pwsto...
[10:31:45] <wikibugs>	 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3156881 (10MoritzMuehlenhoff)
[10:53:56] <icinga-wm>	 PROBLEM - puppet last run on wdqs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:54:24] <wikibugs>	 06Operations, 07HHVM: HHVM 3.18 crashes when Cirrus tries to fetch another wiki config via maint script - https://phabricator.wikimedia.org/T161520#3156972 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff So this is a generic problem with depleting the HHVM byte code cache and would've happe...
[10:54:26] <wikibugs_>	 06Operations, 07HHVM, 13Patch-For-Review, 07Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3156977 (10MoritzMuehlenhoff)
[10:54:59] <wikibugs_>	 (03PS1) 10Addshore: wmgUseInterwikiSorting true for wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346523 (https://phabricator.wikimedia.org/T162253)
[10:56:17] <wikibugs_>	 (03CR) 10Addshore: [C: 04-2] "Waiting for the 24th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346523 (https://phabricator.wikimedia.org/T162253) (owner: 10Addshore)
[10:56:49] <wikibugs>	 (03PS2) 10Addshore: wmgUseInterwikiSorting true for wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346523 (https://phabricator.wikimedia.org/T162253)
[10:57:01] <wikibugs_>	 06Operations, 07HHVM: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ) - https://phabricator.wikimedia.org/T161684#3156988 (10MoritzMuehlenhoff)
[10:57:04] <wikibugs>	 06Operations, 07HHVM, 13Patch-For-Review, 07Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3156987 (10MoritzMuehlenhoff)
[10:59:02] <wikibugs_>	 (03PS1) 10Addshore: Deploy Cognate to production wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346524 (https://phabricator.wikimedia.org/T150182)
[11:04:02] <wikibugs_>	 (03PS3) 10Addshore: Remove Wikibase vs Interwikisorting checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341127 (https://phabricator.wikimedia.org/T150183)
[11:06:13] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2054" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346526
[11:06:19] <wikibugs_>	 (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2054" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346526
[11:07:01] <wikibugs_>	 (03CR) 10Addshore: [C: 04-2] Deploy Cognate to production wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346524 (https://phabricator.wikimedia.org/T150182) (owner: 10Addshore)
[11:08:47] <wikibugs_>	 (03PS1) 10Jcrespo: mariadb: Depool db1055 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346527 (https://phabricator.wikimedia.org/T159319)
[11:09:19] <wikibugs_>	 (03PS7) 10Muehlenhoff: Provide a systemd override unit for memcached [puppet] - 10https://gerrit.wikimedia.org/r/319820
[11:10:47] <_joe_>	 moritzm: what is the upstream bug for the hhvm issue?
[11:10:59] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: Add phase-9 varnish puppet run to restore order to dc_from [switchdc] - 10https://gerrit.wikimedia.org/r/346310
[11:13:00] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-2] "I'm not merging this because there are pending changes to the procedure." [switchdc] - 10https://gerrit.wikimedia.org/r/346310 (owner: 10Giuseppe Lavagetto)
[11:13:09] <moritzm>	 _joe_: which one? the stat_cache crash or the stat_cache deadlock?
[11:13:16] <_joe_>	 the latter
[11:13:46] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[11:14:02] <moritzm>	 _joe_: https://github.com/facebook/hhvm/issues/7756 I'm currently building a new 3.18.2 package with the patch on top
[11:14:54] <wikibugs_>	 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3157002 (10elukey) @aaron thanks a lot for the feedback, I created a code change that...
[11:15:54] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] "Heads up" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346527 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo)
[11:16:14] <wikibugs_>	 (03CR) 10Marostegui: "thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346527 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo)
[11:16:26] <_joe_>	 moritzm: nice, thanks
[11:17:00] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] "I will take 3 days more or less :-/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346527 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo)
[11:17:12] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Depool db1055 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346527 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo)
[11:17:21] <wikibugs_>	 (03CR) 10jenkins-bot: mariadb: Depool db1055 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346527 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo)
[11:17:29] <wikibugs_>	 (03CR) 10Milimetric: [C: 032] "The dependent change was merged." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344209 (owner: 10Milimetric)
[11:18:11] <wikibugs_>	 (03PS2) 10Milimetric: Revert "Restore Dashiki config in CommonSettings for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344209
[11:18:46] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[11:18:57] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1055 for maintenance (duration: 00m 40s)
[11:19:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:47] <wikibugs_>	 (03CR) 10jenkins-bot: Revert "Restore Dashiki config in CommonSettings for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344209 (owner: 10Milimetric)
[11:20:02] <volans>	 moritzm: ping me when you want to do the postgres update, sorry I was a bit sidetracked in the morning
[11:20:20] * moritzm too, we can do it now if you want
[11:21:07] <wikibugs>	 (03PS3) 10Marostegui: Revert "db-codfw.php: Depool db2054" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346526
[11:21:12] <volans>	 sure, if the situation is stable
[11:22:02] <volans>	 are you taking care of the upgrade of the postgres on the puppetdb master/slave hosts and I just have to take care of puppetdb service restart?
[11:22:20] <volans>	 do we have a procedure (for the first part)? (apart doing the slave first ofc ) :D
[11:22:56] <icinga-wm>	 RECOVERY - puppet last run on wdqs1002 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[11:24:10] <wikibugs_>	 (03PS2) 10Volans: Puppet: do not deactivate hosts in PuppetDB automatically [puppet] - 10https://gerrit.wikimedia.org/r/346110 (https://phabricator.wikimedia.org/T159163)
[11:24:36] <icinga-wm>	 PROBLEM - puppet last run on db1063 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:24:41] <moritzm>	 we can doublecheck with akosiaris, but from what I can tell the procedure boils down to "shut up the icinga-bot and upgrade postgres" :-)
[11:24:52] <volans>	 for that part yes
[11:24:56] <icinga-wm>	 PROBLEM - puppet last run on pc1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:25:03] <volans>	 I was wondering for the postgres replication
[11:25:06] <icinga-wm>	 PROBLEM - citoid endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:25:06] <icinga-wm>	 PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:25:56] <icinga-wm>	 RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[11:25:56] <icinga-wm>	 RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[11:27:39] <moritzm>	 it's just a minor version bump, there should be no difference in replication, but we can wait for akosiaris to comment
[11:28:43] <volans>	 like in mysql/mariadb usually we depool + stop the replica before stopping mysql for a cleaner shutdown
[11:29:04] <volans>	 not sure for the equivalent here, if any
[11:30:39] <wikibugs_>	 (03PS2) 10Elukey: Apply role memcached to the new mc1* hosts [puppet] - 10https://gerrit.wikimedia.org/r/337010 (https://phabricator.wikimedia.org/T137345)
[11:33:24] <volans>	 moritzm: anyway I can check the replication status and connections on postgres
[11:33:37] <moritzm>	 ok
[11:36:32] <volans>	 let me know if you want do it now or wait for a feedback ;)
[11:37:54] <moritzm>	 I think we can do it now
[11:38:09] <volans>	 agree
[11:38:35] <volans>	 given that I need to run puppet after the merge either you upgrade before or after to avoid puppet doing stuff while upgrading
[11:38:38] <volans>	 any preference?
[11:39:18] <moritzm>	 I'm fine both ways
[11:39:53] <volans>	 same here, merge first, run puppet and disable puppet so then you're free to upgrade?
[11:40:10] <elukey>	 _joe_ is https://gerrit.wikimedia.org/r/337010 good to go right? We'll flip it to mediawiki::memcached after the switchover
[11:40:15] <moritzm>	 volans: sounds good
[11:40:25] <volans>	 ok proceeding
[11:40:39] <_joe_>	 elukey: +1
[11:40:45] <elukey>	 super
[11:40:52] <volans>	 elukey: let me merge this change please
[11:40:54] <volans>	 one sec :D
[11:40:56] <wikibugs>	 (03CR) 10Volans: [C: 032] Puppet: do not deactivate hosts in PuppetDB automatically [puppet] - 10https://gerrit.wikimedia.org/r/346110 (https://phabricator.wikimedia.org/T159163) (owner: 10Volans)
[11:41:03] <wikibugs_>	 (03CR) 10Giuseppe Lavagetto: [C: 031] Apply role memcached to the new mc1* hosts [puppet] - 10https://gerrit.wikimedia.org/r/337010 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey)
[11:42:20] <akosiaris>	 moritzm: yeah, no way that upgrade is gonna affect replication
[11:42:33] <volans>	 !log disabling ircecho for the merge of gerrit/346110 ( T159163 ) and postgres upgrade
[11:42:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:41] <stashbot>	 T159163: PuppetDB is auto-deactivating hosts - https://phabricator.wikimedia.org/T159163
[11:43:29] <wikibugs_>	 (03PS15) 10Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718)
[11:45:16] <wikibugs>	 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3157056 (10Superzerocool) >>! In T161836#3146753, @Superzerocool wrote: > Hi, I'll add another one: https://commons.wikim...
[11:45:41] <volans>	 moritzm: all yours
[11:45:47] <volans>	 puppet disabled on nihal/nitrogen
[11:46:31] <moritzm>	 ok, starting with the slave, then (nihal)
[11:47:22] <moritzm>	 volans: nihal upgraded, could you briefly check the replication status?
[11:47:26] <volans>	 disabled also on einsteinium to not start again ircecho in few minutes
[11:47:35] * volans checking moritzm 
[11:48:01] <volans>	 mmmh no replication
[11:48:41] <volans>	 wait
[11:49:07] <volans>	 the replication is on nitrogen
[11:49:43] <volans>	 yeah sorry my bad
[11:49:50] <volans>	 postgres is the other way around
[11:50:16] <moritzm>	 ok, proceeding with update on nitrogen, then
[11:50:31] <volans>	 ok, looks good 
[11:50:45] <moritzm>	 done
[11:51:10] <volans>	 you sure they got restarted? :D
[11:51:39] <volans>	 too quick to be true :D
[11:51:49] <moritzm>	 yeah, all the postgres procs are from 11:50
[11:52:07] <volans>	 backend_start    | 2017-04-05 11:50:52.335469+00
[11:52:08] <volans>	 yep
[11:52:13] <volans>	 looks good so far
[11:52:23] <moritzm>	 great
[11:52:31] <volans>	 I see clients connections
[11:53:15] <moritzm>	 yeah, I think we can re-enable puppet
[11:54:19] <volans>	 sure on those 2
[11:54:28] <volans>	 I'm gathering a list of failed puppet to force a run
[11:54:47] <volans>	 and then I can re-enable it in einsteinium and start ircecho
[11:58:01] <moritzm>	 sounds good!
[11:58:07] <volans>	 already running cumin :D
[11:59:35] <wikibugs>	 (03PS16) 10Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718)
[12:00:01] <volans>	 ok, back to normal
[12:00:20] <volans>	 !log re-enabled puppet on nitrogen/nihal/einsteinium, restarted ircecho
[12:00:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:33] <moritzm>	 back to "puppet normal" :-)
[12:02:42] <volans>	 ofc :D
[12:02:46] <volans>	 nothing more
[12:03:01] <wikibugs_>	 (03PS4) 10Muehlenhoff: Add explicit dependency on ghostscript [puppet] - 10https://gerrit.wikimedia.org/r/313963
[12:03:50] <volans>	 and I can confirm seeing updated catalogs in the slave, as expected
[12:04:40] <wikibugs_>	 06Operations, 07Puppet, 13Patch-For-Review: PuppetDB is auto-deactivating hosts - https://phabricator.wikimedia.org/T159163#3157065 (10Volans) 05Open>03Resolved
[12:04:52] <moritzm>	 !log upgrade remaining ca-certificates from jessie point update
[12:04:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:40] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: Add task to restore the TTL of discovery entries to 5 minutes [switchdc] - 10https://gerrit.wikimedia.org/r/346311
[12:09:56] <icinga-wm>	 PROBLEM - puppet last run on restbase1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:10:01] <wikibugs_>	 (03PS17) 10Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718)
[12:11:35] <wikibugs_>	 06Operations, 10media-storage: Swiftrepl was stuck in an infinite loop since days - https://phabricator.wikimedia.org/T162122#3157072 (10Volans) The first run of the `swiftrepl` has finally completed! It is now in the 2 hour sleep between runs, I'll check the next one completes without manual intevention.
[12:19:22] <wikibugs_>	 (03CR) 10Hoo man: [C: 031] "I find the setting name rather weird, but that's ok for now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345387 (https://phabricator.wikimedia.org/T159828) (owner: 10Daniel Kinzler)
[12:19:49] <wikibugs>	 (03PS18) 10Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718)
[12:23:44] <wikibugs_>	 (03PS19) 10Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718)
[12:31:40] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db1070 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[12:32:55] <volans>	 marostegui, jynus, I guess was expected this one ^^^
[12:34:26] <icinga-wm>	 PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:36:31] <marostegui>	 volans: yep :)
[12:37:17] <wikibugs_>	 (03PS3) 10Elukey: Apply role memcached to the new mc1* hosts [puppet] - 10https://gerrit.wikimedia.org/r/337010 (https://phabricator.wikimedia.org/T137345)
[12:37:22] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2054" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346526 (owner: 10Marostegui)
[12:37:27] <wikibugs_>	 (03CR) 10Elukey: [C: 032] "No op checking https://puppet-compiler.wmflabs.org/6027/" [puppet] - 10https://gerrit.wikimedia.org/r/337010 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey)
[12:37:57] <icinga-wm>	 RECOVERY - puppet last run on restbase1015 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[12:38:07] <wikibugs>	 06Operations, 10hardware-requests: EQIAD: 2 hardware access request for kubernetes-staging - https://phabricator.wikimedia.org/T162257#3157106 (10akosiaris)
[12:38:26] <wikibugs_>	 (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2054" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346526 (owner: 10Marostegui)
[12:38:36] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2054" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346526 (owner: 10Marostegui)
[12:38:38] <wikibugs_>	 06Operations, 10hardware-requests: EQIAD: 2 hardware access request for kubernetes-staging - https://phabricator.wikimedia.org/T162257#3157106 (10akosiaris)
[12:38:40] <wikibugs>	 06Operations, 05Goal, 07kubernetes: Design and implement a Kubernetes-based staging environment. (stretch) - https://phabricator.wikimedia.org/T162045#3157120 (10akosiaris)
[12:40:10] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2054 - T160390 (duration: 00m 44s)
[12:40:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:17] <stashbot>	 T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390
[12:40:21] <wikibugs_>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346534 (https://phabricator.wikimedia.org/T160390)
[12:41:38] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346534 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui)
[12:42:40] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346534 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui)
[12:42:53] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346534 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui)
[12:44:08] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2047 - T160390 (duration: 00m 41s)
[12:44:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:22] <marostegui>	 !log Deploy schema change db2047 (s7) - T160390
[12:44:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:28] <wikibugs>	 (03PS20) 10Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718)
[12:46:22] <wikibugs_>	 (03CR) 10Gehel: [C: 032] elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) (owner: 10Gehel)
[12:49:56] <icinga-wm>	 PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:49:56] <icinga-wm>	 PROBLEM - Check systemd state on relforge1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:50:00] <wikibugs>	 (03PS1) 10Gehel: elasticsearch - ferm hosts need to be space separated, not coma separated [puppet] - 10https://gerrit.wikimedia.org/r/346537
[12:50:24] <gehel>	 ^relforge alert is me, fix on the way
[12:51:18] <wikibugs_>	 (03CR) 10Gehel: [C: 032] elasticsearch - ferm hosts need to be space separated, not coma separated [puppet] - 10https://gerrit.wikimedia.org/r/346537 (owner: 10Gehel)
[12:54:57] <icinga-wm>	 PROBLEM - Check systemd state on relforge1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:57:37] <elukey>	 !log reimage analytics1035 (journal node) to Debian Jessie
[12:57:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:14] <wikibugs>	 (03PS1) 10Gehel: elasticsearch - maintenance_hosts is actually already resolved to IPs [puppet] - 10https://gerrit.wikimedia.org/r/346538
[12:58:38] <wikibugs_>	 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3157141 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1035.eqiad.wmnet'] ``` The log can...
[12:59:08] <wikibugs>	 (03Abandoned) 10Giuseppe Lavagetto: cache::text: remove direct route to mediawiki from eqiad [puppet] - 10https://gerrit.wikimedia.org/r/346322 (owner: 10Giuseppe Lavagetto)
[12:59:20] <wikibugs_>	 (03CR) 10Gehel: [C: 032] elasticsearch - maintenance_hosts is actually already resolved to IPs [puppet] - 10https://gerrit.wikimedia.org/r/346538 (owner: 10Gehel)
[13:00:05] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170405T1300).
[13:00:05] <jouncebot>	 Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:01:06] <icinga-wm>	 RECOVERY - Check systemd state on relforge1002 is OK: OK - running: The system is fully operational
[13:01:06] <icinga-wm>	 PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:01:23] <hashar>	 o/
[13:01:37] <wikibugs>	 (03PS2) 10Hashar: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346324 (https://phabricator.wikimedia.org/T162089) (owner: 10Urbanecm)
[13:01:38] <_joe_>	 hashar: are you SWATTING?
[13:01:49] <hashar>	 _joe_: yes
[13:01:49] <wikibugs>	 06Operations, 10ops-eqiad: rack and set up analyics1058-1069 - https://phabricator.wikimedia.org/T162216#3157143 (10Ottomata)
[13:01:53] <_joe_>	 can I ask you to merge two patches of mine during this window?
[13:01:54] <hashar>	 unless there is something bad going on
[13:01:56] <icinga-wm>	 RECOVERY - Check systemd state on relforge1001 is OK: OK - running: The system is fully operational
[13:02:06] <hashar>	 yeah totally
[13:02:10] <_joe_>	 no, I just forgot to add myself to the calendar, sorry :P
[13:02:15] <hashar>	 let me push the simple throttle rule
[13:02:22] <hashar>	 no worries
[13:02:25] <hashar>	 monday is rather busy
[13:02:26] <icinga-wm>	 RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[13:02:40] <hashar>	 but other days we usually have only 2-3 patches
[13:02:56] <icinga-wm>	 PROBLEM - puppet last run on mw1287 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[ca-certificates]
[13:03:02] <wikibugs>	 (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346324 (https://phabricator.wikimedia.org/T162089) (owner: 10Urbanecm)
[13:03:40] <hashar>	 _joe_: what are the patches? :}
[13:04:14] <wikibugs>	 (03Merged) 10jenkins-bot: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346324 (https://phabricator.wikimedia.org/T162089) (owner: 10Urbanecm)
[13:04:27] <wikibugs>	 (03CR) 10jenkins-bot: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346324 (https://phabricator.wikimedia.org/T162089) (owner: 10Urbanecm)
[13:04:58] <_joe_>	 hashar: https://gerrit.wikimedia.org/r/#/c/316317/ and https://gerrit.wikimedia.org/r/#/c/345510/
[13:05:21] <_joe_>	 the first one is a bit more complex
[13:05:53] <logmsgbot>	 !log hashar@tin Synchronized wmf-config/throttle.php: Add new throttle rule - T162089 (duration: 00m 40s)
[13:05:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:00] <stashbot>	 T162089: Lift IP rate limit - Workshop - 2017-04-06 - https://phabricator.wikimedia.org/T162089
[13:06:03] <wikibugs_>	 06Operations, 10ops-eqiad: rack and set up analyics1058-1069 - https://phabricator.wikimedia.org/T162216#3157157 (10Ottomata) RAID config should be identical to other nodes, e.g. analytics1057.  I think /dev/sda is Hardware RAID 1 on the 2 2.5" flex bay drives.  The rest 12 drives are JBOD, so you can leave th...
[13:06:44] <hashar>	 _joe_: guess we can do the switch of ores to discovery first
[13:06:49] <hashar>	 though I have no clue how to validate that one
[13:06:51] <_joe_>	 hashar: nope
[13:06:59] <_joe_>	 hashar: that basically includes the other one
[13:07:03] <hashar>	 ah
[13:07:13] <_joe_>	 the dangerous part, that is 
[13:07:22] <_joe_>	 e.g. calling ores on its internal url
[13:07:54] <_joe_>	 in fact, I just realized this is not needed atm
[13:08:02] <_joe_>	 hashar: no need to merge this
[13:08:10] <_joe_>	 I have to think it through a bit more
[13:08:27] <_joe_>	 sorry, just hit me that this way we're bypassing the varnish cache
[13:08:41] <_joe_>	 and that's not good
[13:09:52] <_joe_>	 I'm clarifying that with brandon; also, this means that switching traffic == switching ores for mediawiki
[13:09:56] <_joe_>	 so for now it's ok
[13:10:12] <_joe_>	 hashar: so, thanks but I'll merge tomorrow in case
[13:10:18] <hashar>	 sure :-}
[13:10:24] <hashar>	 we can do it in our morning if you want
[13:10:30] <_joe_>	 ok
[13:10:38] <_joe_>	 that might be a good idea too
[13:11:13] <hashar>	 and if someone from traffic is needed, deploy anytime one of them show up
[13:12:13] <wikibugs>	 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3157170 (10Ottomata) > How's the process to decommission db1047 going?  I guess ok!  I think we should just dump all the user created databases to a file and archive it before...
[13:14:32] <wikibugs_>	 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3157185 (10Marostegui) >>! In T156844#3157170, @Ottomata wrote: >> How's the process to decommission db1047 going? >  > I guess ok!  I think we should just dump all the user c...
[13:14:33] <wikibugs>	 (03PS4) 10Ottomata: Improvements to eventlogging_sync.sh script [puppet] - 10https://gerrit.wikimedia.org/r/345646 (https://phabricator.wikimedia.org/T124307)
[13:15:48] <wikibugs_>	 (03CR) 10Ottomata: [V: 032 C: 032] Improvements to eventlogging_sync.sh script [puppet] - 10https://gerrit.wikimedia.org/r/345646 (https://phabricator.wikimedia.org/T124307) (owner: 10Ottomata)
[13:17:56] <icinga-wm>	 RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[13:20:07] <icinga-wm>	 PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:20:17] <wikibugs>	 (03PS1) 10Hoo man: Temporarily enable change dispatch logging on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346540 (https://phabricator.wikimedia.org/T159828)
[13:21:18] <wikibugs>	 (03PS1) 10Ottomata: Properly default to master database name when slave database not given [puppet] - 10https://gerrit.wikimedia.org/r/346541 (https://phabricator.wikimedia.org/T124307)
[13:21:45] <wikibugs>	 (03CR) 10Ottomata: [V: 032 C: 032] Properly default to master database name when slave database not given [puppet] - 10https://gerrit.wikimedia.org/r/346541 (https://phabricator.wikimedia.org/T124307) (owner: 10Ottomata)
[13:24:42] <wikibugs_>	 (03PS1) 10DCausse: [cirrus] Increase max field count for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346542
[13:24:55] <wikibugs>	 (03PS1) 10Gehel: wdqs: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/346543
[13:25:18] <wikibugs_>	 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3157205 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1035.eqiad.wmnet'] ```  and were **ALL** successful.
[13:25:28] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Increase max field count for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346542 (owner: 10DCausse)
[13:25:31] <wikibugs_>	 (03CR) 10Gehel: [C: 04-1] "Waiting for full reimport of wdqs codfw cluster before merging this" [puppet] - 10https://gerrit.wikimedia.org/r/346543 (owner: 10Gehel)
[13:26:24] <wikibugs_>	 (03PS2) 10Giuseppe Lavagetto: cache::text: switch all mediawiki to codfw [puppet] - 10https://gerrit.wikimedia.org/r/346320
[13:26:26] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: discovery::app_routes: switch mediawiki to codfw [puppet] - 10https://gerrit.wikimedia.org/r/346321
[13:26:43] <wikibugs_>	 (03PS2) 10DCausse: [cirrus] Increase max field count for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346542
[13:28:04] <halfak>	 ORES
[13:28:59] <tiddlywink>	 any idea why the RC feed is doing this for en.wp ? ...performing the action "edit" on [[Madelaine Petsch]]. Actions taken: Interdire la modification ([[Special:AbuseLog/18211089|details]])
[13:29:43] <tiddlywink>	 the attempted edit has a little French, as it's a translation, would that explain the filter tripping with the summary "Interdire la modification" ?
[13:30:16] <icinga-wm>	 RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[13:31:14] <wikibugs>	 (03PS1) 10Jcrespo: Make mediawiki-eqiad dc read-only before switchover to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346544 (https://phabricator.wikimedia.org/T154658)
[13:31:16] <icinga-wm>	 RECOVERY - puppet last run on mw1287 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[13:31:56] <wikibugs_>	 (03CR) 10Jcrespo: [C: 04-2] "Do not deploy until April 19th." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346544 (https://phabricator.wikimedia.org/T154658) (owner: 10Jcrespo)
[13:32:07] <wikibugs_>	 (03PS1) 10Hoo man: Temporarily disable the change dispatch cron for testwikidata [puppet] - 10https://gerrit.wikimedia.org/r/346545 (https://phabricator.wikimedia.org/T159828)
[13:32:11] <hashar>	 tiddlywink: arent you using french as the interface language?
[13:32:28] <hashar>	 ah no
[13:32:29] <tiddlywink>	 nope
[13:32:31] <hashar>	 https://en.wikipedia.org/wiki/Special:AbuseLog/18211089?uselang=fr   vs  https://en.wikipedia.org/wiki/Special:AbuseLog/18211089?uselang=en
[13:32:41] <hashar>	 at least those have the proper text
[13:32:49] <tiddlywink>	 it's the IRC feed, not the on-wiki logs
[13:33:32] <tiddlywink>	 also get Actions taken: Avertir l’utilisateur ([[Special:AbuseLog/18211001|details]]) for the prior attempt
[13:34:16] <wikibugs>	 (03CR) 10Volans: "Are we keeping the "3 minutes"?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346544 (https://phabricator.wikimedia.org/T154658) (owner: 10Jcrespo)
[13:35:40] <hashar>	 tiddlywink: maybe the irc log message got formatted based on the user language
[13:35:43] <hashar>	 instead of the project lnaugage
[13:36:21] <tiddlywink>	 that would make sense
[13:37:22] <wikibugs>	 (03PS1) 10Jcrespo: Make mediawiki codfw dc read-write after switchover to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346547 (https://phabricator.wikimedia.org/T154658)
[13:38:21] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-2] "Do not deploy until April 19th." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346547 (https://phabricator.wikimedia.org/T154658) (owner: 10Jcrespo)
[13:39:21] <_joe_>	 jynus: I already have a cumulative patch for all that, given we can sync one file at a time 
[13:39:40] <_joe_>	 https://gerrit.wikimedia.org/r/#/c/346251/
[13:39:40] <jynus>	 oh
[13:39:47] <volans>	 an if you do distinct ones, please follow the order, so they can be merged without rebase
[13:39:47] <jynus>	 I didn't know that
[13:39:49] <_joe_>	 but don't throw yours away
[13:40:05] <jynus>	 I thought you needed help with that
[13:40:11] <volans>	 the commonsettings one goes in the middle of the other 2
[13:40:16] <jynus>	 yes
[13:40:21] <jynus>	 I was about to that that ones now
[13:40:27] <jynus>	 CommonSettings.php
[13:40:54] <volans>	 but has to be in the middle, while you've already stacked the other 2
[13:40:56] <jynus>	 it is literally the same thing
[13:41:12] <jynus>	 it doesn't matter- all will be premerged in advance
[13:41:24] <jynus>	 just synced in order
[13:41:27] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[13:41:48] <jynus>	 _joe_ but don't throw yours away- no reason to not, why?
[13:42:10] <_joe_>	 jynus: there was some debate over merging a single patch and multiple ones, AIUI
[13:42:20] <volans>	 and to pre-merge or not ;)
[13:42:30] <jynus>	 I say premerge
[13:42:39] <_joe_>	 I agree with you jaime
[13:42:42] <jynus>	 but have the individual 
[13:42:48] <jynus>	 reverts
[13:42:55] <jynus>	 which is the only reason to have them separate
[13:43:13] <_joe_>	 if we need to revert in a hurry during the switchover, the best way is
[13:43:13] <jynus>	 this is the largest pain in time, we want to only wait 2-3 minutes
[13:43:21] <_joe_>	 git checkout HEAD~1 -- file
[13:43:24] <jynus>	 yep
[13:43:25] <volans>	 also I need to know if we stay with "3" minutes or use the "15" because the check will look for it
[13:43:30] <_joe_>	 scap sync-file file
[13:43:39] <jynus>	 I would go with 3 as that would be my aim
[13:43:53] <_joe_>	 jynus: no way we can pull it off
[13:44:08] <_joe_>	 just the codfw warmup will take us more than that
[13:44:09] <jynus>	 IF the scripts works well, yes
[13:44:12] <_joe_>	 or around that time
[13:44:16] <jynus>	 lately scap takes 1 minutes with the check
[13:44:23] <jynus>	 less without it
[13:44:25] <volans>	 jynus: we still have puppet commits and rns
[13:44:31] <volans>	 *runs
[13:44:31] <_joe_>	 then we have to manually merge a puppet patch, and some other things
[13:44:38] <_joe_>	 my personal goal is 10 minutes
[13:44:49] <jynus>	 _joe_, that is actually the deparment's goal
[13:44:50] <_joe_>	 if we can make it, I'll be impressed
[13:45:24] <jynus>	 can we do a proper production test of the script at some point- that would tell us?
[13:45:29] <_joe_>	 jynus: well, if we had etcd in mediawiki and etcd-controlled traffic switchovers, that would be easy
[13:45:36] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 629278
[13:45:42] <_joe_>	 jynus: we can test some of the steps, sure
[13:45:46] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 675275
[13:45:48] <hashar>	 you can get scap sync-file to skip the canaries check entirely :  scap sync-file --force 
[13:45:57] <hashar>	 would save a few more seconds
[13:45:59] <_joe_>	 hashar: ok good to know
[13:46:05] <jynus>	 like testing the scap with a wrong commit
[13:46:07] <_joe_>	 volans: ^^ can you add that?
[13:46:11] <volans>	 yes I know, but this will skip the linting too right?
[13:46:24] <jynus>	 volans, it will skipp waiting for canary traffic
[13:46:24] <_joe_>	 volans: linting?
[13:46:29] <jynus>	 linting is done my CI
[13:46:32] <_joe_>	 volans: scap does linting?
[13:46:40] <hashar>	 the canary query logstash, sleep(20), query logstash
[13:46:40] <hashar>	 something like that
[13:46:49] <_joe_>	 yeah let's not do that here
[13:46:53] <jynus>	 that is not needed here
[13:46:56] <_joe_>	 logstash will also be a shitshow
[13:46:59] <_joe_>	 :P
[13:47:03] <jynus>	 I told you that, volans about --force
[13:47:17] <jynus>	 yeah, it could fail due to read only mode
[13:47:20] <jynus>	 and be ok
[13:47:44] <jynus>	 what we need is produciton testing + thoroug manual review beforhand
[13:47:57] <hashar>	 and one can check with tyler/chad  , but maybe there is a way to first deploy the files
[13:48:01] <hashar>	 which takes a while
[13:48:16] <icinga-wm>	 RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[13:48:17] <hashar>	 then as a second step do the switch from a version to another  which would be quite faster
[13:48:39] <jynus>	 that would be nice if there are scap tricks, as it will be automatized
[13:48:48] <jynus>	 not something we can pull off on an emergency
[13:48:52] <jynus>	 but good to know
[13:49:45] <_joe_>	 yeah I wouldn't focus on that
[13:49:59] <jynus>	 I've seem on the steps there is a mw_primary change, but I said to do that async- only monitoring uses it to know if to page or not
[13:50:26] <_joe_>	 jynus: yeah that change will be puppet-merged toghether with a varnish one
[13:50:39] <_joe_>	 that's the only merge (and manual step) needed during the switchover
[13:51:00] <volans>	 hashar: it's scap sync-file --force OR scap --force sync-file ?
[13:51:08] <jynus>	 no need to run it for me, do the varnish one if you need it
[13:51:49] <hashar>	 scap sync-file --force
[13:52:00] <hashar>	 you can give it a try on beta cluster   on deployment-tin
[13:53:02] <hashar>	 and if you pass it "--beta-only-change"
[13:53:14] <hashar>	 it does not touch InitialiseSettings.php so the conf is stall
[13:53:24] <hashar>	 so in theory a hack would be:  scap sync-file --beta-only-change  foobar.php
[13:53:44] <Urbanecm>	 hashar, zeljkof, was my patch deployed? Sorry for my repeating lateness...
[13:53:46] <hashar>	 but no that is terrible idea. Forget me
[13:53:47] <hashar>	 Urbanecm: yes
[13:53:54] <Urbanecm>	 Thank you!
[13:54:15] <hashar>	 Urbanecm: the throttle changes I don't mind deploying them without you being around.  They are super easy to check
[13:54:32] <wikibugs>	 (03PS1) 10Volans: Scap: use --force to skip canaries checks [switchdc] - 10https://gerrit.wikimedia.org/r/346549 (https://phabricator.wikimedia.org/T160178)
[13:55:00] <thcipriani>	 if you're worried about a temporary logstash explosion then --force will not run the logstash checks, and yeah beta-only-change will leave all the appservers alone so they won't re-read initialisesettings.php
[13:55:01] <Urbanecm>	 hashar, does it mean if I has only throttle changes I can just schedule them? 
[13:56:04] <wikibugs>	 (03CR) 10Hashar: [C: 031] ""scap sync-file --force"  would skip the sequence of:  logstash, sleep(20, logstash check. So that should speed up the overall runtime." [switchdc] - 10https://gerrit.wikimedia.org/r/346549 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[13:56:25] <jynus>	 ^is that right? doesn't that need to change the pwd?
[13:56:26] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[13:57:03] <volans>	 jynus: AFAIK yes, not needed anymore, and tested few days ago
[13:57:10] <hashar>	 and that code is probably all wrong 
[13:57:11] <jynus>	 ok, cool
[13:57:12] <wikibugs_>	 (03PS4) 10Muehlenhoff: Disable wireshark-common/install-setuid to avoid debconf prompt [puppet] - 10https://gerrit.wikimedia.org/r/346162
[13:57:17] <jynus>	 I didn't know
[13:57:29] <volans>	 hashar: thcipriani can you confirm?
[13:57:52] <hashar>	 what does : remote.select('R:Class = Deployment::Rsync and R:Class%cron_ensure = absent').sync( foo )  d does?
[13:58:02] <hashar>	 is that selecting a bunch of host then run "sync" on them ?
[13:58:10] <volans>	 selects the deployment host
[13:58:13] <hashar>	 oh
[13:58:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Disable wireshark-common/install-setuid to avoid debconf prompt [puppet] - 10https://gerrit.wikimedia.org/r/346162 (owner: 10Muehlenhoff)
[13:58:29] <volans>	 is ugly but is the way that from how it's puppetized can be selected, according to joe ;)
[13:58:54] <wikibugs_>	 06Operations, 10Datasets-General-or-Unknown, 07HHVM, 10Wikidata: Enable GC for HHVM CLI (at least for dump runners) - https://phabricator.wikimedia.org/T162245#3157282 (10MoritzMuehlenhoff) p:05Triage>03Normal
[13:58:58] <wikibugs>	 06Operations, 10hardware-requests: EQIAD: 2 hardware access request for kubernetes-staging - https://phabricator.wikimedia.org/T162257#3157284 (10MoritzMuehlenhoff) p:05Triage>03Normal
[13:59:05] <hashar>	 hiera is the canonical place for the deployment server :   deployment_server: tin.eqiad.wmnet
[13:59:06] <jynus>	 why is cron absemt from the primary dc?
[13:59:18] <jynus>	 or is it the secondary on purpose?
[13:59:33] <volans>	 the secondary has the rsync cron I think, need to recheck though to be sure
[13:59:49] <volans>	 I din't do that part :D
[14:00:01] <hashar>	 and for scripts usually one can just   use the DNS entry deployment.eqiad.wmnet
[14:00:03] <jynus>	 I was mostly asking joe :-)
[14:00:10] <hashar>	 which should point to the right primary deployment server
[14:00:24] <wikibugs>	 06Operations, 10Datasets-General-or-Unknown, 07HHVM, 10Wikidata, 15User-Elukey: Enable GC for HHVM CLI (at least for dump runners) - https://phabricator.wikimedia.org/T162245#3157285 (10elukey)
[14:00:52] <jynus>	 I am going to abandon my patches
[14:01:02] <volans>	 hashar: can you confirm that scap don't need to change directory to be run right? 
[14:01:09] <jynus>	 I think joe is better, and 100% clone of what I was going to do
[14:01:50] <hashar>	 volans: I see a patch by Chad that mentionned an issue when being run out of /srv/mediawiki-staging
[14:02:01] <hashar>	 so probably safer to change the cwd
[14:02:20] <wikibugs>	 (03CR) 10Thcipriani: [C: 031] Scap: use --force to skip canaries checks [switchdc] - 10https://gerrit.wikimedia.org/r/346549 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:02:48] <hashar>	 and I dont think you need to sudo -u os.getlogin()
[14:03:00] <wikibugs_>	 (03CR) 10Jcrespo: [C: 031] Switch master DC from eqiad to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346251 (owner: 10Giuseppe Lavagetto)
[14:03:00] <hashar>	 anyway, the scap sync-file --force looks good
[14:03:18] <hashar>	 (as long as filename does not have a space in it which it should not)
[14:04:19] <wikibugs_>	 (03Abandoned) 10Jcrespo: Make mediawiki codfw dc read-write after switchover to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346547 (https://phabricator.wikimedia.org/T154658) (owner: 10Jcrespo)
[14:04:21] <volans>	 yes, we already have 3 layers of quotes, would like to avoid the 4th ;)
[14:04:52] <wikibugs>	 (03Abandoned) 10Jcrespo: Make mediawiki-eqiad dc read-only before switchover to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346544 (https://phabricator.wikimedia.org/T154658) (owner: 10Jcrespo)
[14:05:19] <wikibugs>	 (03CR) 10Ema: [C: 031] Disable wireshark-common/install-setuid to avoid debconf prompt [puppet] - 10https://gerrit.wikimedia.org/r/346162 (owner: 10Muehlenhoff)
[14:08:22] <wikibugs>	 (03PS1) 10Jgreen: temporarily move fundraisingdbread.wmnet to db1025 for db maintenance [dns] - 10https://gerrit.wikimedia.org/r/346550
[14:09:36] <wikibugs>	 (03CR) 10Volans: [C: 032] Scap: use --force to skip canaries checks [switchdc] - 10https://gerrit.wikimedia.org/r/346549 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:09:48] <wikibugs_>	 (03CR) 10Jgreen: [C: 032] temporarily move fundraisingdbread.wmnet to db1025 for db maintenance [dns] - 10https://gerrit.wikimedia.org/r/346550 (owner: 10Jgreen)
[14:11:56] <icinga-wm>	 PROBLEM - puppet last run on wtp1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:14:07] <elukey>	 hoo: o/ - I can try to help for T162245, so things will speed up a bit
[14:14:07] <stashbot>	 T162245: Enable GC for HHVM CLI (at least for dump runners) - https://phabricator.wikimedia.org/T162245
[14:15:31] <elukey>	 hoo: if I got it correctly, it should be a matter of adding hhvm.enable_gc=true to the /etc/hhvm/php.ini of snapshot100* right?
[14:18:26] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[14:22:36] <icinga-wm>	 PROBLEM - puppet last run on wtp1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:23:26] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[14:27:30] <Danny_B>	 https://upload.wikimedia.org/wikipedia/commons/thumb/e/e3/Incubator-logo.svg/13px-Incubator-logo.svg.png returns "Content Encoding Error"
[14:27:44] <wikibugs>	 (03PS2) 10Andrew Bogott: Keystonehooks:  Create and delete sudoer rules in ldap [puppet] - 10https://gerrit.wikimedia.org/r/346489 (https://phabricator.wikimedia.org/T150091)
[14:28:16] <icinga-wm>	 PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:29:17] <paladox>	 ema ^^
[14:30:38] <volans>	 paladox: when the message is this one and there aren't a ton of them, it's not an issue, transient known failure (although a bit noisy)
[14:30:44] <wikibugs_>	 (03PS1) 10Elukey: Enable hhvm GC for CLI on snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/346554 (https://phabricator.wikimedia.org/T162245)
[14:31:53] <paladox>	 oh
[14:31:54] <paladox>	 sorry
[14:32:14] <volans>	 nw, just FYI ;)
[14:32:22] <paladox>	 ok
[14:33:56] <icinga-wm>	 PROBLEM - puppet last run on wtp1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:34:13] <wikibugs>	 (03PS4) 10Jcrespo: [WIP]Remove $::mw_primary variable from puppet [puppet] - 10https://gerrit.wikimedia.org/r/345346 (https://phabricator.wikimedia.org/T156924)
[14:37:24] <wikibugs_>	 06Operations, 10ops-codfw, 10DBA: codfw rack/setup first 10 DB servers - https://phabricator.wikimedia.org/T162159#3157431 (10Papaul)
[14:38:32] <wikibugs_>	 (03CR) 10Hoo man: "Why not use role/common/snapshot/dumper.yaml?" [puppet] - 10https://gerrit.wikimedia.org/r/346554 (https://phabricator.wikimedia.org/T162245) (owner: 10Elukey)
[14:38:45] <hoo>	 thanks for looking into they, elukey!
[14:38:56] <icinga-wm>	 RECOVERY - puppet last run on wtp1018 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[14:39:36] <elukey>	 hoo: thanks for the review! Will look into that :)
[14:39:46] <elukey>	 hoo: I was looking for that file, much better
[14:40:46] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[14:41:24] <wikibugs_>	 (03PS5) 10Jcrespo: [WIP]Remove $::mw_primary variable from puppet [puppet] - 10https://gerrit.wikimedia.org/r/345346 (https://phabricator.wikimedia.org/T156924)
[14:42:25] <hoo>	 :)
[14:42:52] <wikibugs_>	 (03PS1) 10Jgreen: fix reversed frdb1002/frdev1001 IPs, re-pool frdb1001 for now [dns] - 10https://gerrit.wikimedia.org/r/346556
[14:44:08] <wikibugs_>	 (03CR) 10Jgreen: [C: 032] fix reversed frdb1002/frdev1001 IPs, re-pool frdb1001 for now [dns] - 10https://gerrit.wikimedia.org/r/346556 (owner: 10Jgreen)
[14:45:26] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[14:45:46] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[14:49:27] <wikibugs>	 (03CR) 10Muehlenhoff: "Did you test these options on a 3.12 installation? According to http://hhvm.com/blog/2017/02/15/hhvm-3-18.html they were introduced with 3" [puppet] - 10https://gerrit.wikimedia.org/r/346554 (https://phabricator.wikimedia.org/T162245) (owner: 10Elukey)
[14:49:56] <wikibugs_>	 (03PS2) 10Elukey: Enable hhvm GC for CLI on snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/346554 (https://phabricator.wikimedia.org/T162245)
[14:50:26] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[14:50:36] <icinga-wm>	 RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[14:53:22] <wikibugs_>	 (03PS2) 10Matthias Mullie: Enable 3d extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur)
[14:53:31] <wikibugs>	 (03CR) 10Matthias Mullie: Enable 3d extension on beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur)
[14:53:56] <wikibugs>	 (03PS3) 10Matthias Mullie: Enable 3d extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur)
[14:55:22] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Enable 3d extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur)
[14:56:03] <wikibugs>	 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3157520 (10dr0ptp4kt) @Dzahn does Google Search Console for noc@ show that the site verification code matches what's already in DNS like you communicated here? I...
[14:56:22] <elukey>	 hoo: Moritz is right, there is no compatibility section in the hhvm docs so I got fooled, not sure if the GC options that I put are available for 3.12
[14:56:44] <hoo>	 :/
[14:57:06] <elukey>	 hoo: maybe we could test them in deploymnent-prep? 
[14:57:16] <icinga-wm>	 RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[14:57:34] <hoo>	 I suppose we could
[14:57:41] <hoo>	 Trusty is not going to get 3.18, is it?
[14:58:07] <elukey>	 hoo: probably not.. and we are still testing 3.18 because if shows some issues
[14:58:12] <elukey>	 hoo: maybe http://php.net/manual/en/info.configuration.php#ini.zend.enable-gc could wokr?
[14:58:15] <elukey>	 *work
[14:58:34] <moritzm>	 hoo: no, let's better migrate the snapshot* hosts to jessie
[14:58:35] <hoo>	 There's even a user space function
[14:59:31] <moritzm>	 elukey: HHVM upstream fixed the deadlock, currently building the package, then it's hopefully ready
[14:59:48] <hoo>	 I guess we could also hack this via mediawiki-config
[15:00:00] <wikibugs>	 (03CR) 10Reedy: [C: 04-1] Enable 3d extension on beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur)
[15:00:01] <hoo>	 but I would still like for the settings to be set via hiera
[15:00:34] <elukey>	 hoo: what would be the best way to test the hhvm settings for you? I am a bit ignorant about snapshot*
[15:01:56] <icinga-wm>	 RECOVERY - puppet last run on wtp1011 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[15:03:28] <wikibugs_>	 (03CR) 10Reedy: [C: 04-1] Enable 3d extension on beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur)
[15:03:54] <hoo>	 elukey: Well, whatever works for me
[15:03:58] <hoo>	 for example mwdebug1001
[15:05:30] <elukey>	 hoo: no no I mean where I can tweak hhvm settings (possibly not production) and let you check the GC settings
[15:05:51] <wikibugs>	 06Operations, 10ops-codfw, 10DBA: codfw rack/setup first 10 DB servers - https://phabricator.wikimedia.org/T162159#3157547 (10Marostegui) ` install_server module update (mac address and partitioning info,) Please provide partition schema`  Please create a RAID10 with the following options (https://wikitech.w...
[15:06:12] <hoo>	 Not sure we have enough data in beta to even see the memory leak, but it could be
[15:06:37] <hoo>	 that's why I suggested changing the settings on one of the mwdebugs and then test a dumper
[15:06:43] <hoo>	 well "memory leak"
[15:06:59] <elukey>	 ahhh we can do that in there too? If so mwdebug1001 should be good
[15:07:04] <hoo>	 Like do an actual partial dump of wikidata to /dev/null
[15:07:30] <elukey>	 can I ping you in ~1 hour after my meetings?
[15:07:41] <hoo>	 mwdebug are jessie, though
[15:07:44] <hoo>	 sure
[15:15:13] <elukey>	 hoo: in the meantime, can you test mwdebug1001 and see if you can repro the leak?
[15:15:53] <hoo>	 Is it live therE?
[15:16:14] <hoo>	 I mean, did you change the settings there?
[15:16:23] <elukey>	 nono
[15:16:27] <elukey>	 still not
[15:16:39] <elukey>	 but I want to make sure that we can repro before changing the settings
[15:16:48] <hoo>	 I tested this on mwdebug1002 earlier today and it indeed started to blow up
[15:16:59] <elukey>	 super
[15:17:02] <wikibugs_>	 (03PS1) 10Jcrespo: Kill long running queries longer with shorter terms: [software] - 10https://gerrit.wikimedia.org/r/346559 (https://phabricator.wikimedia.org/T160984)
[15:18:54] <wikibugs>	 (03PS2) 10Jcrespo: Kill long running queries with stricter conditions [software] - 10https://gerrit.wikimedia.org/r/346559 (https://phabricator.wikimedia.org/T160984)
[15:19:36] <wikibugs_>	 (03CR) 10Marostegui: [C: 031] Kill long running queries with stricter conditions [software] - 10https://gerrit.wikimedia.org/r/346559 (https://phabricator.wikimedia.org/T160984) (owner: 10Jcrespo)
[15:20:04] <elukey>	 !log playing with hhvm settings on mwdebug1002
[15:20:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:15] <wikibugs_>	 (03PS4) 10Matthias Mullie: Enable 3d extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur)
[15:20:19] <wikibugs>	 (03CR) 10Jcrespo: "We are not going to just deploy this, it will need a very slow  and progressive deployment." [software] - 10https://gerrit.wikimedia.org/r/346559 (https://phabricator.wikimedia.org/T160984) (owner: 10Jcrespo)
[15:23:09] <elukey>	 hoo: I added the enable_gc options, but those probably will fail.. can we make a test?
[15:23:30] <hoo>	 Sure, I'll start a dumper
[15:24:02] <hoo>	 running on mwdebug1001, I'm monitoring the memory use
[15:25:03] <elukey>	 hoo: also do you have a quick way to check var_dump(gc_enabled()) ?
[15:25:31] <elukey>	 (as you may have guessed I am a total newbie with php)
[15:25:34] <hoo>	 > var_dump(gc_enabled());
[15:25:35] <hoo>	 bool(false)
[15:26:07] <hoo>	 dumper memory usage was also growing, killed it now
[15:28:50] <elukey>	 hoo: mmm I tried <?php var_dump(gc_enabled()); ?> and then hhvm filename.php, I get bool true
[15:29:14] <hoo>	 $ sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php eval.php --wiki wikidatawiki
[15:29:40] <hoo>	 I used that
[15:29:51] <wikibugs_>	 (03CR) 10Jcrespo: "This may need some extra refactoring- servers are never going to be runing more than 10 queries at the same time due to the queuing system" [software] - 10https://gerrit.wikimedia.org/r/346559 (https://phabricator.wikimedia.org/T160984) (owner: 10Jcrespo)
[15:30:01] <elukey>	 elukey@mwdebug1002:~$ sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php eval.php --wiki wikidatawiki
[15:30:02] <hoo>	 (can also be used with php5, in that case it has GC)
[15:30:04] <elukey>	 > var_dump(gc_enabled());
[15:30:07] <elukey>	 bool(true)
[15:30:26] <elukey>	 argh you tried mwdebug1001 probably
[15:30:30] <hoo>	 yeah
[15:30:47] <elukey>	 sorry I saw "I tested this on mwdebug1002" and my brain did ssh mwdebug1002
[15:32:12] <wikibugs_>	 06Operations, 10DBA, 10Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#3157631 (10matmarex) a:05matmarex>03None There is nothing else I can do myself to resolve this. I do not have the access to run the two queries I pos...
[15:33:31] <elukey>	 hoo: let's try on mwdebug1002
[15:33:37] <hoo>	 ok
[15:34:08] <hoo>	 there it's true
[15:36:46] <wikibugs>	 (03PS3) 10Jcrespo: Kill long running queries with stricter conditions [software] - 10https://gerrit.wikimedia.org/r/346559 (https://phabricator.wikimedia.org/T160984)
[15:38:52] <wikibugs>	 06Operations, 10DBA, 10Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#2953992 (10jcrespo) This was classified as a low priority task. It will be eventually done, do not worry, it is not forgotten, but at the cost of other,...
[15:41:10] <wikibugs_>	 06Operations, 10DBA, 10Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#3157662 (10Marostegui) For the record, I checked the "consistency" of that row across s4 (commons) and s1 (enwiki), and to make sure at least it is prese...
[15:41:56] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 263.89 seconds
[15:42:31] <wikibugs_>	 (03PS1) 10Marostegui: s1.hosts: Remove db1057 [software] - 10https://gerrit.wikimedia.org/r/346563 (https://phabricator.wikimedia.org/T162135)
[15:42:48] <wikibugs>	 (03CR) 10Jcrespo: [C: 031] s1.hosts: Remove db1057 [software] - 10https://gerrit.wikimedia.org/r/346563 (https://phabricator.wikimedia.org/T162135) (owner: 10Marostegui)
[15:43:48] <wikibugs>	 (03CR) 10Marostegui: [C: 032] s1.hosts: Remove db1057 [software] - 10https://gerrit.wikimedia.org/r/346563 (https://phabricator.wikimedia.org/T162135) (owner: 10Marostegui)
[15:44:50] <wikibugs_>	 (03Merged) 10jenkins-bot: s1.hosts: Remove db1057 [software] - 10https://gerrit.wikimedia.org/r/346563 (https://phabricator.wikimedia.org/T162135) (owner: 10Marostegui)
[15:51:51] <wikibugs_>	 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3157734 (10Dzahn) @dr0ptp4kt I don't really see the code but it has a green check box next to "DNS".  I gave full access to abaso@wikimedia.org for https://media...
[15:52:28] <wikibugs>	 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3157735 (10ayounsi) pwstore works fine!  We should be good to close this task.
[15:53:00] <wikibugs_>	 06Operations: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401#3157737 (10Halfak)
[16:02:26] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[16:03:36] <icinga-wm>	 PROBLEM - puppet last run on mw1186 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:04:35] <wikibugs>	 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3157769 (10dr0ptp4kt) Bummer, it looks like that didn't do it. Would you please check the noc@ email to see if there's a site verification request that you can a...
[16:07:26] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[16:07:41] <wikibugs>	 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3157775 (10Dzahn) I don't really know how to check noc@wikimedia.org email. If i try to login with the credentials i have on mail, i get the " Add Gmail to your...
[16:08:15] <wikibugs_>	 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3157776 (10Deskana) >>! In T161343#3157734, @Dzahn wrote: > Also note an existing owner you share full access with is "searchteam+gwt@wikimedia.org". Do you know...
[16:08:46] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 612730
[16:09:50] <wikibugs>	 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3157778 (10Dzahn) checking the "messages" in Search Console itself, the last one is from 4/1/17.
[16:10:46] <icinga-wm>	 PROBLEM - jenkins_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war
[16:10:52] <wikibugs_>	 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3157779 (10ayounsi) Juniper case 2017-0405-0571 opened.
[16:13:15] <icinga-wm>	 ACKNOWLEDGEMENT - jenkins_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war amusso Jenkins is not active on contint2001 yet.
[16:14:26] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[16:14:58] <elukey>	 hoo|away: did you manage to run the job on mwdebug1002?
[16:15:00] <wikibugs>	 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3157854 (10Dzahn) @dr0ptp4kt I also added you http://mediawiki.org, https://www.mediawiki.org and  https://www.mediawiki.org  with Full access..  any difference...
[16:18:06] <icinga-wm>	 PROBLEM - Keyholder SSH agent on mira is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[16:18:46] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 0
[16:19:26] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[16:20:08] <wikibugs_>	 (03PS2) 10Subramanya Sastry: Delink new parsoid-vd test runs from updates to parsoid git repo [puppet] - 10https://gerrit.wikimedia.org/r/346196
[16:20:40] <wikibugs>	 (03PS2) 10Dzahn: Remove Apache <IfVersion < 2.4> across the tree [puppet] - 10https://gerrit.wikimedia.org/r/346128 (owner: 10Faidon Liambotis)
[16:21:30] <subbu>	 _joe_ mutante volans can one of you review and +2 https://gerrit.wikimedia.org/r/#/c/346196/ ... thanks.
[16:23:20] <wikibugs_>	 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3157874 (10Tbayer) >>! In T156844#3157170, @Ottomata wrote: >> How's the process to decommission db1047 going? >  > I guess ok!  I think we should just dump all the user creat...
[16:26:19] <wikibugs_>	 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3157875 (10Ottomata) > I thought the plan was to import them (in particular the "staging" database) to dbstore1002, so that they can be queried there as before? Ah sure we can...
[16:27:58] <wikibugs_>	 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3157878 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1028.eqiad.wmnet'] ``` The log can b...
[16:32:36] <icinga-wm>	 RECOVERY - puppet last run on mw1186 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[16:36:26] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[16:38:21] <hoo>	 elukey: Didn't yet try, will do immediately
[16:40:39] <elukey>	 super thanks!
[16:41:26] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[16:41:31] <hoo>	 Not working memory usage is growing steadily
[16:42:16] <elukey>	 well let's wait a bit
[16:44:24] <wikibugs>	 (03CR) 10Faidon Liambotis: "I think we can silence stderr though, it's getting a little bit too spammy." [puppet] - 10https://gerrit.wikimedia.org/r/346116 (owner: 10Alexandros Kosiaris)
[16:44:58] <hoo>	 I'm at 12% of the 4gb now… with GC enabled I was at maybe 4
[16:45:06] <hoo>	 and still growing
[16:45:33] <elukey>	 hoo: but you were actively calling gc_collect_cycles right?
[16:45:58] <hoo>	 Yeah, I'm not sure I tried calling the function to enable it
[16:47:25] <elukey>	 I am pretty sure that hhvm.enable_gc is the zend circular ref collector, so 3.18 might carry a better GC alg
[16:47:29] <hoo>	 I'm not sure how hhvm's GC works, but php's kicks in rather early
[16:47:48] <hoo>	 (per default at least)
[16:51:00] <elukey>	 hoo: so you're saying that with php5 the GC kicked in earlier?
[16:51:24] <hoo>	 Yeah, definitely
[16:51:52] <elukey>	 but with or without gc_collect_cycles ?
[16:52:56] <icinga-wm>	 PROBLEM - puppet last run on analytics1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:53:26] <hoo>	 elukey: Without
[16:53:26] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[16:53:34] <hoo>	 I only tried collect cycles on hhvm
[16:53:51] <elukey>	 ah snap
[16:53:52] <wikibugs_>	 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3157937 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1028.eqiad.wmnet'] ```  and were **ALL** successful.
[16:53:55] <elukey>	 this is weird
[16:54:49] <elukey>	 hoo: let's try zend.enable_gc, maybe it is different
[16:54:52] <elukey>	 changing the config
[16:55:46] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 2207
[16:56:50] <hoo>	 Ok
[16:57:00] <hoo>	 the script is still accumulating memory
[16:57:07] <elukey>	 hoo: can you stop the jobs?
[16:57:19] <elukey>	 (on mwdebug)
[16:57:50] <moritzm>	 !log rearmed keyholder on mira after reboot
[16:57:53] <hoo>	 done
[16:57:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:17] <wikibugs>	 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3157942 (10MoritzMuehlenhoff) 05Open>03Resolved
[16:58:44] <icinga-wm>	 RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys.
[17:05:34] <elukey>	 hoo, moritzm - I am chatting with people in #hhvm and they said that the options mentioned in the docs were already present before 3.12 but not really reliable/working
[17:05:40] <elukey>	 until 3.18
[17:05:50] <wikibugs_>	 (03PS1) 10BBlack: cache_misc: noc.wm.o active/active [puppet] - 10https://gerrit.wikimedia.org/r/346572
[17:06:03] <elukey>	 so to solve this problem we'd probably need to wait until the 3.18 package is ready to be used
[17:06:24] <moritzm>	 yeah, sounds like it. otherwise they wouldn't have mentioned it in the release notes I guess
[17:07:03] <hoo>	 I see
[17:07:05] <elukey>	 BUT it is weird that <?php var_dump(gc_enabled()); ?> returns true on hhvm
[17:07:13] <elukey>	 not sure what it does in the background
[17:07:37] <wikibugs>	 (03PS3) 10Dzahn: Remove Apache <IfVersion < 2.4> across the tree [puppet] - 10https://gerrit.wikimedia.org/r/346128 (owner: 10Faidon Liambotis)
[17:07:38] <elukey>	 hoo: you said though that calling collect cycles was working
[17:07:52] <hoo>	 but only on 3.18, didn't try it on 3.12
[17:07:56] <elukey>	 so either we do some hack with 3.12 to call collect cycles periodically or we wait for 3.18
[17:08:00] <elukey>	 ahhhh okok
[17:08:01] <hoo>	 though it might just work there
[17:08:24] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[17:08:52] <elukey>	 hoo: from what the hhvm people are saying probably not, only from 3.15 onwards
[17:09:25] <hoo>	 Shoot
[17:09:31] <hoo>	 How did this work a couple of weeks ago
[17:09:39] <hoo>	 How did this ever work?
[17:09:45] <wikibugs>	 (03Abandoned) 10Elukey: Enable hhvm GC for CLI on snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/346554 (https://phabricator.wikimedia.org/T162245) (owner: 10Elukey)
[17:10:08] <wikibugs_>	 (03PS1) 10BBlack: cache_misc: config-master.wm.o active/active [puppet] - 10https://gerrit.wikimedia.org/r/346573
[17:10:54] <icinga-wm>	 PROBLEM - puppet last run on dbproxy1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:10:59] <elukey>	 hoo: with "this" you mean the wikidata script?
[17:11:06] <hoo>	 yes, the dumpers
[17:11:12] <hoo>	 they are on hhvm for quite a while now
[17:11:17] <hoo>	 and only suddenly it blew up
[17:12:19] <elukey>	 no idea :(
[17:12:32] <wikibugs>	 (03CR) 10BBlack: wdqs: active/active public interface (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346543 (owner: 10Gehel)
[17:12:34] <hoo>	 Same here :/
[17:12:50] <hoo>	 Nothing changed
[17:14:51] <wikibugs_>	 (03CR) 10Gehel: [C: 04-1] wdqs: active/active public interface (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346543 (owner: 10Gehel)
[17:14:54] <wikibugs>	 06Operations, 10Datasets-General-or-Unknown, 07HHVM, 10Wikidata, and 2 others: Enable GC for HHVM CLI (at least for dump runners) - https://phabricator.wikimedia.org/T162245#3156596 (10elukey) Tested with @hoo the settings outlined in https://docs.hhvm.com/hhvm/configuration/INI-settings on mwdebug1002. A...
[17:15:08] <elukey>	 updated the task
[17:15:54] <hoo>	 elukey: Did you undo your changes on mwdebug1002, yet?
[17:16:04] <hoo>	 > var_dump(gc_enabled()); -> bool(false)
[17:16:06] <wikibugs_>	 (03PS2) 10Gehel: wdqs: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/346543 (https://phabricator.wikimedia.org/T162111)
[17:16:09] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "double-checked one more time, doing it now" [puppet] - 10https://gerrit.wikimedia.org/r/346128 (owner: 10Faidon Liambotis)
[17:17:08] <wikibugs>	 06Operations, 10Datasets-General-or-Unknown, 07HHVM, 10Wikidata, and 2 others: Enable GC for HHVM CLI (at least for dump runners) - https://phabricator.wikimedia.org/T162245#3156596 (10MoritzMuehlenhoff) Note that also requires a migration of the snapshot hosts to jessie (which was blocked so far by a bug...
[17:18:48] <wikibugs_>	 (03CR) 10BBlack: "It's tricky to quantify.  For most users, most of the time, they'll consistently be routed to one side or the other.  For lots of users, r" [puppet] - 10https://gerrit.wikimedia.org/r/346543 (https://phabricator.wikimedia.org/T162111) (owner: 10Gehel)
[17:18:54] <wikibugs_>	 (03PS1) 10Andrew Bogott: labs_bootstrapvz:  Don't include mlocate or ecryptfs-utils [puppet] - 10https://gerrit.wikimedia.org/r/346575
[17:19:05] <elukey>	 hoo: yep
[17:19:21] <mutante>	 a lot of misc Apaches are reloading, removing the <2.4 config snippets.. i am watching it
[17:19:43] <hoo>	 elukey: Ok, so that explains that
[17:20:01] <elukey>	 hoo: I tried zend.enable_gc but of course it doesn't work :)
[17:20:07] <paladox>	 the change is for gerrit.wikimedia.org which i just tested and is still working. So gerrit should be unaffected :)
[17:20:46] <hoo>	 elukey: :S
[17:20:55] <hoo>	 One last thing, I can try the user space function
[17:21:01] <wikibugs_>	 (03CR) 10Andrew Bogott: [C: 032] labs_bootstrapvz:  Don't include mlocate or ecryptfs-utils [puppet] - 10https://gerrit.wikimedia.org/r/346575 (owner: 10Andrew Bogott)
[17:21:22] <mutante>	 yes, it's all fine, no problems 
[17:21:38] <wikibugs>	 06Operations, 10ops-codfw, 10DBA: setup  tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3158019 (10RobH)
[17:21:55] <icinga-wm>	 RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[17:22:51] <wikibugs>	 (03PS4) 10Paladox: Gerrit: Disable md5 in ssh [puppet] - 10https://gerrit.wikimedia.org/r/346180
[17:23:04] <wikibugs_>	 06Operations, 10ops-codfw, 10DBA, 10hardware-requests: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3140543 (10RobH) 05Open>03stalled p:05Triage>03Normal I'm setting this to stalled and normal priority, as this task will also serv...
[17:24:54] <icinga-wm>	 PROBLEM - puppet last run on elastic1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:27:10] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] Keystonehooks:  Create and delete sudoer rules in ldap [puppet] - 10https://gerrit.wikimedia.org/r/346489 (https://phabricator.wikimedia.org/T150091) (owner: 10Andrew Bogott)
[17:27:16] <wikibugs_>	 (03PS3) 10Andrew Bogott: Keystonehooks:  Create and delete sudoer rules in ldap [puppet] - 10https://gerrit.wikimedia.org/r/346489 (https://phabricator.wikimedia.org/T150091)
[17:27:32] <wikibugs>	 (03PS1) 10RobH: setting dns for tempdb2001 [dns] - 10https://gerrit.wikimedia.org/r/346576
[17:28:02] <wikibugs>	 (03CR) 10RobH: [C: 032] setting dns for tempdb2001 [dns] - 10https://gerrit.wikimedia.org/r/346576 (owner: 10RobH)
[17:32:18] <wikibugs_>	 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: setup  tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3158082 (10jcrespo) To clarify the state of this, we still need this ASAP for service implementation ahead of the switchover (that can take quite some time, it is more than just runn...
[17:33:04] <hoo>	 elukey: Why did you abandon the change for setting the config?
[17:33:10] <hoo>	 Shouldn't we at least set it?
[17:33:27] <wikibugs_>	 (03CR) 10EBernhardson: [C: 031] "Not sure if its there, but certainly the wikidata documentation probably also needs to include mention of this configuration setting." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346542 (owner: 10DCausse)
[17:34:04] <icinga-wm>	 PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py]
[17:37:24] <icinga-wm>	 PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/0/0: down - Core: asw-esams:xe-3/0/42 (GBLX leg 2) {#14007} [10Gbps DF CWDM C49]BR
[17:39:54] <icinga-wm>	 RECOVERY - puppet last run on dbproxy1004 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[17:40:24] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[17:43:21] <wikibugs>	 (03PS1) 10RobH: tempdb2001 install parameters [puppet] - 10https://gerrit.wikimedia.org/r/346577
[17:43:32] <wikibugs>	 (03Abandoned) 10EBernhardson: Prevent wikidata dumps from taking all memory on snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/345170 (https://phabricator.wikimedia.org/T161577) (owner: 10EBernhardson)
[17:43:41] <wikibugs_>	 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: setup  tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3158106 (10RobH) I'm getting the OS installed today and handed off.
[17:44:12] <wikibugs_>	 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: setup  tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3158108 (10RobH)
[17:44:40] <wikibugs>	 (03PS2) 10RobH: tempdb2001 install parameters [puppet] - 10https://gerrit.wikimedia.org/r/346577
[17:44:53] <wikibugs_>	 (03CR) 10RobH: [C: 032] tempdb2001 install parameters [puppet] - 10https://gerrit.wikimedia.org/r/346577 (owner: 10RobH)
[17:49:27] <wikibugs>	 (03PS5) 10Madhuvishy: tools: job to copytruncate logs in place [puppet] - 10https://gerrit.wikimedia.org/r/326153 (https://phabricator.wikimedia.org/T152235) (owner: 10Rush)
[17:52:54] <icinga-wm>	 RECOVERY - puppet last run on elastic1023 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[17:55:00] <wikibugs_>	 (03PS1) 10Thcipriani: Scap: update version to 3.5.4-1 [puppet] - 10https://gerrit.wikimedia.org/r/346579 (https://phabricator.wikimedia.org/T127762)
[17:55:24] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[17:56:06] <wikibugs_>	 (03PS1) 10Jcrespo: Indicate install recipes for newest db1* and db2* DB servers [puppet] - 10https://gerrit.wikimedia.org/r/346580 (https://phabricator.wikimedia.org/T162159)
[17:58:54] <icinga-wm>	 PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:59:25] <wikibugs_>	 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup first 10 DB servers - https://phabricator.wikimedia.org/T162159#3158194 (10jcrespo) ^the above should be enough for the recipe. In addition to what Manuel stated, given problems we had in the past, we need to check:  * IPMI calls work a...
[17:59:44] <icinga-wm>	 PROBLEM - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100%
[18:00:04] <icinga-wm>	 PROBLEM - puppet last run on mw2177 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:00:05] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170405T1800).
[18:00:19] <mutante>	 ema: ^ and there it goes again..  lvs2002 is just broken
[18:00:24] <icinga-wm>	 PROBLEM - puppet last run on ms-fe2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:01:04] <icinga-wm>	 PROBLEM - puppet last run on suhail is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:01:11] <wikibugs>	 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3158202 (10Dzahn) it went down again:  11:02 < icinga-wm> PROBLEM - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100%
[18:01:29] <wikibugs>	 (03PS2) 10Jcrespo: Indicate install recipes for newest db1* and db2* DB servers [puppet] - 10https://gerrit.wikimedia.org/r/346580 (https://phabricator.wikimedia.org/T162159)
[18:02:31] <wikibugs_>	 (03CR) 10Marostegui: [C: 031] "Thanks for taking care of this!" [puppet] - 10https://gerrit.wikimedia.org/r/346580 (https://phabricator.wikimedia.org/T162159) (owner: 10Jcrespo)
[18:04:04] <icinga-wm>	 RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[18:04:27] <mutante>	 !log lvs2002 - power off via mgmt (it was down but still showed power as on)
[18:04:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:50] <wikibugs_>	 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3158209 (10jcrespo) The guidance is the same as T162159#3157547 (documented for databases on https://wikitech.wikimedia.org/wiki/Raid_and_MegaCli#Raid_setup_at_Wikimedia )....
[18:05:52] <wikibugs>	 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3158211 (10Dzahn) < bblack> perhaps power it off to make sure it doesn't blip back on, for now  ``` Server Power: On  </>hpiLO-> power off  status=0 status_tag=COMMAND COMPLETED Wed Apr  5 18:03:48 2017...
[18:06:01] <wikibugs_>	 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3158212 (10Dzahn) p:05Normal>03High
[18:07:24] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[18:08:09] <wikibugs_>	 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3158216 (10Dzahn) @papaul Could you take a look at this. It seems we might have to call HP. We should make this a priority since we'll soon be moving all our traffic to codfw temporarily.
[18:11:51] <wikibugs>	 (03CR) 10Chad: [C: 031] Scap: update version to 3.5.4-1 [puppet] - 10https://gerrit.wikimedia.org/r/346579 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani)
[18:12:24] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[18:13:44] <wikibugs_>	 (03PS1) 10Jdlrobson: Update Russian Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346581 (https://phabricator.wikimedia.org/T162036)
[18:17:00] <wikibugs>	 (03CR) 10Rush: tools: job to copytruncate logs in place (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/326153 (https://phabricator.wikimedia.org/T152235) (owner: 10Rush)
[18:17:28] <wikibugs_>	 (03PS1) 10Jdlrobson: Deploy Page previews to stable on Hungrian and Hebrew Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346584 (https://phabricator.wikimedia.org/T162162)
[18:24:24] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[18:25:54] <icinga-wm>	 RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[18:28:04] <icinga-wm>	 RECOVERY - puppet last run on mw2177 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[18:28:26] <wikibugs_>	 (03PS1) 10RobH: update tempdb2001 partitioning [puppet] - 10https://gerrit.wikimedia.org/r/346587
[18:28:40] <wikibugs_>	 (03PS2) 10RobH: update tempdb2001 partitioning [puppet] - 10https://gerrit.wikimedia.org/r/346587
[18:28:54] <wikibugs_>	 (03CR) 10RobH: [C: 032] update tempdb2001 partitioning [puppet] - 10https://gerrit.wikimedia.org/r/346587 (owner: 10RobH)
[18:29:04] <icinga-wm>	 RECOVERY - puppet last run on suhail is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[18:29:24] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[18:29:24] <icinga-wm>	 RECOVERY - puppet last run on ms-fe2005 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[18:38:22] <wikibugs>	 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3158350 (10ayounsi) Juniper is ready to proceed with an RMA. We need to sync up with the DC's remote hands for that.
[18:41:24] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[18:41:52] <wikibugs_>	 (03PS5) 10Andrew Bogott: wmfkeystonehooks:  Maintain project page on wikitech [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091)
[18:43:25] <wikibugs_>	 (03PS6) 10Andrew Bogott: wmfkeystonehooks:  Maintain project page on wikitech [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091)
[18:45:48] <wikibugs>	 (03CR) 10Subramanya Sastry: "Clarification comment for benefit of reviewers:" [puppet] - 10https://gerrit.wikimedia.org/r/346196 (owner: 10Subramanya Sastry)
[18:49:31] <wikibugs_>	 (03PS3) 10Dzahn: installserver: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345549 (owner: 10Faidon Liambotis)
[18:50:04] <icinga-wm>	 PROBLEM - puppet last run on db1068 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:50:37] <paladox>	 mutante lol you just rebased and it is showing as merge conflicts ^^
[18:50:37] <wikibugs>	 (03PS7) 10Andrew Bogott: wmfkeystonehooks:  Maintain project page on wikitech [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091)
[18:50:39] <wikibugs_>	 (03PS1) 10Andrew Bogott: Add wikitechstatusconfig for labtest [puppet] - 10https://gerrit.wikimedia.org/r/346590
[18:51:14] <wikibugs>	 (03PS4) 10Dzahn: installserver: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345549 (owner: 10Faidon Liambotis)
[18:51:26] <mutante>	 paladox: yea, it happens, the first was to remove dependencies between patches
[18:51:35] <paladox>	 yep
[18:51:54] <mutante>	 then i have to click "rebase" again in web ui, but that time it doesnt have to be manual...
[18:52:59] <wikibugs_>	 (03CR) 10Andrew Bogott: [C: 032] Add wikitechstatusconfig for labtest [puppet] - 10https://gerrit.wikimedia.org/r/346590 (owner: 10Andrew Bogott)
[18:53:09] <wikibugs>	 (03PS2) 10Andrew Bogott: Add wikitechstatusconfig for labtest [puppet] - 10https://gerrit.wikimedia.org/r/346590
[18:57:12] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:32] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:43] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:43] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:52] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:52] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:52] <icinga-wm>	 PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:52] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:53] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:58:02] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:58:02] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:58:02] <icinga-wm>	 PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:58:02] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:58:02] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:58:03] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:58:03] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:58:04] <icinga-wm>	 PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:58:05] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:58:11] <wikibugs>	 (03PS1) 10Legoktm: Deploy Linter to medium wikis too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346591 (https://phabricator.wikimedia.org/T148609)
[18:58:18] <jynus>	 not expected, but do not worry too much
[18:58:22] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:58:30] <chasemp>	 thanks jynus, I was about to start yelling
[18:58:42] <legoktm>	 jouncebot: next
[18:58:42] <jouncebot>	 In 0 hour(s) and 1 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170405T1900)
[18:58:52] <icinga-wm>	 PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:59:22] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[18:59:42] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[18:59:42] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[18:59:42] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[18:59:42] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[18:59:42] <icinga-wm>	 RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[18:59:42] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[18:59:52] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[18:59:52] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[18:59:52] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[18:59:52] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[18:59:52] <icinga-wm>	 RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[18:59:53] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[18:59:54] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[18:59:54] <icinga-wm>	 RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[18:59:54] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[19:00:02] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[19:00:03] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[19:00:04] <jouncebot>	 RainbowSprinkles: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170405T1900). Please do the needful.
[19:00:12] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[19:00:26] <wikibugs_>	 (03CR) 10Jcrespo: "Sorry:" [puppet] - 10https://gerrit.wikimedia.org/r/345372 (owner: 10Marostegui)
[19:00:51] <wikibugs_>	 (03PS1) 10Chad: group1 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346592
[19:04:42] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[19:05:20] <wikibugs>	 (03CR) 10Chad: [C: 032] group1 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346592 (owner: 10Chad)
[19:06:34] <wikibugs_>	 06Operations, 10ops-codfw, 06Performance-Team: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3158445 (10Krinkle)
[19:08:35] <wikibugs_>	 (03Merged) 10jenkins-bot: group1 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346592 (owner: 10Chad)
[19:08:46] <wikibugs>	 (03CR) 10jenkins-bot: group1 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346592 (owner: 10Chad)
[19:11:42] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[19:12:21] <logmsgbot>	 !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.19
[19:12:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:50] <wikibugs>	 (03PS8) 10Andrew Bogott: wmfkeystonehooks:  Create project page on wikitech on project creation [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091)
[19:16:42] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[19:18:02] <icinga-wm>	 RECOVERY - puppet last run on db1068 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[19:18:04] <wikibugs_>	 (03CR) 10Andrew Bogott: "I've tested this as best I can, and it works fine on labtest.  The liberty/mitaka changes are duplicates." [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) (owner: 10Andrew Bogott)
[19:20:52] <icinga-wm>	 RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[19:23:35] <Nemo_bis>	 mutante: can you something about this? https://phabricator.wikimedia.org/T161082
[19:26:34] <wikibugs_>	 06Operations, 10ops-eqiad: rack and setup boron replacement frpm1001 - https://phabricator.wikimedia.org/T162298#3158473 (10Cmjohnson)
[19:30:01] <mutante>	 Nemo_bis: indirectly, i can replace the admin (since i see philippe@ and assume he is not it anymore) if you have one and then they can handle filters
[19:30:43] <Nemo_bis>	 mutante: I can offer to be admin of wikipedia-l, but I can't admin all the mailing lists :)
[19:30:47] <mutante>	 what i'm not willing to do is handle filters for individual lists myself, just doesn't scale
[19:30:49] <Nemo_bis>	 So it would be nice to have sane defaults
[19:31:01] <mutante>	 i'll help if the admin needs to be replaced or password resdt
[19:31:06] <mutante>	 which needs master password
[19:33:06] <mutante>	 Nemo_bis: a ticket about transferring admin ship would be ideal (maybe others who philippe used to do ).. but i have to step outside.. being picked up right now
[19:33:42] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[19:34:19] <wikibugs_>	 (03PS1) 10Jgreen: flip fundraisingdb-read back to db1025 to clone frdb1001 [dns] - 10https://gerrit.wikimedia.org/r/346595
[19:35:12] <wikibugs>	 (03CR) 10Jgreen: [C: 032] flip fundraisingdb-read back to db1025 to clone frdb1001 [dns] - 10https://gerrit.wikimedia.org/r/346595 (owner: 10Jgreen)
[19:37:20] <legoktm>	 RainbowSprinkles: uh, revert train: https://phabricator.wikimedia.org/T162300
[19:37:41] <RainbowSprinkles>	 Whole train or just donatewiki?
[19:37:48] <legoktm>	 just donatewiki
[19:38:09] <legoktm>	 (I think)
[19:38:27] <logmsgbot>	 !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: roll back donatewiki to wmf.18
[19:38:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:42] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[19:38:44] <Nemo_bis>	 hmm https://fog.ccsf.edu/~msapiro/scripts/set_attributes
[19:38:50] <wikibugs_>	 (03PS1) 10Chad: Rolling back donatewiki to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346596 (https://phabricator.wikimedia.org/T162300)
[19:39:02] <wikibugs>	 (03CR) 10Chad: [C: 032] Rolling back donatewiki to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346596 (https://phabricator.wikimedia.org/T162300) (owner: 10Chad)
[19:39:47] <p858snake>	 might also want to check if foundationwiki used rawhtml in its system messages anywhere
[19:40:06] <wikibugs>	 (03Merged) 10jenkins-bot: Rolling back donatewiki to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346596 (https://phabricator.wikimedia.org/T162300) (owner: 10Chad)
[19:40:16] <wikibugs>	 (03CR) 10jenkins-bot: Rolling back donatewiki to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346596 (https://phabricator.wikimedia.org/T162300) (owner: 10Chad)
[19:40:45] <wikibugs_>	 06Operations, 10Wikimedia-General-or-Unknown: Job queue is growing and growing - https://phabricator.wikimedia.org/T124194#3158574 (10aaron)
[19:41:39] <legoktm>	 p858snake: I hit special:random a few times and didn't see anything wrong
[19:41:53] <RainbowSprinkles>	 I can roll back foundationwiki too
[19:41:56] <RainbowSprinkles>	 Just to be safe for now
[19:42:44] <logmsgbot>	 !log ppchelko@tin Started deploy [trending-edits/deploy@d8ca758]: Providing a debug endpoint
[19:42:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:08] <wikibugs>	 06Operations, 10ops-codfw, 06Performance-Team: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3158583 (10RobH) a:05RobH>03fgiunchedi So this system is booted to OS with NO networking.  The usb stick is mounted as /mnt/sde/  All the data can be copied over, b...
[19:43:40] <wikibugs>	 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3152568 (10ayounsi) Pushed the following to cr1/2.codfw.  When lvs2002 comes back online for troubleshooting it should not receive any traffic.  ``` [edit routing-options rib inet6.0 static route 2620:0...
[19:43:54] <XioNoX>	 !log pushing https://www.irccloud.com/pastebin/Kecy61aZ/ to cr1/2.codfw for T162099
[19:44:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:08] <stashbot>	 T162099: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099
[19:44:27] <wikibugs>	 (03CR) 10BryanDavis: wmfkeystonehooks:  Create project page on wikitech on project creation (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) (owner: 10Andrew Bogott)
[19:44:55] <Nemo_bis>	 (which is slightly faster to use than http://manpages.ubuntu.com/manpages/precise/man8/config_list.8.html )
[19:47:22] <icinga-wm>	 PROBLEM - puppet last run on mwdebug1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:48:54] <wikibugs>	 (03CR) 10Andrew Bogott: wmfkeystonehooks:  Create project page on wikitech on project creation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) (owner: 10Andrew Bogott)
[19:50:40] <logmsgbot>	 !log ppchelko@tin Finished deploy [trending-edits/deploy@d8ca758]: Providing a debug endpoint (duration: 07m 56s)
[19:50:43] <logmsgbot>	 !log ppchelko@tin Started deploy [trending-edits/deploy@d8ca758]: Providing a debug endpoint
[19:50:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:50:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:45] <wikibugs_>	 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: setup  tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3158631 (10RobH) a:05RobH>03jcrespo
[19:55:02] <wikibugs>	 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: setup  tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3158633 (10RobH) a:05jcrespo>03None So this is now ready for puppet key/salt key and service implementation by the #DBA team.  This already has their tag for #DBA on the task, I...
[19:55:42] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 0
[19:55:43] <logmsgbot>	 !log ppchelko@tin Finished deploy [trending-edits/deploy@d8ca758]: Providing a debug endpoint (duration: 04m 59s)
[19:55:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:04] <jouncebot>	 gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170405T2000). Please do the needful.
[20:00:14] <halfak>	 Nothing for ORES today
[20:03:34] <bblack>	 sss
[20:07:05] <wikibugs>	 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3156449 (10RobH) I've emailed evoswitch to open an inbound shipment ticket.  Once I have that reference, I'll update this task so @ayounsi can have Juniper dispatch the replacement part.
[20:07:12] <wikibugs_>	 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3158709 (10RobH) a:03RobH
[20:07:20] <logmsgbot>	 !log arlolra@tin Started deploy [parsoid/deploy@f2d4eee]: Updating Parsoid to 32b7c677
[20:07:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:42] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[20:11:43] <wikibugs_>	 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3158717 (10ayounsi) Step by step instructions for the remote hands:    # Locate the chassis: http://www.juniper.net/techpubs/en_US/release-independent/junos/topics/concept/mpc-mx480-description.html   # L...
[20:14:30] <wikibugs>	 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3158719 (10RobH) Inbound ticket # is 7326745, please go ahead and have them dispatch the part.  Update this task with the tracking # and assign to me, and I'll get the inbound ticket updated.
[20:15:42] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[20:15:47] <wikibugs>	 (03PS3) 10Zppix: Adding a few more typos that could break things if they aren't tested for. [puppet] - 10https://gerrit.wikimedia.org/r/346282
[20:16:25] <Zppix>	 can an ops member tell me if i need to have that above change swatted since its so minor?
[20:17:06] <paladox>	 Zppix if you doint find someone to merge it, you can add it to puppet swat
[20:17:07] <paladox>	 which is seperate from mediawiki swat.
[20:17:23] <paladox>	 though that looks so minor
[20:17:29] <Zppix>	 paladox:  its so minor i dont really want to waste a spot on a swat
[20:18:46] <logmsgbot>	 !log arlolra@tin Finished deploy [parsoid/deploy@f2d4eee]: Updating Parsoid to 32b7c677 (duration: 11m 26s)
[20:18:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:21] <paladox>	 yep
[20:19:44] <Zppix>	 addshore:  do you have a moment?
[20:19:47] <wikibugs>	 (03PS3) 10Mobrovac: RESTBase: Migrate to Scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/346248 (https://phabricator.wikimedia.org/T116335)
[20:27:28] <arlolra>	 !log Updated Parsoid to 32b7c677 (T112043, T161936)
[20:27:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:37] <stashbot>	 T112043: Parsoid converts <a name="foo"></a> to [ ] - https://phabricator.wikimedia.org/T112043
[20:27:37] <stashbot>	 T161936: Parsoid gallery implementation doesn't apply the style attribute - https://phabricator.wikimedia.org/T161936
[20:31:26] <wikibugs>	 (03CR) 10Marostegui: "> Sorry:" [puppet] - 10https://gerrit.wikimedia.org/r/345372 (owner: 10Marostegui)
[20:36:12] <icinga-wm>	 PROBLEM - puppet last run on restbase-dev1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:41:32] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:41:52] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:41:52] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:41:52] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:41:52] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:41:52] <icinga-wm>	 PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:41:53] <logmsgbot>	 !log ppchelko@tin Started deploy [trending-edits/deploy@475a5c0]: Fix edit scorer
[20:41:53] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:41:53] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:42:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:02] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:42:02] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:42:03] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:42:03] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:42:03] <icinga-wm>	 PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:42:03] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:42:03] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:42:04] <icinga-wm>	 PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:42:04] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:42:12] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:42:13] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:42:22] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:43:12] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[20:43:12] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[20:43:12] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[20:43:22] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[20:43:42] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[20:43:42] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[20:43:42] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[20:43:42] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[20:43:42] <icinga-wm>	 RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[20:43:42] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[20:43:43] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[20:43:52] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[20:43:52] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[20:43:52] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[20:43:52] <icinga-wm>	 RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[20:43:52] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[20:43:53] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[20:43:53] <icinga-wm>	 RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[20:43:54] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[20:43:54] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[20:44:44] <logmsgbot>	 !log ppchelko@tin Finished deploy [trending-edits/deploy@475a5c0]: Fix edit scorer (duration: 02m 51s)
[20:44:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:31] <logmsgbot>	 !log ppchelko@tin Started deploy [trending-edits/deploy@475a5c0]: Fix edit scorer
[20:47:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:38] <Zppix>	 is it me or did you just deploy the same thing twice Pchelolo 
[20:49:12] <Pchelolo>	 Zppix: no, that's indeed me
[20:49:27] <Zppix>	 Pchelolo:  thats not what i meant but ok
[20:49:54] <Zppix>	 Pchelolo:  i was confused cause the two deployments looked the same 
[20:50:14] <wikibugs>	 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3158836 (10ayounsi) From Juniper: > Thank you for the information on provided, and the  RMA request has been processed for the FPC: > -          RMA number: R200119594 > -          Product ID: MPC5E-40G10...
[20:51:25] <Pchelolo>	 Zppix: they're the same, it's an experimental service so the deploy fails sometimes (it executes a very involved calculation on startup and the checks can fail). We're gonna fix it
[20:52:18] <Zppix>	 Pchelolo: i figured, i was just making sure, i've seen accidents happen and didnt want anything bad to happen so i thought it would be better to say something then not
[20:52:42] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[20:53:06] <logmsgbot>	 !log ppchelko@tin Finished deploy [trending-edits/deploy@475a5c0]: Fix edit scorer (duration: 05m 34s)
[20:53:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:42] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[21:05:12] <icinga-wm>	 RECOVERY - puppet last run on restbase-dev1001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[21:09:59] <Zppix>	 when an operation member  gets time can they review https://gerrit.wikimedia.org/r/346282 its very minor change
[21:10:45] <RainbowSprinkles>	 Minor puppet changes go on puppet swat
[21:12:22] <Zppix>	 RainbowSprinkles:  its to the typo file is wasting a puppet swat slot really necessary 
[21:12:36] <RainbowSprinkles>	 That's what puppet swat is *for*
[21:12:38] <RainbowSprinkles>	 Minor changes
[21:13:00] <wikibugs_>	 (03PS1) 10Mobrovac: [Beta] RESTBase: Set the Cassandra seeds correctly [puppet] - 10https://gerrit.wikimedia.org/r/346639
[21:13:06] <Zppix>	 RainbowSprinkles:  do i have to be around for it cause i cant promise i will be able to
[21:13:19] <RainbowSprinkles>	 Yes, you do
[21:13:40] <Zppix>	 let me look at deployment calender and ill see if i can try 
[21:14:48] <Zppix>	 RainbowSprinkles:  the time that puppet swat is at isnt a good time for me is there any other way?
[21:15:38] <RainbowSprinkles>	 Getting puppet changes merged takes 1 of three things: puppet swat, bugging someone, or becoming a root
[21:18:54] <Zppix>	 RainbowSprinkles: i  mean if eu swat is not doing anything revelant (or they are done but there still time) could i have it done then or no? 
[21:19:06] <Zppix>	 correction not revelent i meant at the time
[21:19:16] <RainbowSprinkles>	 I mean, you gotta just find someone willing to merge
[21:19:24] <Zppix>	 alright thanks
[21:19:32] <wikibugs>	 (03CR) 10Reedy: "Are these plausible typos?" [puppet] - 10https://gerrit.wikimedia.org/r/346282 (owner: 10Zppix)
[21:20:26] <wikibugs>	 (03CR) 10Zppix: "> Are these plausible typos?" [puppet] - 10https://gerrit.wikimedia.org/r/346282 (owner: 10Zppix)
[21:20:40] <wikibugs_>	 (03CR) 10Chad: "The puppet ones seem viable to me, but I haven't seen those scap typos before." [puppet] - 10https://gerrit.wikimedia.org/r/346282 (owner: 10Zppix)
[21:20:43] <wikibugs>	 (03CR) 10Mobrovac: [C: 031] "Cherry-picked in beta, works." [puppet] - 10https://gerrit.wikimedia.org/r/346639 (owner: 10Mobrovac)
[21:21:59] <wikibugs>	 (03CR) 10Reedy: "How would it take a lot of work?" [puppet] - 10https://gerrit.wikimedia.org/r/346282 (owner: 10Zppix)
[21:22:45] <wikibugs>	 (03CR) 10Chad: [C: 031] [Beta] RESTBase: Set the Cassandra seeds correctly [puppet] - 10https://gerrit.wikimedia.org/r/346639 (owner: 10Mobrovac)
[21:23:03] <wikibugs_>	 (03CR) 10Zppix: "say if for some reason we needed to scap something automatically if you mis spell scap you'll have patches that werent scapped automatical" [puppet] - 10https://gerrit.wikimedia.org/r/346282 (owner: 10Zppix)
[21:24:19] <wikibugs>	 (03CR) 10Chad: "That doesn't even make sense. We don't automatically scap things--nor does such a thing have anything to do with rebasing or conflicts. Fi" [puppet] - 10https://gerrit.wikimedia.org/r/346282 (owner: 10Zppix)
[21:25:27] <wikibugs>	 (03PS4) 10Zppix: Adding a few more typos that could break things if they aren't tested for. [puppet] - 10https://gerrit.wikimedia.org/r/346282
[21:25:41] <wikibugs>	 (03CR) 10Zppix: "ps4 removes the scap typos and adds other puppet typos" [puppet] - 10https://gerrit.wikimedia.org/r/346282 (owner: 10Zppix)
[21:27:11] <wikibugs>	 (03CR) 10Bmansurov: [C: 031] Update Russian Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346581 (https://phabricator.wikimedia.org/T162036) (owner: 10Jdlrobson)
[21:34:38] <wikibugs>	 (03CR) 10Chad: "Is this still desired? Seems trivial enough" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280003 (owner: 10Jforrester)
[21:35:00] <wikibugs_>	 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3158919 (10RobH) I'm also being CC'd on those emails from Juniper.  Once they reply back with the tracking #, I'll update EvoSwitch for the open shipment ticket and open the ticket for the smart hands req...
[21:39:54] <wikibugs>	 (03PS2) 10Chad: Use directly wgGalleryOptions without wmg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331819 (owner: 10Dereckson)
[21:41:49] <wikibugs_>	 (03CR) 10Chad: "Want this to land? Easy enough" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334401 (https://phabricator.wikimedia.org/T156185) (owner: 10Brion VIBBER)
[21:43:03] <brion>	 RainbowSprinkles: hah i forgot about that one
[21:43:21] <RainbowSprinkles>	 Just scanning the backlog :)
[21:43:35] <brion>	 anything i need to poke on it?
[21:44:52] <icinga-wm>	 PROBLEM - parsoid on wtp2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:44:54] <RainbowSprinkles>	 Mainly just wondering if you want it live. I can do the sync easy enough
[21:45:42] <icinga-wm>	 RECOVERY - parsoid on wtp2004 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.137 second response time
[21:46:17] <brion>	 RainbowSprinkles: yeah go for it
[21:46:20] <brion>	 some files need the extra time
[21:46:32] <brion>	 thanks :D
[21:46:51] <RainbowSprinkles>	 Aww, merge conflict
[21:46:59] <Zppix>	 RainbowSprinkles:  i was just about to point that out
[21:48:16] <brion>	 poopers
[21:49:44] <RainbowSprinkles>	 Fixing
[21:49:50] <RainbowSprinkles>	 Needed a manual rebase locally
[21:50:00] <wikibugs_>	 (03PS2) 10Chad: Increase video transcode max time from 8 to 16 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334401 (https://phabricator.wikimedia.org/T156185) (owner: 10Brion VIBBER)
[21:50:32] <brion>	 i really want a 'git rebase' emoji
[21:50:42] <brion>	 it would roughly resemble that painting "The Scream"
[21:50:46] <RainbowSprinkles>	 I think that's 😖
[21:50:52] <brion>	 hehe
[21:52:18] <paladox>	 lol
[21:52:20] <Zppix>	 brion: just look up "hell" on google images
[21:52:22] <icinga-wm>	 PROBLEM - parsoid on wtp2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:52:26] <brion>	 :)
[21:53:12] <icinga-wm>	 RECOVERY - parsoid on wtp2009 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 1.448 second response time
[21:54:05] <wikibugs>	 (03CR) 10Zppix: "Is PS4 okay or is there anymore changes needed?" [puppet] - 10https://gerrit.wikimedia.org/r/346282 (owner: 10Zppix)
[21:54:16] <wikibugs_>	 (03CR) 10Chad: [C: 032] Increase video transcode max time from 8 to 16 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334401 (https://phabricator.wikimedia.org/T156185) (owner: 10Brion VIBBER)
[21:54:23] <brion>	 \o/ woo
[21:56:40] <wikibugs>	 (03Merged) 10jenkins-bot: Increase video transcode max time from 8 to 16 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334401 (https://phabricator.wikimedia.org/T156185) (owner: 10Brion VIBBER)
[21:56:49] <wikibugs_>	 (03CR) 10jenkins-bot: Increase video transcode max time from 8 to 16 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334401 (https://phabricator.wikimedia.org/T156185) (owner: 10Brion VIBBER)
[21:58:20] <logmsgbot>	 !log demon@tin Synchronized wmf-config/CommonSettings.php: bump video transcode timeouts, brion made me do it (duration: 00m 40s)
[21:58:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:32] <brion>	 hehe
[21:59:12] <icinga-wm>	 PROBLEM - parsoid on wtp2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:00:02] <icinga-wm>	 RECOVERY - parsoid on wtp2007 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.088 second response time
[22:02:04] <logmsgbot>	 !log ppchelko@tin Started deploy [trending-edits/deploy@46544de]: Correctly calculate since parameter and allow to change decay for debugging
[22:02:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:21] <brion>	 ok i gotta run some errands, i'll be back in the evening to work on schema bits
[22:02:35] <brion>	 going to try a split schedule since i often end up poking at the computer in the evening anyway ;)
[22:04:34] <logmsgbot>	 !log ppchelko@tin Finished deploy [trending-edits/deploy@46544de]: Correctly calculate since parameter and allow to change decay for debugging (duration: 02m 29s)
[22:04:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:52] <icinga-wm>	 PROBLEM - parsoid on wtp2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:05:15] <logmsgbot>	 !log ppchelko@tin Started deploy [trending-edits/deploy@46544de]: Correctly calculate since parameter and allow to change decay for debugging, attempt 2
[22:05:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:42] <icinga-wm>	 RECOVERY - parsoid on wtp2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.116 second response time
[22:12:21] <logmsgbot>	 !log ppchelko@tin Finished deploy [trending-edits/deploy@46544de]: Correctly calculate since parameter and allow to change decay for debugging, attempt 2 (duration: 07m 06s)
[22:12:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:28:51] <wikibugs_>	 (03CR) 10Jforrester: "> Is this still desired? Seems trivial enough" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280003 (owner: 10Jforrester)
[22:30:18] <wikibugs>	 (03CR) 10Jforrester: [C: 031] Deploy Linter to medium wikis too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346591 (https://phabricator.wikimedia.org/T148609) (owner: 10Legoktm)
[22:37:36] <mobrovac>	 !log restbase deploying a8d4d027
[22:37:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:55] <wikibugs_>	 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3159129 (10Papaul) a:03Papaul
[23:00:04] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170405T2300). Please do the needful.
[23:00:04] <jouncebot>	 Jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:41] <jdlrobson>	 \o
[23:01:57] <thcipriani>	 I can SWAT
[23:02:11] <RainbowSprinkles>	 I'm also self-swatting something
[23:02:14] <RainbowSprinkles>	 Last minute
[23:02:48] <Zppix>	 thcipriani:  when your done with jdlrobson  mind going ahead and swatting a puppet patch for me as i wont be able to be around for puppet swat?
[23:02:59] <thcipriani>	 RainbowSprinkles: ok :)
[23:03:15] <thcipriani>	 Zppix: I don't have +2 on operations/puppet, so I can't sorry :(
[23:03:24] <Zppix>	 thcipriani:  no worries
[23:04:02] <icinga-wm>	 PROBLEM - puppet last run on mc1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:04:24] <wikibugs_>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346584 (https://phabricator.wikimedia.org/T162162) (owner: 10Jdlrobson)
[23:06:12] <wikibugs_>	 (03Merged) 10jenkins-bot: Deploy Page previews to stable on Hungrian and Hebrew Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346584 (https://phabricator.wikimedia.org/T162162) (owner: 10Jdlrobson)
[23:06:29] <wikibugs>	 (03CR) 10jenkins-bot: Deploy Page previews to stable on Hungrian and Hebrew Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346584 (https://phabricator.wikimedia.org/T162162) (owner: 10Jdlrobson)
[23:07:42] <thcipriani>	 RainbowSprinkles: are you in the middle of SWATting? or can I go ahead with ^
[23:07:53] <RainbowSprinkles>	 Go ahead, I'm still waiting on jenkins
[23:08:00] <thcipriani>	 okie doke
[23:09:08] <thcipriani>	 jdlrobson: page previews on hewiki and huwiki is on mwdebug1002, check please
[23:09:16] <jdlrobson>	 thcipriani: on it
[23:11:43] <jdlrobson>	 thcipriani: you can proceed!
[23:11:52] <thcipriani>	 jdlrobson: ok, going live
[23:12:22] <RainbowSprinkles>	 Ah crud, thought you were done
[23:12:25] <RainbowSprinkles>	 Yay mid-sync
[23:12:31] <RainbowSprinkles>	 gimme 10 seconds
[23:12:52] <logmsgbot>	 !log demon@tin Synchronized php-1.29.0-wmf.19/extensions/Dashiki/: swattttttt (duration: 00m 41s)
[23:13:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:04] <RainbowSprinkles>	 Crap.
[23:13:21] <RainbowSprinkles>	 thcipriani: Continue.
[23:13:24] <RainbowSprinkles>	 I'm out of your way
[23:13:47] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:346584|Deploy Page previews to stable on Hungrian and Hebrew Wikipedias]] T162162 (duration: 00m 40s)
[23:13:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:54] <stashbot>	 T162162: Deploy page previews to Hungarian and Hebrew wikipedias - https://phabricator.wikimedia.org/T162162
[23:13:55] <thcipriani>	 ^ jdlrobson live now
[23:13:59] <jdlrobson>	 yay
[23:14:43] <thcipriani>	 RainbowSprinkles: somehow I must've missed your sync :)
[23:15:04] <wikibugs>	 (03PS2) 10Thcipriani: Update Russian Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346581 (https://phabricator.wikimedia.org/T162036) (owner: 10Jdlrobson)
[23:15:13] <wikibugs_>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346581 (https://phabricator.wikimedia.org/T162036) (owner: 10Jdlrobson)
[23:16:21] <wikibugs_>	 (03PS1) 10Milimetric: Revert "Revert "Restore Dashiki config in CommonSettings for now"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346661
[23:16:33] <wikibugs_>	 (03PS2) 10Milimetric: Revert "Revert "Restore Dashiki config in CommonSettings for now"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346661
[23:17:19] <wikibugs>	 (03Merged) 10jenkins-bot: Update Russian Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346581 (https://phabricator.wikimedia.org/T162036) (owner: 10Jdlrobson)
[23:17:22] <wikibugs_>	 (03CR) 10Chad: [V: 032 C: 032] Revert "Revert "Restore Dashiki config in CommonSettings for now"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346661 (owner: 10Milimetric)
[23:17:29] <wikibugs_>	 (03CR) 10jenkins-bot: Update Russian Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346581 (https://phabricator.wikimedia.org/T162036) (owner: 10Jdlrobson)
[23:18:36] <logmsgbot>	 !log demon@tin Synchronized wmf-config/CommonSettings.php: unbreak dashiki again (duration: 00m 40s)
[23:18:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:19:00] <wikibugs>	 (03CR) 10jenkins-bot: Revert "Revert "Restore Dashiki config in CommonSettings for now"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346661 (owner: 10Milimetric)
[23:20:12] <icinga-wm>	 PROBLEM - puppet last run on mc1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:21:25] <thcipriani>	 jdlrobson: I fetched the image down on mwdebug1002, dunno if I'm seeing it or not :)
[23:23:04] <jdlrobson>	 thcipriani: lemme see
[23:23:49] <jdlrobson>	 im seeing something different
[23:23:56] <jdlrobson>	 MaxSem: you around?
[23:24:01] <jdlrobson>	 need a Russian speaker :)
[23:24:42] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[23:25:08] <MaxSem>	 jdlrobson, ?
[23:25:35] <jdlrobson>	 MaxSem: could you check the logo on Russian Wikipedia on mwdebug1002 and tell me if it looks normal to you? https://ru.m.wikipedia.org/
[23:26:18] <MaxSem>	 jdlrobson, logo in the footer?
[23:26:21] <jdlrobson>	 header
[23:26:31] <jdlrobson>	 next to the hamburger
[23:26:33] <MaxSem>	 ah
[23:26:45] <MaxSem>	 that's wordmar or something :P
[23:26:56] <jdlrobson>	 it should be fine, but it would be nice to hear from a true Russian that it's an improvement :)
[23:27:10] <jdlrobson>	 compared to what we show without mwdebug1002
[23:27:10] <MaxSem>	 lgtm
[23:27:23] <jdlrobson>	 thanks MaxSem on behalf of all russians everywhere! go for it thcipriani 
[23:27:37] <thcipriani>	 :)
[23:27:39] <thcipriani>	 going live
[23:27:51] <MaxSem>	 what's the difference?
[23:29:42] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[23:29:49] <logmsgbot>	 !log thcipriani@tin Synchronized static/images/mobile/copyright/wikipedia-wordmark-ru.svg: SWAT: [[gerrit:346581|Update Russian Wikipedia logo]] T162036 (duration: 00m 40s)
[23:29:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:29:57] <stashbot>	 T162036: Rendering issues with logo in Russian Wikipedia on mobile - https://phabricator.wikimedia.org/T162036
[23:31:11] <thcipriani>	 jdlrobson: ok, sync'd and purged
[23:32:02] <icinga-wm>	 RECOVERY - puppet last run on mc1034 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[23:33:46] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] installserver: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345549 (owner: 10Faidon Liambotis)
[23:33:50] <wikibugs>	 (03PS5) 10Dzahn: installserver: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345549 (owner: 10Faidon Liambotis)
[23:47:15] <wikibugs>	 (03PS6) 10Dzahn: DHCP: remove backup4001 [puppet] - 10https://gerrit.wikimedia.org/r/345356 (https://phabricator.wikimedia.org/T158220)
[23:47:28] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] DHCP: remove backup4001 [puppet] - 10https://gerrit.wikimedia.org/r/345356 (https://phabricator.wikimedia.org/T158220) (owner: 10Dzahn)
[23:48:12] <icinga-wm>	 RECOVERY - puppet last run on mc1036 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[23:51:18] <wikibugs>	 (03PS7) 10Dzahn: DHCP: remove backup4001 [puppet] - 10https://gerrit.wikimedia.org/r/345356 (https://phabricator.wikimedia.org/T158220)
[23:54:21] <wikibugs_>	 06Operations, 10Monitoring: certspotter on einsteinium has issues talking to external - https://phabricator.wikimedia.org/T162327#3159322 (10Dzahn)
[23:56:36] <wikibugs_>	 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3159330 (10Dzahn) re: mail to noc@  I was stupid of course i can check that, it's just an alias for root@ and all ops get that. but .. i can still not see one fr...
[23:57:08] <wikibugs>	 (03PS8) 10Dzahn: DHCP: remove backup4001 [puppet] - 10https://gerrit.wikimedia.org/r/345356 (https://phabricator.wikimedia.org/T161904)
[23:57:44] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] "this is now per the new decom task" [puppet] - 10https://gerrit.wikimedia.org/r/345356 (https://phabricator.wikimedia.org/T161904) (owner: 10Dzahn)