[00:08:08] PROBLEM - puppet last run on rdb1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:08:28] PROBLEM - puppet last run on elastic1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:18:58] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.235 second response time
[00:23:38] RECOVERY - puppet last run on prometheus2001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[00:23:58] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.007 second response time
[00:24:08] PROBLEM - puppet last run on dbproxy1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:36:09] RECOVERY - puppet last run on rdb1008 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[00:37:29] RECOVERY - puppet last run on elastic1052 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[00:53:08] RECOVERY - puppet last run on dbproxy1009 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[01:14:58] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.009 second response time
[01:15:58] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.037 second response time
[01:27:58] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.009 second response time
[01:28:58] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.008 second response time
[02:20:20] PROBLEM - puppet last run on cp1099 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:22:58] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.15) (duration: 08m 57s)
[02:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:19] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Mar 13 02:28:18 UTC 2017 (duration 5m 21s)
[02:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:48:18] RECOVERY - puppet last run on cp1099 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[03:39:38] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:48:18] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:08:38] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[04:10:58] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2243.90 Read Requests/Sec=2967.00 Write Requests/Sec=9.90 KBytes Read/Sec=24300.80 KBytes_Written/Sec=1380.40
[04:16:18] RECOVERY - puppet last run on labsdb1005 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[04:18:58] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=8.10 Read Requests/Sec=0.10 Write Requests/Sec=59.40 KBytes Read/Sec=0.40 KBytes_Written/Sec=424.40
[05:03:28] PROBLEM - puppet last run on mc1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:15:08] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.008 second response time
[05:16:08] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.067 second response time
[05:27:48] PROBLEM - puppet last run on mw1284 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:28:08] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.006 second response time
[05:31:28] RECOVERY - puppet last run on mc1034 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[05:32:09] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.022 second response time
[05:38:28] PROBLEM - puppet last run on mc1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:55:48] RECOVERY - puppet last run on mw1284 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:06:28] RECOVERY - puppet last run on mc1020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:09:18] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100%
[06:28:28] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:38:18] PROBLEM - Disk space on ms-be2008 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sde1 is not accessible: Input/output error
[06:51:28] PROBLEM - MegaRAID on ms-be2008 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[06:51:39] ACKNOWLEDGEMENT - MegaRAID on ms-be2008 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T160312
[06:51:42] Operations, ops-codfw: Degraded RAID on ms-be2008 - https://phabricator.wikimedia.org/T160312#3094632 (ops-monitoring-bot)
[06:51:50] PROBLEM - puppet last run on ms-be2008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sde1]
[06:52:02] !log powercycle mw2256, stuck in boot (looked in the console)
[06:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:09] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms
[06:57:29] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[07:00:04] Deploy window Holiday (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170313T0700)
[07:01:19] RECOVERY - Disk space on ms-be2008 is OK: DISK OK
[07:05:55] (PS1) Marostegui: db-eqiad.php: Depool db1030 [mediawiki-config] - https://gerrit.wikimedia.org/r/342414 (https://phabricator.wikimedia.org/T159414)
[07:06:46] (Abandoned) Marostegui: Revert "db-eqiad.php: Depool db1022" [mediawiki-config] - https://gerrit.wikimedia.org/r/342218 (owner: Marostegui)
[07:06:58] Operations, ops-codfw, Patch-For-Review, User-Elukey: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#3094640 (elukey) It keeps repeating, I can see a lot of EDAC errors in kern.log: ``` elukey@mw2256:~$ sudo grep -i EDAC /var/log/kern.log Mar 12 15:29:09 mw2256 kerne...
[07:07:39] (PS2) Marostegui: db-eqiad.php: Depool db1030 [mediawiki-config] - https://gerrit.wikimedia.org/r/342414 (https://phabricator.wikimedia.org/T159414)
[07:09:38] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1030 [mediawiki-config] - https://gerrit.wikimedia.org/r/342414 (https://phabricator.wikimedia.org/T159414) (owner: Marostegui)
[07:11:10] (Merged) jenkins-bot: db-eqiad.php: Depool db1030 [mediawiki-config] - https://gerrit.wikimedia.org/r/342414 (https://phabricator.wikimedia.org/T159414) (owner: Marostegui)
[07:11:19] (CR) jenkins-bot: db-eqiad.php: Depool db1030 [mediawiki-config] - https://gerrit.wikimedia.org/r/342414 (https://phabricator.wikimedia.org/T159414) (owner: Marostegui)
[07:12:30] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1030 - T159414 (duration: 00m 52s)
[07:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:36] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414
[07:13:14] !log Deploy alter table s6 revision table on db1030 - T159414
[07:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:08] (PS1) Marostegui: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/342421 (https://phabricator.wikimedia.org/T132416)
[07:21:55] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/342421 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui)
[07:23:01] (Merged) jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/342421 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui)
[07:23:13] (CR) jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/342421 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui)
[07:24:06] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1089 - T132416 (duration: 00m 41s)
[07:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:12] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416
[07:24:12] !log Deploy alter table enwiki.revision db1089 - T132416
[07:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:29] PROBLEM - puppet last run on rdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:35:38] Operations, ops-codfw, DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3094702 (Marostegui) There are multiple errors on that host, related to memory and CPU (maybe it is the wrong DIMM bank affecting the CPU or the other way around as those can be related to each other)...
[07:46:29] !log upgrading apache on remaining mediawiki servers in eqiad
[07:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:03] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1030" [mediawiki-config] - https://gerrit.wikimedia.org/r/342426
[07:56:07] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1030" [mediawiki-config] - https://gerrit.wikimedia.org/r/342426
[07:58:49] PROBLEM - git_daemon_running on contint2001 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/lib/git-core/git-daemon
[07:59:29] RECOVERY - puppet last run on rdb1001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[07:59:49] RECOVERY - git_daemon_running on contint2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/git-core/git-daemon
[08:01:07] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1030" [mediawiki-config] - https://gerrit.wikimedia.org/r/342426 (owner: Marostegui)
[08:02:26] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1030" [mediawiki-config] - https://gerrit.wikimedia.org/r/342426 (owner: Marostegui)
[08:02:33] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1030" [mediawiki-config] - https://gerrit.wikimedia.org/r/342426 (owner: Marostegui)
[08:03:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1030 - T159414 (duration: 00m 41s)
[08:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:23] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414
[08:08:35] !log Deploy alter table s6 - db1050 (master) - T159414
[08:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:41] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414
[08:18:19] (PS2) Filippo Giunchedi: Enable ipvs node_exporter collector on lvs boxes [puppet] - https://gerrit.wikimedia.org/r/342175
[08:20:07] (PS1) Marostegui: db-eqiad.php: Depool db1070 [mediawiki-config] - https://gerrit.wikimedia.org/r/342427 (https://phabricator.wikimedia.org/T153743)
[08:22:47] (CR) Filippo Giunchedi: [C: 2] Enable ipvs node_exporter collector on lvs boxes [puppet] - https://gerrit.wikimedia.org/r/342175 (owner: Filippo Giunchedi)
[08:23:44] (PS2) Muehlenhoff: debdeploy: Support stretch installations in update spec files [puppet] - https://gerrit.wikimedia.org/r/342233
[08:24:06] (PS5) Filippo Giunchedi: facilities: add row and site parameters for pdus [puppet] - https://gerrit.wikimedia.org/r/341533 (https://phabricator.wikimedia.org/T148541)
[08:25:07] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1070 [mediawiki-config] - https://gerrit.wikimedia.org/r/342427 (https://phabricator.wikimedia.org/T153743) (owner: Marostegui)
[08:25:41] (CR) Muehlenhoff: [C: 2] debdeploy: Support stretch installations in update spec files [puppet] - https://gerrit.wikimedia.org/r/342233 (owner: Muehlenhoff)
[08:26:33] (Merged) jenkins-bot: db-eqiad.php: Depool db1070 [mediawiki-config] - https://gerrit.wikimedia.org/r/342427 (https://phabricator.wikimedia.org/T153743) (owner: Marostegui)
[08:27:09] (CR) jenkins-bot: db-eqiad.php: Depool db1070 [mediawiki-config] - https://gerrit.wikimedia.org/r/342427 (https://phabricator.wikimedia.org/T153743) (owner: Marostegui)
[08:27:40] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1070 - T153743 (duration: 00m 41s)
[08:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:46] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743
[08:30:42] !log Stop MySQL on db1095 (sanitarium2) to take a backup - T153743
[08:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:30] !log Stop replication on labsdb1009,10 and 11 - T153743
[08:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:17] (Abandoned) Hashar: Gerrit: Increase sendemail.threadPoolSize to 5 [puppet] - https://gerrit.wikimedia.org/r/342313 (owner: Paladox)
[08:40:43] !log Compress dewiki - db1070 - T153743
[08:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:49] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743
[08:47:40] Operations, Patch-For-Review: Evaluate use of systemd-timesyncd on jessie for clock synchronisation - https://phabricator.wikimedia.org/T150257#3094778 (MoritzMuehlenhoff)
[08:47:43] Operations, Monitoring, Traffic, Patch-For-Review: diamond crashing on hosts using systemd-timesyncd - https://phabricator.wikimedia.org/T157794#3094776 (MoritzMuehlenhoff) Open>Resolved Closing, the collector isn't applied to systems using timesyncd any longer (and this is fixed on the c...
[08:49:00] (PS4) Filippo Giunchedi: Send thumbor process age to statsd via cron [puppet] - https://gerrit.wikimedia.org/r/341791 (https://phabricator.wikimedia.org/T159352) (owner: Gilles)
[08:51:32] (PS5) Filippo Giunchedi: Send thumbor process age to statsd via cron [puppet] - https://gerrit.wikimedia.org/r/341791 (https://phabricator.wikimedia.org/T159352) (owner: Gilles)
[08:52:28] (CR) Gilles: [C: 1] Disable WikimediaEvents extension on closed wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/342148 (https://phabricator.wikimedia.org/T158721) (owner: Krinkle)
[08:54:20] elukey: good morning. Do you still need the instance deployment-copper.deployment-prep.eqiad.wmflabs ? Apparently used as a package builder but the OpenStack metadata re broken for it
[08:54:31] elukey: so I am wondering whether we should try to fix it / rebuild it or just delete it
[08:54:35] Operations, DBA, Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3094798 (Marostegui) I think we are good to go now \o/: ``` Automatically selected FileSet: mysql-srv-backups +--------+-------+----------+-------------------+--------------------...
[08:54:43] Operations, DBA, Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3094799 (Marostegui) Open>Resolved
[08:54:54] (CR) Gilles: [C: 1] Send thumbor process age to statsd via cron [puppet] - https://gerrit.wikimedia.org/r/341791 (https://phabricator.wikimedia.org/T159352) (owner: Gilles)
[08:54:57] hashar: thanks for pinging, we can delete it (I'll do it this morning if you want)
[08:55:20] elukey: thanks. It is gone :-}
[08:56:20] elukey: should be easy to recreate it thanks to puppet :]
[08:57:06] (PS6) Filippo Giunchedi: Send thumbor process age to statsd via cron [puppet] - https://gerrit.wikimedia.org/r/341791 (https://phabricator.wikimedia.org/T159352) (owner: Gilles)
[08:58:46] (CR) Filippo Giunchedi: [C: 2] Send thumbor process age to statsd via cron [puppet] - https://gerrit.wikimedia.org/r/341791 (https://phabricator.wikimedia.org/T159352) (owner: Gilles)
[09:00:13] gilles: ^ merged
[09:00:27] rhanks
[09:00:28] thanks
[09:01:31] (PS4) Filippo Giunchedi: Optional Cassandra client encryption; Enabled on RESTBase Staging [puppet] - https://gerrit.wikimedia.org/r/342075 (https://phabricator.wikimedia.org/T111113) (owner: Eevans)
[09:06:55] Operations, Performance-Team, Thumbor, Patch-For-Review: Graph the age of the Thumbor processes in Grafana - https://phabricator.wikimedia.org/T159352#3094818 (Gilles) Added: https://grafana.wikimedia.org/dashboard/db/thumbor?panelId=12&fullscreen
[09:07:43] (CR) Filippo Giunchedi: [C: 2] Optional Cassandra client encryption; Enabled on RESTBase Staging [puppet] - https://gerrit.wikimedia.org/r/342075 (https://phabricator.wikimedia.org/T111113) (owner: Eevans)
[09:11:09] (PS6) Filippo Giunchedi: facilities: add row and site parameters for pdus [puppet] - https://gerrit.wikimedia.org/r/341533 (https://phabricator.wikimedia.org/T148541)
[09:12:07] (CR) Filippo Giunchedi: [C: 2] facilities: add row and site parameters for pdus [puppet] - https://gerrit.wikimedia.org/r/341533 (https://phabricator.wikimedia.org/T148541) (owner: Filippo Giunchedi)
[09:23:03] !log downgrading elasticsearch to v5.1.2 on relforge, a full reindex will be needed - T156150
[09:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:09] T156150: Install ES 5.x to relforge100[12] - https://phabricator.wikimedia.org/T156150
[09:25:13] Operations, Analytics-Kanban, Traffic, Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3094868 (elukey) Set `innodb_buffer_pool_size = 2048M` to see if helps. I checked `SHOW ENGINE INNODB STATUS` and some data is relevan...
[09:27:09] PROBLEM - bacula sd process on helium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (bacula), command name bacula-sd
[09:28:15] godog: moritzm: gilles: hi! Any idea why deployment-imagescaler01 puppet config points to the puppet master puppetmaster.thumbor.eqiad.wmflabs ?
[09:28:45] the instance fails to reach that puppet master. Might have been done around Feb 28th
[09:32:35] Operations, Analytics-Kanban, Traffic, Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3094891 (elukey) @JoeWalsh: Hi! I am wondering if the 100% -> 10% reduction in the sample rate has been released or not, since I am cu...
[09:39:04] hashar: no idea, I didn't change that
[09:40:33] moritzm: merely wondering since you showed up in `last` :}
[09:42:59] (PS5) Filippo Giunchedi: facilities: add codfw PDUs [puppet] - https://gerrit.wikimedia.org/r/341534 (https://phabricator.wikimedia.org/T148541)
[09:43:01] (PS15) Filippo Giunchedi: prometheus: add snmp_exporter module and profile [puppet] - https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541)
[09:43:03] (PS5) Filippo Giunchedi: [WIP] add PDUs jobs to prometheus [puppet] - https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541)
[09:43:53] (CR) Filippo Giunchedi: "Thanks Volans! I've moved the role to a profile in the next PS, see also inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541) (owner: Filippo Giunchedi)
[09:44:26] (CR) jerkins-bot: [V: -1] [WIP] add PDUs jobs to prometheus [puppet] - https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) (owner: Filippo Giunchedi)
[09:45:14] godog: the snmp_exporter read/reload the yaml file at any change or you need to trigger a reload after changing it?
[09:47:31] volans: now IIRC it is the latter, though I think at some point it might move to watch its config
[09:47:31] hashar: I was using that to test thumbor-related puppet changes
[09:48:49] hashar: it worked fine when I set it up, don't know why it can't reach the puppetmaster now
[09:49:05] godog: ok, then as you want, I was thinking of building it in memory and then writing it. And if it has to be atomic you could still write to file.yml.new and then rename. Both solutions works :)
[09:49:22] gilles: I guess the puppetmaster died or some firewall rules prevent it to be reachable from deployment-prep :/
[09:50:19] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:50:37] Operations, ops-eqiad, DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3094946 (Marostegui) @Cmjohnson once you are back in the DC can you check if you have any spare BBU? Thanks!
[09:56:15] (PS1) Gilles: Performance Grafana alerts [puppet] - https://gerrit.wikimedia.org/r/342431 (https://phabricator.wikimedia.org/T156245)
[09:57:25] hashar: is there a phab task for that issue yet?
[09:57:43] gilles: na just noticed it while doing my monday morning check of beta :}
[10:02:18] Operations, ops-codfw: Degraded RAID on ms-be2008 - https://phabricator.wikimedia.org/T160312#3094995 (Volans)
[10:03:40] Operations, ops-codfw: Degraded RAID on ms-be2008 - https://phabricator.wikimedia.org/T160312#3094632 (Volans) @fgiunchedi I've manually updated the task description because NRPE timed out (it took me ~1 minute to get the output). As usual puppet is broken due to mkfs and alarming on Icinga
[10:09:03] (PS1) Jcrespo: mariadb: Decouple db proxy role classes to separate files [puppet] - https://gerrit.wikimedia.org/r/342436 (https://phabricator.wikimedia.org/T150850)
[10:11:18] volans: yeah both would work, I don't feel strongly about it heh
[10:12:04] :)
[10:16:55] Operations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815#3095016 (MoritzMuehlenhoff)
[10:16:58] Operations, LDAP: Update/add/remove LDAP entries based on changes to data.yaml - https://phabricator.wikimedia.org/T142819#3095014 (MoritzMuehlenhoff) Open>declined Closing, this very much overlaps with T142819, tracking that one instead.
[10:18:19] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[10:23:29] (CR) Muehlenhoff: "JFTR, this is using a bot user indeed: https://phabricator.wikimedia.org/p/offboarding/" [puppet] - https://gerrit.wikimedia.org/r/342222 (owner: Muehlenhoff)
[10:23:57] (PS3) Gehel: Elastic 5.1.2 plugins [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/341826 (owner: DCausse)
[10:25:48] (PS6) Filippo Giunchedi: facilities: add codfw PDUs [puppet] - https://gerrit.wikimedia.org/r/341534 (https://phabricator.wikimedia.org/T148541)
[10:25:50] (PS16) Filippo Giunchedi: prometheus: add snmp_exporter module and profile [puppet] - https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541)
[10:25:52] (PS6) Filippo Giunchedi: [WIP] add PDUs jobs to prometheus [puppet] - https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541)
[10:26:35] (CR) jerkins-bot: [V: -1] [WIP] add PDUs jobs to prometheus [puppet] - https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) (owner: Filippo Giunchedi)
[10:28:29] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:30:23] (PS4) Gehel: Elastic 5.1.2 plugins [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/341826 (owner: DCausse)
[10:33:10] (PS1) Filippo Giunchedi: hieradata: enable https for swift eqiad [puppet] - https://gerrit.wikimedia.org/r/342438 (https://phabricator.wikimedia.org/T127455)
[10:33:23] Operations, Gerrit, Mail, Release-Engineering-Team: Gerrit emails are showing up as being sent late via Yahoo servers - https://phabricator.wikimedia.org/T159960#3095070 (Aklapper)
[10:41:03] (CR) Filippo Giunchedi: [C: 2] hieradata: enable https for swift eqiad [puppet] - https://gerrit.wikimedia.org/r/342438 (https://phabricator.wikimedia.org/T127455) (owner: Filippo Giunchedi)
[10:44:08] !log Update site statistics on gu.wikipedia (T160328)
[10:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:15] T160328: Update statistics count on gu.wikipedia - https://phabricator.wikimedia.org/T160328
[10:44:51] (PS2) Jcrespo: mariadb: Decouple db proxy role classes to separate files [puppet] - https://gerrit.wikimedia.org/r/342436 (https://phabricator.wikimedia.org/T150850)
[10:48:03] (CR) Jcrespo: [C: 2] "https://puppet-compiler.wmflabs.org/5753/" [puppet] - https://gerrit.wikimedia.org/r/342436 (https://phabricator.wikimedia.org/T150850) (owner: Jcrespo)
[10:48:07] (PS3) Jcrespo: mariadb: Decouple db proxy role classes to separate files [puppet] - https://gerrit.wikimedia.org/r/342436 (https://phabricator.wikimedia.org/T150850)
[10:56:29] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[11:02:28] (PS1) Harjotsingh: Disable default quick survey [mediawiki-config] - https://gerrit.wikimedia.org/r/342443 (https://phabricator.wikimedia.org/T159739)
[11:03:35] (PS1) Giuseppe Lavagetto: Add tests, improve code [software/etcd-mirror] - https://gerrit.wikimedia.org/r/342444
[11:03:37] (PS1) Giuseppe Lavagetto: Refactor ReplicationController, version bump [software/etcd-mirror] - https://gerrit.wikimedia.org/r/342445
[11:05:11] (CR) Phuedx: [C: 1] Disable default quick survey [mediawiki-config] - https://gerrit.wikimedia.org/r/342443 (https://phabricator.wikimedia.org/T159739) (owner: Harjotsingh)
[11:06:08] !log purge bswiki logo - T158815
[11:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:13] T158815: Update logo for bs.wikipedia - https://phabricator.wikimedia.org/T158815
[11:11:39] (PS2) Phuedx: quickSurveys: Disable surveys on Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/342443 (https://phabricator.wikimedia.org/T159739) (owner: Harjotsingh)
[11:15:15] !log bounce pybal on lvs1006 to try picking up swift https changes
[11:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:22] Operations, Analytics-Kanban, Traffic, Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3095154 (JoeWalsh) @elukey the reduction to 10% will be in the next app version, 5.4. Currently it's scheduled to be released at the e...
[11:22:19] Operations, ops-codfw: Degraded RAID on ms-be2008 - https://phabricator.wikimedia.org/T160312#3095186 (fgiunchedi) sde is unhappy ``` [12359587.846888] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK [12359587.846891] sd 0:2:4:0: [sde] CDB: [12359587.846892] Read(10): 28 00 00 00 00 29 00 00 01 00...
[11:27:34] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 35 seconds ago with 2 failures. Failed resources (up to 3 shown): Exec[create_shapelines-gis-coastlines],Exec[create_shapelines-gis-land_polygons]
[11:29:42] akosiaris, did you enable it back?^
[11:30:10] jynus:
[11:30:12] jynus: yes
[11:30:19] I am loading the coastlines and lang polygons
[11:30:30] and then it's just getting the user dbs imported and we are done
[11:30:34] RECOVERY - puppet last run on labsdb1007 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[11:30:40] I checked 1006
[11:30:51] and there are not extra dbs other than the ones puppetized
[11:31:54] cool, so it's just moving the data over
[11:32:07] I 'll do it in a while.. should be quick enough
[11:39:41] joewalsh: thanks a lot!
[11:40:20] elukey: no problem!
[11:42:24] (PS2) Muehlenhoff: Setup "bot" credentials file for Phabricator support in offboarding script [puppet] - https://gerrit.wikimedia.org/r/342222
[11:48:04] (PS3) Muehlenhoff: Setup "bot" credentials file for Phabricator support in offboarding script [puppet] - https://gerrit.wikimedia.org/r/342222
[11:49:41] (CR) Muehlenhoff: [C: 2] Setup "bot" credentials file for Phabricator support in offboarding script [puppet] - https://gerrit.wikimedia.org/r/342222 (owner: Muehlenhoff)
[11:55:46] Operations, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters: MediaWiki Datacenter Switchover automation - https://phabricator.wikimedia.org/T160178#3095230 (Volans) My proposal is to have a python file for each task (where feasible) with the same external interface, so...
[11:56:54] !log reimage analytics1042 (Hadoop worker node) to Debian Jessie
[11:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:12] Operations, Analytics, Analytics-Cluster, User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3095234 (elukey)
[12:00:29] Operations, Analytics, Analytics-Cluster, User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3095234 (ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1042.eqiad.wmnet']...
[12:02:20] gilles: looks like the puppet master lacks some apache module :/
[12:03:36] Operations, ops-codfw, DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3095262 (jcrespo) From T147769: > Description: CPU 1 has an internal error (IERR). > Mentioned in SAL (#wikimedia-operations) [2016-10-18T14:40:20Z] Shutting down es2015 for hardware ma...
[12:04:15] Operations, ops-codfw, DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3095264 (Marostegui) So looks like the CPU is broken then and needs replacement. @Papaul let's dismiss the DIMM change and proceed to change that CPU that has failed twice now?
[12:05:31] !log install libevent security updates
[12:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:31] !log restart pybal on lvs1003 to add swift-https_443
[12:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:08] RECOVERY - LVS HTTPS IPv4 on ms-fe.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.020 second response time
[12:10:29] (CR) Giuseppe Lavagetto: [C: 1] "The code looks much nicer now too." [software/cumin] - https://gerrit.wikimedia.org/r/342238 (https://phabricator.wikimedia.org/T159968) (owner: Volans)
[12:10:50] (PS3) Volans: Clustershell: always require an event handler [software/cumin] - https://gerrit.wikimedia.org/r/342238 (https://phabricator.wikimedia.org/T159968)
[12:11:32] (CR) Volans: [C: 2] Clustershell: always require an event handler [software/cumin] - https://gerrit.wikimedia.org/r/342238 (https://phabricator.wikimedia.org/T159968) (owner: Volans)
[12:12:08] (Merged) jenkins-bot: Clustershell: always require an event handler [software/cumin] - https://gerrit.wikimedia.org/r/342238 (https://phabricator.wikimedia.org/T159968) (owner: Volans)
[12:13:16] (PS6) Volans: Add integration tests for clustershell transport [software/cumin] - https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969)
[12:15:47] (PS1) Elukey: Set Debian Jessie as default image for all the Hadoop worker nodes [puppet] - https://gerrit.wikimedia.org/r/342448 (https://phabricator.wikimedia.org/T160333)
[12:18:06] (CR) Elukey: [C: 2] Set Debian Jessie as default image for all the Hadoop worker nodes [puppet] - https://gerrit.wikimedia.org/r/342448 (https://phabricator.wikimedia.org/T160333) (owner: Elukey)
[12:21:04] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0]
[12:22:05] Operations, Analytics, Analytics-Cluster, Patch-For-Review, User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3095359 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1042.eqiad.wmnet'] ``` and were **ALL** succe...
[12:22:33] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Graph the age of the Thumbor processes in Grafana - https://phabricator.wikimedia.org/T159352#3095360 (10Gilles) 05Open>03Resolved Graph works fine, it should let me see if updates make any improvement. [12:22:44] PROBLEM - puppet last run on wtp1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:26:54] 06Operations, 10Analytics, 10Analytics-Cluster, 13Patch-For-Review, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3095380 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analy... [12:27:01] jouncebot: refresh [12:27:02] I refreshed my knowledge about deployments. [12:27:06] jouncebot: next [12:27:06] In 0 hour(s) and 32 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170313T1300) [12:29:16] (03CR) 10Giuseppe Lavagetto: [C: 031] "I like the way docker is used and managed, but the actual test code is extremely verbose and abstract at the same time; if you allowed for" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969) (owner: 10Volans) [12:44:42] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3095397 (10ema) >>! In T154954#3092101, @AndyRussG wrote: > The patch might purge 3000 or more URLs for each banner save... 
[12:49:44] RECOVERY - puppet last run on wtp1021 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [12:54:59] 06Operations, 10Analytics, 10Analytics-Cluster, 13Patch-For-Review, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3095482 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1042.eqiad.wmnet'] ``` and were **ALL** succe... [12:57:30] jouncebot: refresh [12:57:31] I refreshed my knowledge about deployments. [12:57:33] jouncebot: next [12:57:33] In 0 hour(s) and 2 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170313T1300) [12:57:54] (03PS3) 10Hashar: Allow 'autoreviewrestore' to be managed from Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342042 (owner: 10MarcoAurelio) [12:58:25] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342042 (owner: 10MarcoAurelio) [12:58:33] I'm here fwiw [12:58:37] just in case [12:59:42] (03Merged) 10jenkins-bot: Allow 'autoreviewrestore' to be managed from Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342042 (owner: 10MarcoAurelio) [12:59:50] (03CR) 10jenkins-bot: Allow 'autoreviewrestore' to be managed from Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342042 (owner: 10MarcoAurelio) [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170313T1300). [13:00:04] TabbyCat and phuedx: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:30] Hello. I can SWAT. [13:00:48] oh hashar, you're already doing that, okay :) [13:00:58] Dereckson: feel free to do the other one ? 
:} [13:01:05] https://gerrit.wikimedia.org/r/#/c/342443/ [13:01:08] Dereckson: I think hashar is already doing it? [13:01:18] !log hashar@tin Synchronized wmf-config/CommonSettings.php: +$wgAvailableRights[] = autoreviewrestore; (duration: 00m 41s) [13:01:23] oh that is for labs.php only [13:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:37] * TabbyCat tests the change [13:01:57] (PS2) Ema: cache_misc: set timeout_idle to 120s [puppet] - https://gerrit.wikimedia.org/r/341576 (https://phabricator.wikimedia.org/T159429) [13:02:08] Dereckson: I am not quite sure about that patch really [13:02:16] mine should be fine [13:02:28] it's beta cluster only and easy to test [13:02:29] any reason to just disable them ? [13:02:43] iirc they got created to be able to test quicksurveys [13:02:45] (CR) Ema: [V: 2 C: 2] cache_misc: set timeout_idle to 120s (1 comment) [puppet] - https://gerrit.wikimedia.org/r/341576 (https://phabricator.wikimedia.org/T159429) (owner: Ema) [13:03:04] hashar: noise on performance graphs [13:03:21] then I am not sure how the browser test will manage to enable them [13:03:36] ema: you there?
[13:03:45] wait wait, these are for the nightlies aren't they [13:03:59] * phuedx backs off [13:04:06] yeah we have a few daily browser tests in QuickSurveys [13:04:06] elukey: yes, I've just noticed that the mariadb module was also there [13:04:22] too late :( [13:04:27] :( [13:04:29] phuedx: the job being https://integration.wikimedia.org/ci/view/Selenium/job/selenium-QuickSurveys/ [13:04:32] ta [13:05:13] jynus: ping, I've accidentally merged the mariadb module https://gerrit.wikimedia.org/r/#/c/341576/ [13:05:15] 06Operations, 06Performance-Team, 10Traffic: What happened 2017-03-09 04:00 - 06:00 UTC - https://phabricator.wikimedia.org/T160109#3095504 (10Gilles) [13:05:24] phuedx: does a few tests : https://integration.wikimedia.org/ci/job/selenium-QuickSurveys/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/339/testReport/(root)/ [13:05:48] (03CR) 10Phuedx: [C: 04-1] "As Hashar pointed out, these surveys are for the QuickSurveys nightly build, which will fail if the surveys aren't enabled. 
This'll need a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342443 (https://phabricator.wikimedia.org/T159739) (owner: 10Harjotsingh) [13:05:51] hashar: ^ [13:05:58] i'll unschedule the deploy [13:06:11] as for -labs.php jobs, they can be merged outside of SWAT [13:06:15] since that is a noop for prod [13:06:33] the daily browser test is mentioned in https://phabricator.wikimedia.org/T159739 [13:07:26] or maybe they can be disabled [13:07:37] and passing ?quicksurvey=true force them to be enabled [13:07:50] but then I have no idea what that extension is doing, much less how it works exactly [13:08:48] hashar: we can drop the survey's coverage to 0% and then make the browser pass in ?quicksurvey=true [13:08:51] i'll update the tracking task [13:08:56] thanks for shouting out <3 <3 <3 [13:10:33] 06Operations, 06Performance-Team, 10Traffic: What happened 2017-03-09 04:00 - 06:00 UTC - https://phabricator.wikimedia.org/T160109#3095512 (10Gilles) @BBlack have there been any known networking incidents during that period? 2017-03-09 05:00 -> 2017-03-10 13:00 The higher TTFB has been confirmed with diffe... [13:11:05] alright [13:13:07] phuedx: worth trying :) [13:13:28] (03CR) 10Phuedx: [C: 04-1] "This is actually very easy to fix. 
Rather than" [mediawiki-config] - https://gerrit.wikimedia.org/r/342443 (https://phabricator.wikimedia.org/T159739) (owner: Harjotsingh) [13:13:43] hashar: it's actually remarkably simple to fix, i hope the volunteer is online [13:13:50] hashar: sorry for the delay but my patch is working just fine, so thanks [13:14:02] TabbyCat: \O/ [13:14:18] migrated the logs from mediawiki to meta :D [13:14:25] so it's done [13:14:29] phuedx: or at your option hijack his patch, CR+2 and we can then run the tests against beta :} [13:14:45] * phuedx doesn't really like treading on folk's feet [13:15:05] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [13:15:33] hashar: ^ it can wait a while, there's no huge rush [13:16:28] ok ok :) [13:21:42] (PS1) Ema: Revert "cache_misc: set timeout_idle to 120s" [puppet] - https://gerrit.wikimedia.org/r/342455 [13:32:52] okay im stupid wheres the initializesettings.php located for core? [13:32:59] oops wrong channel [13:33:03] (PS3) Harjotsingh: quickSurveys: Disable surveys on Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/342443 (https://phabricator.wikimedia.org/T159739) [13:33:10] (CR) Ema: [V: 2 C: 2] Revert "cache_misc: set timeout_idle to 120s" [puppet] - https://gerrit.wikimedia.org/r/342455 (owner: Ema) [13:33:11] Zppix: with a "s" [13:33:14] (british spelling) [13:33:27] Operations, Performance-Team, Traffic: What happened 2017-03-09 04:00 - 06:00 UTC - https://phabricator.wikimedia.org/T160109#3095584 (BBlack) Yes, the anomalies you're describing here were the result of a DDoS attack against us, and our mitigations to reduce user impact. The incident doc is private...
[13:33:34] Dereckson: ah wheres it at however [13:33:58] 06Operations: Encrypt all the things - https://phabricator.wikimedia.org/T111653#3095590 (10fgiunchedi) [13:34:03] 06Operations, 10media-storage, 13Patch-For-Review, 07Wikimedia-Multiple-active-datacenters: Enable HTTPS for Swift traffic - https://phabricator.wikimedia.org/T127455#3095588 (10fgiunchedi) 05Open>03Resolved HTTPS for `ms-fe.svc` is now active in eqiad and codfw [13:34:35] (03CR) 10Harjotsingh: "> As Hashar pointed out, these surveys are for the QuickSurveys" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342443 (https://phabricator.wikimedia.org/T159739) (owner: 10Harjotsingh) [13:36:14] 06Operations, 06Performance-Team, 10Traffic: What happened 2017-03-09 04:00 - 06:00 UTC - https://phabricator.wikimedia.org/T160109#3095601 (10Gilles) @MoritzMuehlenhoff gave me the link. Is there a page we can watch to stay up to date on those? Since the depooling/repooling steps didn't show up in the SAL.... [13:39:13] 06Operations, 06Performance-Team, 10Traffic: What happened 2017-03-09 04:00 - 06:00 UTC - https://phabricator.wikimedia.org/T160109#3095604 (10BBlack) Presently, we don't have any sort of "live" feed on security incidents other than what you've already mentioned. For some classes of incident, such a thing w... [13:41:36] (03Draft2) 10Zppix: Deprecation of "editusercssjs" in MW-CORE Must merge with Ic9851d53affe0f4ece7a79f541ec5cb39133b109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 [13:44:15] 06Operations, 06Performance-Team, 10Traffic: What happened 2017-03-09 04:00 - 06:00 UTC - https://phabricator.wikimedia.org/T160109#3095606 (10Gilles) I meant something private. The incident page on officewiki looks orphaned. If there was a page where those are listed, I could go there like I go to the SAL w... 
[13:45:43] Dereckson: can you look over gerrit:342456 Ive being meaning to get this deprecatiated for a bit now [13:46:42] (03CR) 10Volans: "What actually made me do the abstract test code was to avoid to repeat myself 70 times to duplicate the code for each different test combi" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969) (owner: 10Volans) [13:47:09] (03CR) 10Dereckson: "To deprecate != to remove" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 (owner: 10Zppix) [13:47:15] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:47:21] ah ok [13:48:43] (03PS7) 10Volans: Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969) [13:51:25] (03PS3) 10Zppix: Deprecation of "editusercssjs" in MW-CORE Must merge with Ic9851d53affe0f4ece7a79f541ec5cb39133b109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 [13:52:03] (03PS4) 10Jcrespo: mariadb: Decouple db proxy role classes to separate files [puppet] - 10https://gerrit.wikimedia.org/r/342436 (https://phabricator.wikimedia.org/T150850) [13:52:15] Zppix: a "deprecation" is a warning for a future removal, when you remove something, it's a removal [13:52:24] (03CR) 10jerkins-bot: [V: 04-1] Deprecation of "editusercssjs" in MW-CORE Must merge with Ic9851d53affe0f4ece7a79f541ec5cb39133b109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 (owner: 10Zppix) [13:52:34] the goal is to allow some delay to adjust the configuration or code [13:54:57] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3095626 (10akosiaris) We 've ended up promoting labsdb1007 to master, resyncing from planet.osm and pg_dump/pg_restore the various databases/tables.... 
[13:55:28] Dereckson: im confused are you asking me to change something? [13:56:41] (03PS1) 10Alexandros Kosiaris: Switch osmdb.eqiad.wmnet to use labsdb1007 [dns] - 10https://gerrit.wikimedia.org/r/342458 (https://phabricator.wikimedia.org/T157359) [13:56:50] I think you may have taken `To deprecate != to remove` literally - Dereckson just means "to deprecate is not the same as removing" [13:57:43] So you can clarify what you wish to do, for example if you really want to remove, explain when it has been deprecated and it's now removed from core [13:57:58] (03CR) 10Jcrespo: [C: 031] "When it is ready." [dns] - 10https://gerrit.wikimedia.org/r/342458 (https://phabricator.wikimedia.org/T157359) (owner: 10Alexandros Kosiaris) [13:58:11] and if you only wish to mark them deprecated you can keep the settings and add `// deprecated` at the end of the line (or do nothing) [13:58:54] https://gerrit.wikimedia.org/r/#/c/332934/ <- that's a more clear commit message [14:04:09] (03PS1) 10Jcrespo: Remove mariadb submodule [puppet] - 10https://gerrit.wikimedia.org/r/342459 [14:05:00] (03CR) 10jerkins-bot: [V: 04-1] Remove mariadb submodule [puppet] - 10https://gerrit.wikimedia.org/r/342459 (owner: 10Jcrespo) [14:07:32] (03PS2) 10Jcrespo: Remove mariadb submodule [puppet] - 10https://gerrit.wikimedia.org/r/342459 [14:09:06] (03PS4) 10Zppix: Removal of "editusercssjs" in MW-CORE (Mediawiki 1.16 Depricated) Must merge with Ic9851d53affe0f4ece7a79f541ec5cb39133b109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 [14:10:12] (03CR) 10jerkins-bot: [V: 04-1] Removal of "editusercssjs" in MW-CORE (Mediawiki 1.16 Depricated) Must merge with Ic9851d53affe0f4ece7a79f541ec5cb39133b109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 (owner: 10Zppix) [14:10:58] (03PS2) 10Filippo Giunchedi: Provision new ms-be machines in codfw [puppet] - 10https://gerrit.wikimedia.org/r/342074 (https://phabricator.wikimedia.org/T158337) [14:11:57] (03CR) 10Volans: [C: 032] Add 
integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969) (owner: 10Volans) [14:12:36] (03Merged) 10jenkins-bot: Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969) (owner: 10Volans) [14:12:49] (03CR) 10Filippo Giunchedi: [C: 032] Provision new ms-be machines in codfw [puppet] - 10https://gerrit.wikimedia.org/r/342074 (https://phabricator.wikimedia.org/T158337) (owner: 10Filippo Giunchedi) [14:13:22] jynus: merging your change too [14:13:30] thanks [14:13:41] np, JFYI [14:15:25] (03CR) 10Alexandros Kosiaris: [C: 032] Switch osmdb.eqiad.wmnet to use labsdb1007 [dns] - 10https://gerrit.wikimedia.org/r/342458 (https://phabricator.wikimedia.org/T157359) (owner: 10Alexandros Kosiaris) [14:16:15] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [14:20:17] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3095661 (10akosiaris) And we are done. The rest of the databases/tables have been copied over, the DNS record has been updated and DNS caches cleare... 
[14:20:38] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342460 [14:20:43] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342460 [14:21:48] (03PS1) 10Ema: cache_misc: set timeout_idle to 120s [puppet] - 10https://gerrit.wikimedia.org/r/342461 (https://phabricator.wikimedia.org/T159429) [14:22:52] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342460 (owner: 10Marostegui) [14:23:43] hashar: gonna merge https://gerrit.wikimedia.org/r/#/c/342443/ [14:23:52] now it just sets all coverage to zero [14:24:03] and the tests make sure that this is ignored [14:24:03] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342460 (owner: 10Marostegui) [14:24:03] phuedx: ok :) [14:24:15] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342460 (owner: 10Marostegui) [14:24:17] (03CR) 10Phuedx: [C: 032] quickSurveys: Disable surveys on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342443 (https://phabricator.wikimedia.org/T159739) (owner: 10Harjotsingh) [14:24:20] phuedx: then once it is deployed I guess we can trigger the browser tests [14:24:23] ^ [14:24:27] (03CR) 10Alexandros Kosiaris: [C: 031] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/342459 (owner: 10Jcrespo) [14:25:02] stuff to play while waiting: Kempff playing beethoven https://www.youtube.com/watch?v=oqSulR9Fymg [14:25:19] (03Merged) 10jenkins-bot: quickSurveys: Disable surveys on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342443 (https://phabricator.wikimedia.org/T159739) (owner: 10Harjotsingh) [14:25:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1089 - T132416 (duration: 00m 41s) [14:25:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:54] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416 [14:26:03] hashar: stuff i'm listening to https://www.youtube.com/watch?v=fnvMe0rWI2I [14:27:05] (03CR) 10jenkins-bot: quickSurveys: Disable surveys on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342443 (https://phabricator.wikimedia.org/T159739) (owner: 10Harjotsingh) [14:27:10] (03PS1) 10Marostegui: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342462 (https://phabricator.wikimedia.org/T132416) [14:28:01] hashar: can I deploy db-eqiad.php? [14:28:02] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3095669 (10jcrespo) @aude @MaxSem @Kolossos Can you verify your applications (e.g. restarting them) and see that they work as expected to be 100% th... [14:28:10] kicking off a build [14:28:26] marostegui: yeah [14:28:32] \o/ [14:28:34] thanks [14:28:34] marostegui: phuedx is landing a beta cluster only change [14:28:51] ah ok :) [14:29:08] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342462 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [14:30:40] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342462 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [14:30:48] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342462 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [14:31:07] (03CR) 10Jcrespo: "I do not want to pressure reviewers- but this should be applied quickly or abandoned quickly. 
Otherwise, it will cause confusion and block" [puppet] - 10https://gerrit.wikimedia.org/r/342459 (owner: 10Jcrespo) [14:31:09] phuedx: scap running on https://integration.wikimedia.org/ci/job/beta-scap-eqiad/146163/console [14:31:42] hashar: mibad -- i thought it ran as a post-merge job [14:31:56] phuedx: the fetch on deployment-tin is indeed a post merge [14:32:02] ah [14:32:08] which then trigger the scap job asynchronously [14:32:14] iirc [14:32:20] hashar, quick question- do you think getting rid of a submodule on puppet could cause any issue to CI? [14:32:28] so it fetches post-merge so there's no noise when you're git fetch origin; git log origin.. ing [14:32:45] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1083 - T132416 (duration: 00m 41s) [14:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:51] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416 [14:33:13] !log Deploy alter table enwiki.revision db1083 - T132416 [14:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:54] jynus: maybe there will be some paths conflict on the process that merges patches on tip of branch (zuul-merger) but that is fixable [14:34:27] all submodules should now pass puppet parser validate / puppet-lint [14:34:37] feel free to comment on https://gerrit.wikimedia.org/r/342459 [14:34:52] in case you have any tip or want to be aware of it being merged [14:35:23] removing physically the repo would be done separatelly, but after it [14:35:36] ACKNOWLEDGEMENT - MD RAID on ms-be2028 is CRITICAL: CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T160349 [14:35:38] (03CR) 10Marostegui: [C: 031] "yaaay +1!! 
This compiles fine: https://puppet-compiler.wmflabs.org/5756/" [puppet] - 10https://gerrit.wikimedia.org/r/342459 (owner: 10Jcrespo) [14:35:40] 06Operations, 10ops-codfw: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T160349#3095692 (10ops-monitoring-bot) [14:35:53] jynus: the job passed so I guess that is green light as far as CI is concerned :-} [14:36:00] hashar, nice [14:37:03] jynus: there are a bunch of files you can remove such as .gitignore .gitreview .puppet-lint.rc Gemfile [14:37:16] 06Operations: Provide wrapper script for account handling - https://phabricator.wikimedia.org/T142825#3095698 (10MoritzMuehlenhoff) The cross-validation is already handled via the daily consistency check. I started to work on a quick frontend to add a user to data.yaml, but that doesn't work very well, since loa... [14:37:24] hashar, yeah, I was wondering [14:37:26] jynus: that can be cleaned up later though [14:37:27] if thoese were [14:37:32] per-dir [14:37:34] or per-repo [14:37:41] so I left all of them just in case [14:38:01] the job runs from the root of the repo [14:38:10] and basically does: bundle install && bundle exec rake test [14:38:28] I have copy pasted a subset of that to each of the submodules which have the same command run for them [14:38:50] (sorry my english is becoming crappier and crappier :/ ) [14:39:00] 06Operations: Offboarding script for account handling - https://phabricator.wikimedia.org/T142825#3095699 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [14:40:27] phuedx: And the survey code has fully loaded -> "timed out after 30 seconds, waiting for true condition" on [14:40:30] :( [14:41:11] phuedx: I guess since the coverage is 0 , there is no way to be in a bucket that enables it [14:42:08] (03CR) 10Elukey: [C: 031] "LGTM! 
https://puppet-compiler.wmflabs.org/5758/" [puppet] - 10https://gerrit.wikimedia.org/r/342461 (https://phabricator.wikimedia.org/T159429) (owner: 10Ema) [14:43:30] hashar: weird -- we're sending a queryparam that overrides the bucketing :/ [14:43:36] i'll take a look [14:44:10] (03PS3) 10Jcrespo: Remove mariadb submodule [puppet] - 10https://gerrit.wikimedia.org/r/342459 [14:44:12] (03CR) 10Ema: [C: 031] Remove mariadb submodule [puppet] - 10https://gerrit.wikimedia.org/r/342459 (owner: 10Jcrespo) [14:44:59] ah [14:45:10] we're not adding it in some tests [14:47:59] jynus: so yeah basically you can drop any files at the root of modules/mariadb/ [14:48:50] and then I guess we can mark the repo read only [14:49:09] I am not as worried for that as much as things actually breaking [14:49:37] although I do not see anyone using the mariadb-wmf specific module on its own [14:58:39] (03CR) 10Ema: [C: 032] cache_misc: set timeout_idle to 120s [puppet] - 10https://gerrit.wikimedia.org/r/342461 (https://phabricator.wikimedia.org/T159429) (owner: 10Ema) [14:59:02] hashar: is eu swat done? we're starting es5 upgrade in codfw (I have a small mw-config patch to deploy) [15:01:08] (03CR) 10Gehel: [C: 031] "LGTM, we are ready to go..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342031 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [15:02:52] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 620.11 seconds [15:03:53] checking [15:04:23] I see, it is the table-checksum [15:07:32] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:08:23] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:09:38] 06Operations, 10ops-codfw: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T160349#3095805 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi My mistake, host is provisioning [15:09:38] ok nobody logged on tin, assuming that swat is done. I'll deploy a mw-config change from tin to stop sending writes to elastic@codfw [15:10:08] (03PS2) 10DCausse: [es5 upgrade] step 1: depool codfw for writes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342031 (https://phabricator.wikimedia.org/T157479) [15:10:34] dcausse: yeah swat is done [15:10:46] hashar: thanks, deploying [15:10:51] dcausse: did it starting at 2pm CET :} [15:10:59] my best wishes for es! [15:11:07] thanks :) [15:11:32] PROBLEM - puppet last run on install1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:13:06] (03CR) 10DCausse: [C: 032] [es5 upgrade] step 1: depool codfw for writes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342031 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [15:14:30] (03Merged) 10jenkins-bot: [es5 upgrade] step 1: depool codfw for writes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342031 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [15:14:42] (03CR) 10jenkins-bot: [es5 upgrade] step 1: depool codfw for writes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342031 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [15:17:26] (03PS1) 10Alexandros Kosiaris: Make oresrdb2002 a slave of oresrb2001 [puppet] - 10https://gerrit.wikimedia.org/r/342468 (https://phabricator.wikimedia.org/T160082) [15:19:09] !log dcausse@tin Synchronized wmf-config/CommonSettings.php: [es5 upgrade] step 1: depool codfw for writes 1/2 (duration: 00m 45s) [15:19:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:08] !log dcausse@tin Synchronized wmf-config/InitialiseSettings.php: [es5 upgrade] step 1: depool codfw for writes 2/2 (duration: 00m 44s) [15:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:41] !log elastic@codfw stopped to receive writes [15:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:58] 06Operations, 06Commons, 10Datasets-General-or-Unknown, 07Community-Wishlist-Survey-2016: Back up of Commons files - https://phabricator.wikimedia.org/T160229#3095846 (10Nemo_bis) I have no idea what this request wants from ArchiveTeam/WikiTeam that we aren't already doing (http://archiveteam.org/index.php... [15:21:59] I'm done deploying my change [15:22:28] dcausse: thanks! I'm going to wait a bit for things to settle before shutting down elasticsearch... [15:28:24] (03CR) 10Alexandros Kosiaris: [C: 032] Make oresrdb2002 a slave of oresrb2001 [puppet] - 10https://gerrit.wikimedia.org/r/342468 (https://phabricator.wikimedia.org/T160082) (owner: 10Alexandros Kosiaris) [15:28:29] (03PS2) 10Alexandros Kosiaris: Make oresrdb2002 a slave of oresrb2001 [puppet] - 10https://gerrit.wikimedia.org/r/342468 (https://phabricator.wikimedia.org/T160082) [15:28:36] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Make oresrdb2002 a slave of oresrb2001 [puppet] - 10https://gerrit.wikimedia.org/r/342468 (https://phabricator.wikimedia.org/T160082) (owner: 10Alexandros Kosiaris) [15:30:42] PROBLEM - puppet last run on db1039 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:36:35] (03PS1) 10Alexandros Kosiaris: oresrdb2002: Add it to ores::redis::client_hosts [puppet] - 10https://gerrit.wikimedia.org/r/342469 [15:38:28] (03PS2) 10Alexandros Kosiaris: oresrdb2002: Add it to ores::redis::client_hosts [puppet] - 10https://gerrit.wikimedia.org/r/342469 [15:38:33] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] oresrdb2002: Add it to ores::redis::client_hosts [puppet] - 10https://gerrit.wikimedia.org/r/342469 (owner: 10Alexandros Kosiaris) [15:39:32] RECOVERY - puppet last run on install1002 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:40:48] 06Operations: dbstore1001 troubleshoot IPMI issue - https://phabricator.wikimedia.org/T158893#3095874 (10Marostegui) @Cmjohnson did you arrange any concrete date with Dell in the end? The server is now in service and ideally, if it needed to be brought down...Monday or Tuesdays would fit us better (the backups a... [15:41:05] 06Operations, 10DBA: dbstore1001 troubleshoot IPMI issue - https://phabricator.wikimedia.org/T158893#3095875 (10Marostegui) [15:41:49] !log shutting down elasticsearch on codfw for v5.1.2 upgrade - T158680 [15:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:54] T158680: Upgrade codfw to ES 5.x - https://phabricator.wikimedia.org/T158680 [15:42:01] 06Operations, 10Analytics, 10Analytics-Cluster, 13Patch-For-Review, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3095880 (10Nuria) p:05Normal>03High [15:42:04] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: oresrdb2002 rack/setup - https://phabricator.wikimedia.org/T160082#3095881 (10akosiaris) [15:42:52] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: oresrdb2002 rack/setup - https://phabricator.wikimedia.org/T160082#3088068 (10akosiaris) 05Open>03Resolved [15:42:52] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK 
slave_sql_lag Replication lag: 51.95 seconds [15:43:10] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: oresrdb2002 rack/setup - https://phabricator.wikimedia.org/T160082#3088068 (10akosiaris) Resolved. thanks @Papaul , @fgiunchedi [15:46:34] 07Puppet, 10Continuous-Integration-Config: also clone submodules in operations/puppet jobs - https://phabricator.wikimedia.org/T112670#1641972 (10jcrespo) I think this was done some time ago, but I could be wrong. Could you check its validity? [15:50:25] PROBLEM - puppet last run on cp1099 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:50:49] PROBLEM - LVS HTTP IPv4 on search.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.161 second response time [15:50:49] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic2018.codfw.wmnet because of too many down!: search_9200 - Could not depool server elastic2007.codfw.wmnet because of too many down! [15:50:49] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic2001.codfw.wmnet because of too many down!: search_9200 - Could not depool server elastic2007.codfw.wmnet because of too many down! [15:51:14] gehel^ [15:51:16] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1003 replacement - https://phabricator.wikimedia.org/T159839#3095921 (10Nuria) p:05Normal>03High [15:51:31] is that the upgrade to elasticsearch 5 ? [15:51:32] yep, this is me, it seems I forgot to silence an alert... [15:51:34] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3095922 (10Nuria) p:05Normal>03High [15:51:35] ? 
[15:51:37] ok [15:53:55] (03CR) 10Alexandros Kosiaris: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/341534 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [15:54:41] pybal checks can only be silenced globally it seems... [15:54:55] ACKNOWLEDGEMENT - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic2018.codfw.wmnet because of too many down!: search_9200 - Could not depool server elastic2007.codfw.wmnet because of too many down! Gehel elasticsearch upgrade to 5.1.2 - T158680 [15:54:55] ACKNOWLEDGEMENT - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic2001.codfw.wmnet because of too many down!: search_9200 - Could not depool server elastic2007.codfw.wmnet because of too many down! Gehel elasticsearch upgrade to 5.1.2 - T158680 [15:55:20] gehel: it's the LVS check that should have been silenced.. not the pybal one IMHO [15:55:25] that is.. the one that pages [15:56:00] (03CR) 10Alexandros Kosiaris: [C: 031] facilities: add codfw PDUs [puppet] - 10https://gerrit.wikimedia.org/r/341534 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [15:56:18] I silenced the LVS check now (too late, I know). I was sure I did it, but I must have forgot to click ok... [15:58:45] RECOVERY - puppet last run on db1039 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:59:21] (03PS1) 10Gehel: elasticsearch - upgrade codfw to elasticsearch 5.1.2 [puppet] - 10https://gerrit.wikimedia.org/r/342472 (https://phabricator.wikimedia.org/T158680) [16:01:58] 07Puppet, 10Continuous-Integration-Config: also clone submodules in operations/puppet jobs - https://phabricator.wikimedia.org/T112670#3095957 (10hashar) Status: | CI job | Submodules |--|-- | operations-puppet-tox-jessie | NO | operations-puppet-rake-jessie | YES, recursive | operations-puppet-typos | NO Th... 
[16:06:54] !log upgrading plugins to 5.1.2 on elasticsearch codfw - T158680 [16:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:00] T158680: Upgrade codfw to ES 5.x - https://phabricator.wikimedia.org/T158680 [16:07:14] (03CR) 10Gehel: [C: 032] elasticsearch - upgrade codfw to elasticsearch 5.1.2 [puppet] - 10https://gerrit.wikimedia.org/r/342472 (https://phabricator.wikimedia.org/T158680) (owner: 10Gehel) [16:08:40] 06Operations, 15User-Elukey: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3095981 (10elukey) ``` elukey@neodymium:~$ sudo -i salt -E 'rdb100[1357].eqiad.wmnet' cmd.run "du -hs /srv/redis/*.rdb | sort -h" rdb1007.eqiad.wmnet: 12K /srv/redis/rdb1007-6378.rdb... [16:10:15] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban: Reinstall Analytics Hadoop Cluster with Debian Jessie - https://phabricator.wikimedia.org/T157807#3095988 (10Nuria) [16:10:25] PROBLEM - puppet last run on mw1227 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:10:35] PROBLEM - Host ms-be2039 is DOWN: PING CRITICAL - Packet loss = 100% [16:11:23] (03PS1) 10Gehel: elasticsearch - enable experimental APT repo for ES 5.x upgrade [puppet] - 10https://gerrit.wikimedia.org/r/342473 (https://phabricator.wikimedia.org/T158680) [16:11:25] RECOVERY - Host ms-be2039 is UP: PING OK - Packet loss = 0%, RTA = 36.45 ms [16:12:50] (03CR) 10EBernhardson: [C: 031] elasticsearch - enable experimental APT repo for ES 5.x upgrade [puppet] - 10https://gerrit.wikimedia.org/r/342473 (https://phabricator.wikimedia.org/T158680) (owner: 10Gehel) [16:13:35] (03CR) 10Gehel: [C: 032] elasticsearch - enable experimental APT repo for ES 5.x upgrade [puppet] - 10https://gerrit.wikimedia.org/r/342473 (https://phabricator.wikimedia.org/T158680) (owner: 10Gehel) [16:18:25] RECOVERY - puppet last run on cp1099 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:23:23] !log restarting elasticsearch on elastic2001 after upgrade - T158680 [16:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:30] T158680: Upgrade codfw to ES 5.x - https://phabricator.wikimedia.org/T158680 [16:25:23] !log restarting elasticsearch on all codfw cluster after upgrade - T158680 [16:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:35] (03PS1) 10Jcrespo: mariadb: Depool db1054 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342477 [16:27:32] 06Operations, 15User-Elukey: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3096036 (10elukey) Number of keys: ``` elukey@rdb1007:~$ for instance in 6378 6379 6380 6381; do echo ${instance}; redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${ins... 
[16:28:50] RECOVERY - LVS HTTP IPv4 on search.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.173 second response time [16:29:45] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [16:29:45] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [16:35:42] !log add ms-be2028/29/30 to swift codfw-prod, initial add - T158337 [16:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:48] T158337: codfw: ms-be2028-ms-be2039 rack/setup - https://phabricator.wikimedia.org/T158337 [16:39:25] RECOVERY - puppet last run on mw1227 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [16:42:55] PROBLEM - git_daemon_running on contint2001 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/lib/git-core/git-daemon [16:45:15] 06Operations, 10Continuous-Integration-Infrastructure, 06Labs, 10netops: git clone over EQIAD (wmflabs) CODFW timeout due to low bandwidth (~250 KiB/s) - https://phabricator.wikimedia.org/T158601#3096072 (10EddieGP) p:05Triage>03Normal [16:45:36] AHHH [16:45:51] finally I found out why git_daemon_running keeps alerting :} [16:47:55] RECOVERY - git_daemon_running on contint2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/git-core/git-daemon [16:50:27] PROBLEM - swift eqiad-prod object availability on graphite1001 is CRITICAL: CRITICAL: 11.76% of data under the critical threshold [90.0] [16:51:31] 06Operations, 10ops-codfw: Degraded RAID on ms-be2008 - https://phabricator.wikimedia.org/T160312#3096083 (10Papaul) @fgiunchedi Yes we do have some spares on site.
[16:52:19] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1054 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342477 (owner: 10Jcrespo) [16:53:33] (03PS1) 10Hashar: zuul: monitoring should discard forked git-daemon [puppet] - 10https://gerrit.wikimedia.org/r/342483 [16:54:55] (03Merged) 10jenkins-bot: mariadb: Depool db1054 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342477 (owner: 10Jcrespo) [16:55:05] (03CR) 10jenkins-bot: mariadb: Depool db1054 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342477 (owner: 10Jcrespo) [16:55:22] (03CR) 10Hashar: "I messed it up previously sorry :( Tested on contint1001 / contint2001 which are the hosts having a git-daemon running:" [puppet] - 10https://gerrit.wikimedia.org/r/342483 (owner: 10Hashar) [16:55:29] !log outdated swift rings pushed in eqiad-prod, pushed again updated rings from git repo - T158337 [16:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:35] T158337: codfw: ms-be2028-ms-be2039 rack/setup - https://phabricator.wikimedia.org/T158337 [16:57:24] 06Operations, 10ops-codfw, 10DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3096110 (10Papaul) Since we swapped CPU's in T147769 and we still have the same error, I will contact Dell once on site tomorrow for CPU replacement. [16:57:44] 06Operations, 06Services, 15User-mobrovac: Move all Node.JS services to Jessie and Node 4 - https://phabricator.wikimedia.org/T124989#3096130 (10mobrovac) [16:57:46] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1054 for upgrade (duration: 00m 53s) [16:57:48] 06Operations, 06Services (done), 15User-mobrovac: Migrate SCA cluster to SCB (Jessie and Node 4.2) - https://phabricator.wikimedia.org/T96017#3096126 (10mobrovac) 05Open>03Resolved a:03mobrovac This has been completed a looong time ago. Closing. 
[16:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] gehel: Respected human, time to deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170313T1700). Please do the needful. [17:00:58] Damn daylight saving, things start happening faster than expected... [17:04:59] 06Operations, 10Continuous-Integration-Infrastructure, 06Labs, 10netops: git clone over EQIAD (wmflabs) CODFW timeout due to low bandwidth (~250 KiB/s) - https://phabricator.wikimedia.org/T158601#3096149 (10hashar) 05Open>03Resolved a:03hashar Must have been a transient issue. Seems the bandwidth is... [17:09:38] 06Operations, 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure: Update npm to 3 or 4 - https://phabricator.wikimedia.org/T155488#3096168 (10hashar) [17:10:03] (03CR) 10EBernhardson: [es5 upgrade] step 2: repool codfw and send wmf16 to codfw (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342032 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [17:10:05] (03PS2) 10DCausse: [es5 upgrade] step 2: repool codfw and send wmf16 to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342032 (https://phabricator.wikimedia.org/T157479) [17:10:35] (03CR) 10DCausse: [es5 upgrade] step 2: repool codfw and send wmf16 to codfw (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342032 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [17:12:14] (03PS3) 10DCausse: [es5 upgrade] step 2: repool codfw and send wmf16 to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342032 (https://phabricator.wikimedia.org/T157479) [17:14:37] (03CR) 10DCausse: [es5 upgrade] step 2: repool codfw and send wmf16 to codfw (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342032 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [17:14:46] (03CR) 10EBernhardson: [C: 031] [es5 upgrade] step 
2: repool codfw and send wmf16 to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342032 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [17:14:55] PROBLEM - git_daemon_running on contint2001 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/lib/git-core/git-daemon [17:15:55] RECOVERY - git_daemon_running on contint2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/git-core/git-daemon [17:16:38] I'm going to deploy a mw-config change to send writes to elastic@codfw (elastic5 upgrade) [17:17:30] (03CR) 10DCausse: [C: 032] [es5 upgrade] step 2: repool codfw and send wmf16 to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342032 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [17:18:26] (03Merged) 10jenkins-bot: [es5 upgrade] step 2: repool codfw and send wmf16 to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342032 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [17:18:34] (03CR) 10jenkins-bot: [es5 upgrade] step 2: repool codfw and send wmf16 to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342032 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [17:18:48] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wdqs1003.eqiad.wmnet [17:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:38] !log stopping mariadb at db1054 and preparing for backup and reimage [17:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:27] (03CR) 10Filippo Giunchedi: [C: 032] zuul: monitoring should discard forked git-daemon [puppet] - 10https://gerrit.wikimedia.org/r/342483 (owner: 10Hashar) [17:22:12] !log gehel@tin Started deploy [wdqs/wdqs@202a106]: (no justification provided) [17:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:59] !log gehel@tin Finished deploy [wdqs/wdqs@202a106]: (no justification provided) (duration: 01m 
46s) [17:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:31] SMalyshev: deployment complete, tests are looking good [17:24:54] !log dcausse@tin Synchronized wmf-config/CommonSettings.php: [es5 upgrade] step 2: repool codfw and send wmf16 to codfw 1/3 (duration: 00m 46s) [17:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:58] (03CR) 10Filippo Giunchedi: "Since both contint machines are jessie, failed daemons are also covered under the generic "systemd health" check. Additionally if restarts" [puppet] - 10https://gerrit.wikimedia.org/r/342483 (owner: 10Hashar) [17:26:35] !log dcausse@tin Synchronized wmf-config/CirrusSearch-common.php: [es5 upgrade] step 2: repool codfw and send wmf16 to codfw 2/3 (duration: 00m 44s) [17:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:42] (03PS4) 10Jcrespo: Remove mariadb submodule [puppet] - 10https://gerrit.wikimedia.org/r/342459 [17:27:19] gehel: great, thank you! 
[17:28:12] !log dcausse@tin Synchronized wmf-config/InitialiseSettings.php: [es5 upgrade] step 2: repool codfw and send wmf16 to codfw 3/3 (duration: 00m 41s) [17:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:00] !log done re-enabling writes to elastic@codfw (elastic5 upgrade) [17:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:22] !log re-configuring cluster settings after elasticsearch upgrade - T158680 [17:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:29] T158680: Upgrade codfw to ES 5.x - https://phabricator.wikimedia.org/T158680 [17:30:05] (03CR) 10Jcrespo: [C: 032] Remove mariadb submodule [puppet] - 10https://gerrit.wikimedia.org/r/342459 (owner: 10Jcrespo) [17:31:26] (03PS7) 10Filippo Giunchedi: facilities: add codfw PDUs [puppet] - 10https://gerrit.wikimedia.org/r/341534 (https://phabricator.wikimedia.org/T148541) [17:33:45] jynus: do I need to wait to merge my change or good to go after your submodule removal? 
[17:33:57] yes please [17:34:03] it needs manual changing [17:34:07] because of conflicts [17:34:24] heh good times, let me know if I can help [17:35:02] it is a pain, because I cannot use the script [17:35:18] as I have to delete files in between puppet executions [17:36:24] godog, a second look at the repo state would be nice [17:36:56] in particular, double checking that the mariadb module is on the puppet repo [17:37:11] and not the mariadb one [17:37:12] jynus: sure, I'm checking now [17:37:19] I will check other puppet masters [17:37:27] in case they have problems with the "replication" [17:37:39] (03PS1) 10DCausse: Disable completion suggester update jobs [puppet] - 10https://gerrit.wikimedia.org/r/342487 [17:38:23] yeah, I have to fix the other puppet masters manually, too [17:38:32] and probably all local repos [17:39:02] yep the .git dir in mariadb is still there on other puppetmasters [17:39:50] (03CR) 10Gehel: [C: 032] Disable completion suggester update jobs [puppet] - 10https://gerrit.wikimedia.org/r/342487 (owner: 10DCausse) [17:40:37] gehel, please do not merge yet [17:40:40] it will fail [17:40:49] jynus: sorry, too late... :( [17:41:06] PROBLEM - Unmerged changes on repository puppet on labcontrol1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [17:41:38] (03PS1) 10Reedy: Throttle for event request on IRC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342489 [17:42:01] 06Operations, 10ops-codfw, 10DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3096318 (10Marostegui) Sounds good - thank you! if you need to "justify" it, the idrac logs are here: T160242#3094702 [17:42:36] (03PS1) 10Gehel: elasticsearch: align static and persistent config [puppet] - 10https://gerrit.wikimedia.org/r/342490 [17:42:48] * gehel does not enjoy git submodules... 
[17:42:58] (03PS2) 10Reedy: Throttle for event request on IRC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342489 [17:43:00] that is why I am getting rid of them [17:43:05] PROBLEM - puppet last run on es1013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/opt/wmf-mariadb10/bin/mysqld_safe] [17:43:24] (03CR) 10Reedy: [C: 032] Throttle for event request on IRC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342489 (owner: 10Reedy) [17:43:36] so I assume in addition to pm1/2001/2 [17:43:43] labcontrol is the other one? [17:45:10] (03Merged) 10jenkins-bot: Throttle for event request on IRC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342489 (owner: 10Reedy) [17:45:56] in theory it should be ok for production puppet [17:45:59] now [17:46:02] (03CR) 10EBernhardson: elasticsearch: align static and persistent config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342490 (owner: 10Gehel) [17:46:12] but anyone checking, e.g. gehel [17:46:12] !log reedy@tin Synchronized wmf-config/throttle.php: Throttle rule for event currently ongoing (duration: 00m 43s) [17:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:19] would be nice [17:47:16] (03CR) 10jenkins-bot: Throttle for event request on IRC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342489 (owner: 10Reedy) [17:47:35] PROBLEM - puppet last run on db2047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:49:06] your local repos will likely break too, but that is easier to fix :-) [17:49:35] RECOVERY - puppet last run on db2047 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:50:05] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[17:54:27] (03PS1) 10Volans: Initial import [switchdc] - 10https://gerrit.wikimedia.org/r/342492 (https://phabricator.wikimedia.org/T160178) [17:54:37] what is rhodium? [17:55:11] puppetmaster backend jynus [17:56:05] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. [17:56:20] so I want to chown all /var/lib/git/operations/puppet to gitpuppet [17:56:27] on labcontrol1001 [17:56:42] because I think many people have pulled stuff as root [17:57:02] alternatively, I can pull as root [17:58:25] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/opt/wmf-mariadb10/bin/mysqld_safe],File[/usr/local/bin/pt-heartbeat-wikimedia] [17:58:35] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/pt-heartbeat-wikimedia] [17:58:35] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/mysql/grcat.config] [17:58:35] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/mysql/grcat.config] [17:59:35] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:59:50] I manually ran it after the failure on db1067 and worked fine [18:00:00] probably spurious errors between the mv and the merge [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170313T1800). 
[18:00:16] yeah, I just wanted to confirm that a second run would work fine [18:00:25] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [18:00:27] Just did it on db1028 [18:00:32] there it is [18:01:25] RECOVERY - swift eqiad-prod object availability on graphite1001 is OK: OK: Less than 1.00% under the threshold [95.0] [18:03:54] !log chowning /var/lib/git/operations/puppet to gitpuppet on labscontrol1001 [18:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:05] RECOVERY - Unmerged changes on repository puppet on labcontrol1001 is OK: No changes to merge. [18:05:30] labcontrol1001 should be ok now, but I am going to assume there is a failback on 1002 and so [18:06:43] !log chowning /var/lib/git/operations/puppet to gitpuppet on labscontrol1002 [18:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:46] I'm going to SWAT a patch for CirrusSearch (elastic5 upgrade) [18:08:16] 06Operations, 10ops-codfw: Degraded RAID on ms-be2008 - https://phabricator.wikimedia.org/T160312#3096391 (10fgiunchedi) a:03Papaul @Papaul ok! please replace even though we're decommissioning the machines in 4-5 weeks, on the basis that disks will be wiped and possibly used as spares? When decom time comes... 
[18:10:05] RECOVERY - puppet last run on es1013 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [18:10:46] (03CR) 10Filippo Giunchedi: [C: 032] facilities: add codfw PDUs [puppet] - 10https://gerrit.wikimedia.org/r/341534 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [18:10:53] (03PS8) 10Filippo Giunchedi: facilities: add codfw PDUs [puppet] - 10https://gerrit.wikimedia.org/r/341534 (https://phabricator.wikimedia.org/T148541) [18:14:05] RECOVERY - MegaRAID on ms-be2008 is OK: OK: optimal, 13 logical, 13 physical [18:17:05] PROBLEM - Unmerged changes on repository puppet on labcontrol1002 is CRITICAL: There are 1790 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [18:18:31] Reedy: I see you on tin, are you deploying something? [18:18:54] dcausse: Nope, I deployed about 40 minutes ago :) [18:19:01] ok thanks! :) [18:20:35] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:20:58] wah wah, I broke it [18:22:02] Error 400 on SERVER: Failed to submit 'replace catalog' command for einsteinium.wikimedia.org to PuppetDB at nitrogen.eqiad.wmnet:443: [413 Request Entity Too Large] [18:22:06] a new one [18:22:48] * Reedy takes the nice things from godog [18:23:41] ? [18:23:46] can I help [18:25:00] jynus: investigating, I think it was https://gerrit.wikimedia.org/r/#/c/341534/ which I'll rollback if I can't roll forward [18:25:10] * godog eyes Reedy [18:26:35] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [18:26:35] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [18:27:37] godog: POST limit on puppetdb API? 
[18:27:50] * volans assuming is a POST of course, but no idea [18:28:55] godog: client_max_body_size 20971520 [18:29:04] volans, ping about https://gerrit.wikimedia.org/r/#/c/338950/ again .. not sure if someone else has to +2 that. [18:29:27] subbu: no I can deploy it, I wanted to ping you today but assumed you were not working :) [18:29:40] i am not :-) [18:29:58] and I'd like to have you around to ensure everything is working after merging [18:30:22] ok. i am around now. [18:30:54] ok, waiting a second to see if godog needs to revert the last merge [18:31:03] sg [18:32:18] volans: I'm trying to confirm it is indeed the POST body, you can go ahead though [18:32:37] ok, thanks [18:32:52] (03PS7) 10Volans: Ruthenium VisualDiff: Test w/ local Parsoid instead of prod Parsoid [puppet] - 10https://gerrit.wikimedia.org/r/338950 (owner: 10Subramanya Sastry) [18:34:04] (03CR) 10Muehlenhoff: [C: 031] PDFRender: Delay service shut-down to work around xpra race [puppet] - 10https://gerrit.wikimedia.org/r/341833 (https://phabricator.wikimedia.org/T159922) (owner: 10GWicke) [18:34:22] (03CR) 10Volans: [C: 032] Ruthenium VisualDiff: Test w/ local Parsoid instead of prod Parsoid [puppet] - 10https://gerrit.wikimedia.org/r/338950 (owner: 10Subramanya Sastry) [18:34:30] subbu: merging [18:35:26] PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:36:18] (03PS2) 10Muehlenhoff: role::analytics_cluster::hadoop::standby: Enable base::firewall in the role [puppet] - 10https://gerrit.wikimedia.org/r/341292 [18:36:53] subbu: puppet run on ruthenium completed [18:37:15] let me know if all looks good and if I need to restart anything [18:37:43] !log dcausse@tin Synchronized php-1.29.0-wmf.15/extensions/CirrusSearch/: Make incoming link counting compatible with 5.x (duration: 00m 53s) [18:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:31] volans, looking [18:47:39] thanks [18:49:08] (03PS2) 10Giuseppe Lavagetto: Initial import [switchdc] - 10https://gerrit.wikimedia.org/r/342492 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [18:49:10] (03PS1) 10Giuseppe Lavagetto: Add redis switching task, some more stages boilerplate [switchdc] - 10https://gerrit.wikimedia.org/r/342498 [18:50:20] volans, lgtm. [18:50:41] great! thanks! [18:50:43] later this week, when we update the parsoid checkout on there, will know if the tests kick off as expected. [18:50:58] sounds good [18:53:45] PROBLEM - Unmerged changes on repository puppet on labtestcontrol2001 is CRITICAL: There are 1160 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [18:56:09] (03PS1) 10Filippo Giunchedi: Revert "facilities: add codfw PDUs" [puppet] - 10https://gerrit.wikimedia.org/r/342499 [18:57:23] reverting, can't quite figure out exactly why puppet is failing there [18:57:42] (03CR) 10Filippo Giunchedi: [C: 032] Revert "facilities: add codfw PDUs" [puppet] - 10https://gerrit.wikimedia.org/r/342499 (owner: 10Filippo Giunchedi) [18:57:44] godog: have you checked with a puppet compiler the size of the catalog? 
[18:57:47] (03PS2) 10Filippo Giunchedi: Revert "facilities: add codfw PDUs" [puppet] - 10https://gerrit.wikimedia.org/r/342499 [18:58:38] volans: yeah on compiler02 the pcc run I did resulted in ~4MB catalog [19:00:07] o [19:00:08] ok [19:01:07] I couldn't find offhand where/if puppet reports catalog size [19:01:14] anyways, I'm off [19:01:35] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [19:04:25] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [19:22:15] PROBLEM - puppet last run on mw1198 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:22:48] (03PS2) 10Krinkle: Disable WikimediaEvents extension on closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342148 (https://phabricator.wikimedia.org/T158721) [19:23:08] (03PS3) 10Krinkle: Disable WikimediaEvents extension on closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342148 (https://phabricator.wikimedia.org/T158721) [19:23:32] (03PS4) 10Krinkle: Disable WikimediaEvents extension on closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342148 (https://phabricator.wikimedia.org/T158721) [19:24:05] (03CR) 10Krinkle: Disable WikimediaEvents extension on closed wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342148 (https://phabricator.wikimedia.org/T158721) (owner: 10Krinkle) [19:37:55] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1003 replacement - https://phabricator.wikimedia.org/T159839#3096614 (10Nuria) [19:38:00] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3096615 (10Nuria) [19:51:15] RECOVERY - puppet last run on mw1198 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:53:37] 
PROBLEM - puppet last run on aqs1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:56:55] (03PS5) 10Nuria: navtiming: Make tests easier to extend [puppet] - 10https://gerrit.wikimedia.org/r/338044 (owner: 10Krinkle) [19:57:04] (03PS21) 10Nuria: Changes to perf consumer of event logging events [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170313T2000). [20:00:05] addshore: Respected human, time to deploy InterwikiSorting (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170313T2000). Please do the needful. [20:02:22] * Reedy kicks addshore [20:03:41] Yes! bah, [20:03:45] just got back from france [20:05:43] (03CR) 10Addshore: wmgUseInterwikiSorting true for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341032 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore) [20:05:46] (03PS5) 10Addshore: wmgUseInterwikiSorting true for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341032 (https://phabricator.wikimedia.org/T150183) [20:09:06] (03CR) 10Addshore: [C: 032] wmgUseInterwikiSorting true for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341032 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore) [20:10:41] (03Merged) 10jenkins-bot: wmgUseInterwikiSorting true for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341032 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore) [20:10:52] (03CR) 10jenkins-bot: wmgUseInterwikiSorting true for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341032 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore) [20:11:34] (03PS22) 10Krinkle: webperf: Update event logging 
consumer for userAgent schema change [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [20:13:54] Reedy: Could you check the db lists in that patch and see if they appear correct? [20:14:11] addshore: It looks vaguely right [20:14:20] But I don't know ori's dblist maths vodoo [20:14:32] hmmm, i scap pulled to mwdebug1002 but not appearing on special:version [20:15:02] silly question, did you enable the x-wikimedia-debug script addshore ? [20:15:04] https://noc.wikimedia.org/conf/highlight.php?file=flow_computed_labs.dblist [20:15:09] TabbyCat: yus :) [20:15:16] k, because I sometimes forget [20:15:20] :) [20:15:28] sorry for the noise [20:16:31] addshore: touch IS? [20:17:41] as in? the dblist? [20:20:26] IS being InitialiseSettings.php [20:20:32] ack [20:22:05] Dereckson: are you busy? [20:22:09] PROBLEM - puppet last run on ms-be1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:22:15] touched on tin and scap pulled on mwdebug1002 Reedy but still nothing [20:22:39] RECOVERY - puppet last run on aqs1004 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [20:24:52] reedy@tin:~$ mwscript eval.php mediawikiwiki [20:24:52] > var_dump( $wmgUseInterwikiSorting ); [20:24:52] bool(false) [20:25:27] definitely touched [20:25:30] addshore@mwdebug1002:/srv/mediawiki/wmf-config$ ls -ls |grep InitialiseSettings.php [20:25:30] 576 -rw-r--r-- 1 mwdeploy mwdeploy 587478 Mar 13 20:21 InitialiseSettings.php [20:25:41] I guess there is something wrong with the dblist magic then [20:26:03] I dunno if it does without any maths [20:26:15] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3096695 (10aaron) Assuming there are decent and simple libraries, having cache-aside APC with a... 
[20:26:18] I guess I should be able to add some pointless maths...? [20:26:30] Let's see [20:26:37] dblist changes usually need a CommonSettings.php touch addshore and Reedy [20:26:48] try touching CommonSettings.php [20:27:03] Dunno what CS has to do with it [20:27:03] if ( @filemtime( $filename ) >= filemtime( "$wmfConfigDir/InitialiseSettings.php" ) ) { [20:27:04] it happened to me not so long ago [20:27:17] and you don't lose anything trying [20:27:26] A CS touch didn't fix it [20:27:29] Reedy: it could be cache? [20:27:37] i'll try using math ;) [20:29:33] I guess touching the whole wmf-config/ folder won't fix it? [20:29:45] no [20:29:52] Reedy: interesting, tried with %% group0.dblist - large.dblist and it also didn't seem to work [20:29:59] (also after an IS touch) [20:30:07] addshore: [20:30:08] OH [20:30:18] You need to add your dblist as a tag [20:30:24] Touch all the things, it won't help [20:30:37] in CommonSettings? [20:30:42] Yes [20:30:46] foreach ( [ 'private', 'fishbowl', 'special', 'closed', 'flow', 'flaggedrevs', 'small', 'medium', [20:30:46] 'large', 'wikimania', 'wikidata', 'wikidataclient', 'visualeditor-nondefault', [20:30:46] 'commonsuploads', 'nonbetafeatures', 'group0', 'group1', 'group2', 'wikipedia', 'nonglobal', [20:30:46] 'wikitech', 'nonecho', 'mobilemainpagelegacy', 'compact-language-links', 'nowikidatadescriptiontaglines', [20:30:46] 'related-articles-footer-blacklisted-skins', [20:30:48] 'top6-wikipedia' [20:30:50] ] as $tag ) { [20:30:57] I knew it was CS's fault [20:30:57] I seeeeee [20:31:08] TabbyCat: Can touch it all day [20:31:09] * TabbyCat wants a cookie [20:31:40] Reedy: if you don't add the dblist there it won't work, ofc; but after that you need to :) [20:31:48] No you don't [20:31:49] we have too many db-lists :/ [20:31:54] The touch does nothing at that point [20:31:57] You've already modified the file [20:32:06] well, yes in this case [20:32:16] bd808: indeed [20:32:29] bd808: Indeed. 
I can only presume addshore is adding another because he's gonna do more dblist math before they enable everywhere [20:32:32] bd808: i think some consolidation is in order indeed [20:32:42] Rather than just reusing the groups that already exist [20:32:48] Reedy: indeed [20:32:53] I propose we populate deleted.dblist :P [20:32:59] * Reedy gets in his car to go and slap addshore [20:33:20] Reedy: bd808 do dblists cause overhead? [20:33:33] Yes, same as anything [20:33:39] I don't think it's anything significant [20:33:44] okay :) [20:34:03] Five thousand dblists later [20:34:50] (03PS1) 10Addshore: Add interwikisorting to CommonSettings $wikiTags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342504 (https://phabricator.wikimedia.org/T150183) [20:34:58] (03CR) 10Addshore: [C: 032] Add interwikisorting to CommonSettings $wikiTags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342504 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore) [20:35:03] there's nothing inherently bad about db-lists, it just means that there are many ways that the wiki farm wikis are special snowflakes [20:35:30] Hopefully, addshore's in this case, should be a temporary one for the rollout [20:35:30] bd808: yup :/ [20:35:33] Which I can understand [20:35:36] which in turn means to me that there are many ways that wikiX can blow up in a new and different way [20:35:37] Reedy: indeed :) [20:35:43] If it's gone in a month, we can't complain too much [20:35:52] And it should be :) [20:36:08] * bd808 reserves the right to complain about anything at any time for any reason [20:36:20] (03Merged) 10jenkins-bot: Add interwikisorting to CommonSettings $wikiTags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342504 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore) [20:36:47] What we need [20:36:50] ONWIKI CONFIG OF ALL THE THINGS [20:37:02] (03CR) 10jenkins-bot: Add interwikisorting to CommonSettings $wikiTags [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/342504 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore) [20:37:28] Reedy: thanks! showing up on testwiki now, from mwdebug1002! [20:37:41] Can I invoice WMDE now? [20:37:47] nope ;) [20:38:22] Reedy: that would be between good and dangerous [20:38:32] (wiki config all the things) [20:39:20] dinner bell [20:39:50] !log addshore@tin Synchronized dblists/interwikisorting.dblist: T150183 Enable InterwikiSorting on group0 [[gerrit:341032|#1]] [[gerrit:342504|#2]] PT 1/4 (duration: 00m 51s) [20:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:59] T150183: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183 [20:40:23] jouncebot: now [20:40:23] For the next 0 hour(s) and 19 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170313T2000) [20:40:23] For the next 0 hour(s) and 19 minute(s): InterwikiSorting (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170313T2000) [20:41:07] !log addshore@tin Synchronized docroot/noc/conf/interwikisorting.dblist: T150183 Enable InterwikiSorting on group0 [[gerrit:341032|#1]] [[gerrit:342504|#2]] PT 2/4 NOOP (duration: 00m 42s) [20:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:23] (03PS1) 10Hashar: mariadb: clear build related files [puppet] - 10https://gerrit.wikimedia.org/r/342506 [20:42:41] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T150183 Enable InterwikiSorting on group0 [[gerrit:341032|#1]] [[gerrit:342504|#2]] PT 3/4 (duration: 00m 41s) [20:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:35] !log addshore@tin Synchronized wmf-config/CommonSettings.php: T150183 Enable InterwikiSorting on group0 [[gerrit:341032|#1]] [[gerrit:342504|#2]] PT 4/4 (duration: 00m 40s) [20:43:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:47] (03PS2) 10Gehel: elasticsearch: align static and persistent config [puppet] - 10https://gerrit.wikimedia.org/r/342490 [20:44:30] lovely, all done! [20:44:31] (03CR) 10Gehel: elasticsearch: align static and persistent config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342490 (owner: 10Gehel) [20:45:24] !log InterwikiSorting deploy (to group0) done [20:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:09] RECOVERY - puppet last run on ms-be1007 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [21:00:04] dapatrick, bawolff, and Reedy: Respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170313T2100). Please do the needful. [21:06:57] 06Operations, 10Gerrit, 10Mail, 06Release-Engineering-Team: Gerrit emails are showing up as being sent late via Yahoo servers - https://phabricator.wikimedia.org/T159960#3096772 (10Paladox) @valhallasw Hi, yahoo tells me that they doint throttle emails but they do have an internal thing that can block spec... [21:09:59] PROBLEM - puppet last run on db1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:11:45] (03CR) 10Reedy: Removal of "editusercssjs" in MW-CORE (Mediawiki 1.16 Depricated) Must merge with Ic9851d53affe0f4ece7a79f541ec5cb39133b109 (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 (owner: 10Zppix) [21:15:16] (03CR) 10Reedy: Removal of "editusercssjs" in MW-CORE (Mediawiki 1.16 Depricated) Must merge with Ic9851d53affe0f4ece7a79f541ec5cb39133b109 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 (owner: 10Zppix) [21:19:09] PROBLEM - puppet last run on restbase1009 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:21:03] 06Operations, 10Mail: Yahoo is blocking mail from wikimedia - https://phabricator.wikimedia.org/T160381#3096795 (10Paladox) [21:21:32] 06Operations, 10Mail: Yahoo is blocking mail from wikimedia - https://phabricator.wikimedia.org/T160381#3096808 (10Paladox) p:05Triage>03High If yahoo is blocking emails then I'm setting high priority. [21:21:37] bawolff done https://phabricator.wikimedia.org/T160381 [21:24:25] (03PS5) 10Paladox: Removal of "editusercssjs" in MW-CORE (Mediawiki 1.16 Depricated) Must merge with Ic9851d53affe0f4ece7a79f541ec5cb39133b109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 (owner: 10Zppix) [21:25:25] (03CR) 10jerkins-bot: [V: 04-1] Removal of "editusercssjs" in MW-CORE (Mediawiki 1.16 Depricated) Must merge with Ic9851d53affe0f4ece7a79f541ec5cb39133b109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 (owner: 10Zppix) [21:26:01] (03PS6) 10Paladox: Removal of "editusercssjs" in MW-CORE (Mediawiki 1.16 Depricated) Must merge with Ic9851d53affe0f4ece7a79f541ec5cb39133b109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 (owner: 10Zppix) [21:27:14] (03CR) 10jerkins-bot: [V: 04-1] Removal of "editusercssjs" in MW-CORE (Mediawiki 1.16 Depricated) Must merge with Ic9851d53affe0f4ece7a79f541ec5cb39133b109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 (owner: 10Zppix) [21:27:39] (03PS7) 10Zppix: Removal of "editusercssjs" in MW-CORE (Mediawiki 1.16 Depricated) Must merge with Ic9851d53affe0f4ece7a79f541ec5cb39133b109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 [21:28:44] (03PS8) 10Paladox: Removal of "editusercssjs" in MW-CORE (Mediawiki 1.16 Depricated) Must merge with Ic9851d53affe0f4ece7a79f541ec5cb39133b109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 (owner: 10Zppix) [21:30:09] (03PS9) 10Paladox: Removal of "editusercssjs" in MW-CORE (Mediawiki 1.16 Depricated) Must merge with 
Ic9851d53affe0f4ece7a79f541ec5cb39133b109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 (owner: 10Zppix) [21:30:56] (03PS10) 10Paladox: Removal of "editusercssjs" in MW-CORE (Mediawiki 1.16 Depricated) Must merge with Ic9851d53affe0f4ece7a79f541ec5cb39133b109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 (owner: 10Zppix) [21:33:28] 06Operations, 10Gerrit, 10Mail, 06Release-Engineering-Team: Gerrit emails are showing up as being sent late via Yahoo servers - https://phabricator.wikimedia.org/T159960#3096821 (10Paladox) I have filled T160381 as wikimedia will need to fill out a form to get yahoo investigating it further. [21:38:59] RECOVERY - puppet last run on db1036 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [21:40:56] !log Deployed fix for T160266 [21:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:52] (03CR) 10Reedy: [C: 04-1] Removal of "editusercssjs" in MW-CORE (Mediawiki 1.16 Depricated) Must merge with Ic9851d53affe0f4ece7a79f541ec5cb39133b109 (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 (owner: 10Zppix) [21:41:55] 06Operations, 10Mail: Yahoo is blocking mail from wikimedia - https://phabricator.wikimedia.org/T160381#3096826 (10Aklapper) The task summary says "Yahoo is blocking mail from wikimedia". The task description says "It seems yahoo is blocking mail from wikimedia." That's a contradiction - either you know for su... [21:42:21] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, 06Discovery-Search (Current work): Update logstash on wikimedia to 5.x - https://phabricator.wikimedia.org/T154473#3096827 (10EBernhardson) Some initial thoughts: * It would be nice to not upgrade logstash, elasticsearch and kibana all in... 
[21:43:05] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, 06Discovery-Search (Current work): Update logstash on wikimedia to 5.x - https://phabricator.wikimedia.org/T154473#3096828 (10EBernhardson) a:03EBernhardson [21:47:39] PROBLEM - puppet last run on mw1294 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:48:09] RECOVERY - puppet last run on restbase1009 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [21:52:08] 06Operations, 10Mail: Yahoo is blocking mail from wikimedia - https://phabricator.wikimedia.org/T160381#3096832 (10Paladox) @bawolff told me to let wikimedia fill out the form do to it asking company related questions like address and phone number. Also I'm only going on by what yahoo told me on https://forum... [21:52:09] (03PS11) 10Zppix: Removal of "editusercssjs" in MW-CORE (Mediawiki 1.16 Depricated) Must merge with Ic9851d53affe0f4ece7a79f541ec5cb39133b109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 [21:52:13] (03CR) 10Zppix: Removal of "editusercssjs" in MW-CORE (Mediawiki 1.16 Depricated) Must merge with Ic9851d53affe0f4ece7a79f541ec5cb39133b109 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 (owner: 10Zppix) [21:53:50] (03PS12) 10Zppix: Removal of "editusercssjs" in MW-CORE (Mediawiki 1.16 Depricated) Should be merged with Ic9851d53affe0f4ece7a79f541ec5cb39133b109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 [21:57:33] 06Operations, 10Mail: Yahoo is blocking mail from wikimedia - https://phabricator.wikimedia.org/T160381#3096852 (10Bawolff) To clarify, paladox was asking for the full address of Wikimedia Foundation on irc, which made it sound like he was filling out some sort of form that is expected to be filled out by an o... 
[21:59:40] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, 06Discovery-Search (Current work): Update logstash on wikimedia to 5.x - https://phabricator.wikimedia.org/T154473#3096856 (10EBernhardson) Actually i was mis-reading the compatability matrix. For elasticsearch 2.3.x logstash is reported a... [22:15:39] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [22:23:19] PROBLEM - puppet last run on labvirt1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:28:22] 06Operations, 07Puppet, 06Labs, 10Traffic, 07Technical-Debt: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412#3096884 (10Ciencia_Al_Poder) [22:29:49] !log bawolff@tin Synchronized php-1.29.0-wmf.15/extensions/SemanticForms/includes/SF_ValuesUtils.php: Backport bb42c6f401b9 (duration: 00m 48s) [22:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:19] RECOVERY - puppet last run on labvirt1005 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [22:51:39] PROBLEM - puppet last run on mw1274 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170313T2300). Please do the needful. [23:05:00] ^ nothing to deploy [23:12:00] RainbowSprinkles should your irc name be updated from ostriches to RainbowSprinkles? 
^^ [23:18:37] (03PS3) 10Volans: Initial import [switchdc] - 10https://gerrit.wikimedia.org/r/342492 (https://phabricator.wikimedia.org/T160178) [23:19:39] RECOVERY - puppet last run on mw1274 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [23:20:24] (03CR) 10Volans: "@_joe_: I've added the possibility to run a single task or all tasks in a stage. It should not require any change on your code." [switchdc] - 10https://gerrit.wikimedia.org/r/342492 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [23:21:27] (03PS1) 10Jcrespo: mariadb: Repool db1054 with low weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342549 [23:29:44] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1054 with low weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342549 (owner: 10Jcrespo) [23:30:04] paladox: no point, he'll just change his IRC nick again next week [23:30:13] oh lol [23:31:15] (03Merged) 10jenkins-bot: mariadb: Repool db1054 with low weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342549 (owner: 10Jcrespo) [23:31:25] (03CR) 10jenkins-bot: mariadb: Repool db1054 with low weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342549 (owner: 10Jcrespo) [23:31:41] (03PS4) 10Volans: Initial import [switchdc] - 10https://gerrit.wikimedia.org/r/342492 (https://phabricator.wikimedia.org/T160178) [23:32:25] (03CR) 10Volans: "@_joe_: I've also moved the example task in doc/examples, no point to commit it in the real stages directory." [switchdc] - 10https://gerrit.wikimedia.org/r/342492 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [23:32:28] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1054 after upgrade with low weight (duration: 00m 41s) [23:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log