[01:29:48] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 83745.669558 Seconds
[01:29:48] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 83747.662735 Seconds
[01:29:58] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 83753.859079 Seconds
[01:30:28] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 83708.853332 Seconds
[01:30:28] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 83708.900511 Seconds
[01:30:28] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 83713.735369 Seconds
[01:39:28] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[01:42:28] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 84433.751575 Seconds
[01:51:48] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds
[01:51:48] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 38.904991 Seconds
[01:51:58] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 44.959266 Seconds
[01:52:28] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[01:52:28] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[01:52:28] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 4.929204 Seconds
[02:01:48] <icinga-wm>	 PROBLEM - puppet last run on mw1183 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:23:54] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.18) (duration: 08m 17s)
[02:24:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:48] <icinga-wm>	 RECOVERY - puppet last run on mw1183 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[02:41:58] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 1807.696802 Seconds
[02:42:48] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 1858.927911 Seconds
[02:42:58] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds
[02:43:26] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.19) (duration: 07m 32s)
[02:43:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:45:48] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds
[02:45:58] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 2047.673058 Seconds
[02:47:28] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 2010.687699 Seconds
[02:48:48] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 2218.95296 Seconds
[02:48:48] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 2221.06471 Seconds
[02:49:07] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Apr 10 02:49:06 UTC 2017 (duration 5m 40s)
[02:49:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:49:58] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds
[02:50:48] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds
[02:51:29] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[02:51:48] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds
[03:17:28] <icinga-wm>	 PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:24:59] <wikibugs>	 (PS4) Mobrovac: RESTBase: Migrate to Scap3 deployment [puppet] - https://gerrit.wikimedia.org/r/346248 (https://phabricator.wikimedia.org/T116335)
[03:37:06] <wikibugs>	 (CR) Chad: [C: -1] "I'm not a fan of this approach." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/347189 (owner: Paladox)
[03:40:24] <wikibugs__>	 (CR) Chad: [C: -1] "Identical to my comments on I22f902372db79abec006b01f5590a087b67641d7" [puppet] - https://gerrit.wikimedia.org/r/347188 (owner: Paladox)
[03:46:28] <icinga-wm>	 RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[03:56:28] <icinga-wm>	 PROBLEM - puppet last run on wtp1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:08:18] <icinga-wm>	 PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1758.20 Read Requests/Sec=5308.10 Write Requests/Sec=293.40 KBytes Read/Sec=22951.20 KBytes_Written/Sec=4159.20
[04:18:18] <icinga-wm>	 RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=65.00 Read Requests/Sec=5.30 Write Requests/Sec=10.60 KBytes Read/Sec=86.40 KBytes_Written/Sec=302.00
[04:23:38] <icinga-wm>	 PROBLEM - puppet last run on aqs1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:24:48] <icinga-wm>	 RECOVERY - puppet last run on wtp1016 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[04:33:48] <icinga-wm>	 PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 639 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3075262 keys, up 17 days 12 hours - replication_delay is 639
[04:51:38] <icinga-wm>	 RECOVERY - puppet last run on aqs1005 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[04:52:48] <icinga-wm>	 RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3059919 keys, up 17 days 12 hours - replication_delay is 4
[05:03:08] <icinga-wm>	 PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 617 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3059919 keys, up 17 days 12 hours - replication_delay is 617
[05:17:08] <icinga-wm>	 RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3060018 keys, up 17 days 13 hours - replication_delay is 0
[05:57:21] <wikibugs__>	 (PS1) Marostegui: db-eqiad.php: Depool db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/347316 (https://phabricator.wikimedia.org/T160390)
[05:58:07] <wikibugs__>	 (PS2) Marostegui: db-codfw.php: Add tempdb2001 to x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/346948 (https://phabricator.wikimedia.org/T162290)
[05:58:39] <wikibugs__>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/347316 (https://phabricator.wikimedia.org/T160390) (owner: Marostegui)
[05:59:57] <wikibugs__>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/347316 (https://phabricator.wikimedia.org/T160390) (owner: Marostegui)
[06:00:08] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/347316 (https://phabricator.wikimedia.org/T160390) (owner: Marostegui)
[06:00:58] <wikibugs>	 (CR) Marostegui: [C: 2] db-codfw.php: Add tempdb2001 to x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/346948 (https://phabricator.wikimedia.org/T162290) (owner: Marostegui)
[06:01:05] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1034 - T160390 (duration: 00m 39s)
[06:01:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:12] <stashbot>	 T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390
[06:01:21] <wikibugs__>	 (PS7) Marostegui: mariadb: Disable trx, sync_binlog for tempdb2001 [puppet] - https://gerrit.wikimedia.org/r/346773 (https://phabricator.wikimedia.org/T162290)
[06:02:15] <wikibugs>	 (Merged) jenkins-bot: db-codfw.php: Add tempdb2001 to x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/346948 (https://phabricator.wikimedia.org/T162290) (owner: Marostegui)
[06:02:24] <wikibugs>	 (CR) jenkins-bot: db-codfw.php: Add tempdb2001 to x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/346948 (https://phabricator.wikimedia.org/T162290) (owner: Marostegui)
[06:03:30] <wikibugs>	 (CR) Marostegui: [C: 2] mariadb: Disable trx, sync_binlog for tempdb2001 [puppet] - https://gerrit.wikimedia.org/r/346773 (https://phabricator.wikimedia.org/T162290) (owner: Marostegui)
[06:03:48] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add tempdb2001 to x1 as a slave - T162290 (duration: 00m 38s)
[06:03:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:55] <stashbot>	 T162290: setup  tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290
[06:05:39] <wikibugs>	 Operations, ops-codfw, DBA, hardware-requests: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3167374 (Marostegui)
[06:05:41] <wikibugs__>	 Operations, ops-codfw, DBA, Patch-For-Review: setup  tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3167372 (Marostegui) Open>Resolved Hi,   I have merged the patch to add it as a slave in codfw and the puppet temporary change to get it with sync_binlog=0 and innodb_flush...
[06:07:19] <marostegui>	 !log Deploy schema change db1034 (s7) - T160390
[06:07:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:27] <stashbot>	 T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390
[06:08:43] <wikibugs__>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - https://gerrit.wikimedia.org/r/347317
[06:13:33] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - https://gerrit.wikimedia.org/r/347317 (owner: Marostegui)
[06:14:38] <wikibugs__>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - https://gerrit.wikimedia.org/r/347317 (owner: Marostegui)
[06:14:51] <wikibugs>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - https://gerrit.wikimedia.org/r/347317 (owner: Marostegui)
[06:15:28] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1034 - T160390 (duration: 00m 38s)
[06:15:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:35] <stashbot>	 T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390
[06:18:14] <wikibugs__>	 (PS1) Marostegui: db-eqiad.php: Depool db1028 [mediawiki-config] - https://gerrit.wikimedia.org/r/347318 (https://phabricator.wikimedia.org/T160390)
[06:22:37] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1028 [mediawiki-config] - https://gerrit.wikimedia.org/r/347318 (https://phabricator.wikimedia.org/T160390) (owner: Marostegui)
[06:23:42] <wikibugs__>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1028 [mediawiki-config] - https://gerrit.wikimedia.org/r/347318 (https://phabricator.wikimedia.org/T160390) (owner: Marostegui)
[06:23:55] <wikibugs__>	 (CR) jenkins-bot: db-eqiad.php: Depool db1028 [mediawiki-config] - https://gerrit.wikimedia.org/r/347318 (https://phabricator.wikimedia.org/T160390) (owner: Marostegui)
[06:24:36] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1028 - T160390 (duration: 00m 39s)
[06:24:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:43] <stashbot>	 T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390
[06:24:48] <marostegui>	 !log Deploy schema change db1028 (s7) - T160390
[06:24:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:27] <marostegui>	 !log Deploy schema change labsdb1001 (s7) - T160390
[06:26:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:48] <icinga-wm>	 PROBLEM - puppet last run on labvirt1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:28:38] <icinga-wm>	 PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:44:10] <wikibugs>	 (Abandoned) Giuseppe Lavagetto: Initial stub of a dry_run functionality [switchdc] - https://gerrit.wikimedia.org/r/346953 (owner: Giuseppe Lavagetto)
[06:46:34] <wikibugs__>	 (CR) Giuseppe Lavagetto: [C: -2] "Use profile::gerrit::server in labs instead of the role." [puppet] - https://gerrit.wikimedia.org/r/347189 (owner: Paladox)
[06:47:00] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -2] "If so, we should create two separate roles, one including the backup profile and one that doesn't, or define the correct hiera data." [puppet] - https://gerrit.wikimedia.org/r/347188 (owner: Paladox)
[06:49:48] <icinga-wm>	 PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:51:04] <wikibugs__>	 (PS1) Marostegui: x1.hosts: Swap positions db2033 and tempdb2001 [software] - https://gerrit.wikimedia.org/r/347319
[06:54:03] <wikibugs>	 (CR) Marostegui: [C: 2] x1.hosts: Swap positions db2033 and tempdb2001 [software] - https://gerrit.wikimedia.org/r/347319 (owner: Marostegui)
[06:54:48] <icinga-wm>	 RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:55:03] <wikibugs>	 (Merged) jenkins-bot: x1.hosts: Swap positions db2033 and tempdb2001 [software] - https://gerrit.wikimedia.org/r/347319 (owner: Marostegui)
[06:57:38] <icinga-wm>	 RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:59:46] <wikibugs>	 (PS1) Marostegui: x1.hosts: Remove dbstore2002 from x1 [software] - https://gerrit.wikimedia.org/r/347321
[07:01:03] <wikibugs>	 (CR) Marostegui: [C: 2] x1.hosts: Remove dbstore2002 from x1 [software] - https://gerrit.wikimedia.org/r/347321 (owner: Marostegui)
[07:01:51] <wikibugs__>	 (Merged) jenkins-bot: x1.hosts: Remove dbstore2002 from x1 [software] - https://gerrit.wikimedia.org/r/347321 (owner: Marostegui)
[07:02:05] <moritzm>	 !log installing pam updates from jessie point update
[07:02:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:37] <wikibugs__>	 (CR) Giuseppe Lavagetto: [C: 1] Remove role::backup::config [puppet] - https://gerrit.wikimedia.org/r/347184 (owner: Alexandros Kosiaris)
[07:10:13] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "Good job, a small detail missing" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/346926 (owner: Dzahn)
[07:10:59] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 1] mediawiki::maintenance: convert to profile/role (WIP) [puppet] - https://gerrit.wikimedia.org/r/342777 (owner: Dzahn)
[07:14:48] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "Very good work but let's try to avoid if $realm conditionals" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/344728 (owner: Dzahn)
[07:18:48] <icinga-wm>	 RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[07:20:49] <wikibugs__>	 (PS11) Matthias Mullie: Add 3d2png deploy repo to image scalers [puppet] - https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: MarkTraceur)
[07:28:48] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:37:38] <icinga-wm>	 PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:37:50] <wikibugs__>	 (CR) Giuseppe Lavagetto: [C: -2] "I didn't finish reading the patch, but what I saw this far is enough to make me not like it:" (10 comments) [switchdc] - https://gerrit.wikimedia.org/r/347180 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[07:45:48] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:46:38] <icinga-wm>	 PROBLEM - puppet last run on mw1294 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:48:06] <wikibugs>	 Operations, Office-IT, LDAP: Make disabled accounts visible in the corp mirror LDAP replica - https://phabricator.wikimedia.org/T160158#3167544 (MoritzMuehlenhoff) > I do not think this is the issue because GCDS does not generate LDAP data and it does not change our LDAP data. The GCDS just queries o...
[07:48:18] <_joe_>	 !log testing a dry-run of the switchdc software on sarin
[07:48:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:29] <wikibugs>	 (PS4) Hashar: wmflib: switch to puppetlabs_spec_helper/rake_tasks [puppet] - https://gerrit.wikimedia.org/r/332475
[07:50:31] <wikibugs>	 (PS12) Hashar: wmflib: basic spec for os_version() [puppet] - https://gerrit.wikimedia.org/r/178810
[07:50:48] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:52:42] <wikibugs>	 (CR) Hashar: [C: 1] "Thank you for integrating my older change to raise an exception whenever lsb variables are missing. That saves a lot of headaches :)" [puppet] - https://gerrit.wikimedia.org/r/346673 (owner: Faidon Liambotis)
[07:53:48] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[08:04:37] <wikibugs>	 (PS1) Urbanecm: Create editprotected right on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/347330 (https://phabricator.wikimedia.org/T162577)
[08:06:38] <icinga-wm>	 RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[08:08:07] <wikibugs__>	 (PS1) Giuseppe Lavagetto: Fix redis directory lookup [switchdc] - https://gerrit.wikimedia.org/r/347331
[08:08:48] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[08:10:02] <wikibugs__>	 (CR) Giuseppe Lavagetto: [C: 2] Fix redis directory lookup [switchdc] - https://gerrit.wikimedia.org/r/347331 (owner: Giuseppe Lavagetto)
[08:16:38] <icinga-wm>	 RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[08:17:20] <wikibugs>	 (CR) Volans: "General reply. With the current very incoherent logging it's impossible to run in dry-run mode and be able to print a human-comprehensible" (2 comments) [switchdc] - https://gerrit.wikimedia.org/r/347180 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[08:21:58] <icinga-wm>	 PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:27:38] <icinga-wm>	 PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:37:48] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[08:39:36] <elukey>	 !log manual failover of Hadoop master daemons from analytics1001 to analytics1002 (T160333)
[08:39:38] <icinga-wm>	 PROBLEM - puppet last run on thumbor1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:39:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:44] <stashbot>	 T160333: Reimage the Hadoop Cluster to Debian Jessie - https://phabricator.wikimedia.org/T160333
[08:43:19] <gehel>	 !log rolling restart of maps-test cluster
[08:43:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:40] <gehel>	 !log reimage elastic2020 - T149006
[08:48:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:47] <stashbot>	 T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006
[08:49:39] <wikibugs>	 Operations, ops-codfw, DC-Ops, Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3167656 (ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2020.codfw.wmnet'...
[08:51:05] <godog>	 !log swift codfw-prod: bump ms-be2028 ms-be2039 object weight to 4000 - T158337
[08:51:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:12] <stashbot>	 T158337: codfw: ms-be2028-ms-be2039 rack/setup - https://phabricator.wikimedia.org/T158337
[08:51:58] <icinga-wm>	 RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[08:52:34] <wikibugs__>	 Operations, ops-codfw, DC-Ops, Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2739503 (Marostegui) This looks similar to: https://phabricator.wikimedia.org/T149553 Which took us quite some time to debug, but in the end...
[08:52:49] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=wdqs
[08:52:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:38] <icinga-wm>	 PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 878389 msg (=800000 warning): ocg_render_job_queue 3027 msg (=3000 critical)
[08:54:48] <icinga-wm>	 PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 878440 msg (=800000 warning): ocg_render_job_queue 3037 msg (=3000 critical)
[08:54:48] <icinga-wm>	 PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 878442 msg (=800000 warning): ocg_render_job_queue 3034 msg (=3000 critical)
[08:55:38] <icinga-wm>	 RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[08:59:38] <icinga-wm>	 PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 879692 msg (=800000 warning): ocg_render_job_queue 3017 msg (=3000 critical)
[08:59:48] <icinga-wm>	 PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 879746 msg (=800000 warning): ocg_render_job_queue 3033 msg (=3000 critical)
[08:59:48] <icinga-wm>	 PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 879754 msg (=800000 warning): ocg_render_job_queue 3040 msg (=3000 critical)
[09:02:37] <elukey>	 tons of jobs submitted? https://grafana.wikimedia.org/dashboard/db/ocg?orgId=1
[09:02:49] <elukey>	 not even sure what is the status of OCG nowadays
[09:06:21] <_joe_>	 ahahahh
[09:06:28] <_joe_>	 elukey: it's active!
[09:06:43] <_joe_>	 elukey: also, we're the only ones who should care, apparently
[09:07:17] <_joe_>	 elukey: tbh, I think something is wrong
[09:07:38] <icinga-wm>	 PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 881849 msg (=800000 warning): ocg_render_job_queue 3015 msg (=3000 critical)
[09:07:38] <icinga-wm>	 RECOVERY - puppet last run on thumbor1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[09:07:48] <icinga-wm>	 PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 881892 msg (=800000 warning): ocg_render_job_queue 3017 msg (=3000 critical)
[09:07:48] <icinga-wm>	 PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 881896 msg (=800000 warning): ocg_render_job_queue 3018 msg (=3000 critical)
[09:08:05] <_joe_>	 ok, 2k jobs/minute seems serious enough
[09:12:18] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 49, down: 5, dormant: 0, excluded: 0, unused: 0; xe-0/1/0: down - Transit: TeliaEU (IC-316335, donated) {#A0010239} [10Gbps]; xe-0/1/3: down - Core: cr2-eqiad:xe-4/1/3 (Level3, BDFS2448, 84ms) {#A0010621} [10Gbps wave]; xe-0/1/1: down - Peering: AMS-IX (EvoSwitch SMF.2-9/ab) {#SMF3836} [10Gbps DF]; ae2: down - AMS-IX; xe-0/1/2: d
[09:13:08] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/1/3: down - Core: cr2-esams:xe-0/1/3 (Level3, BDFS2448, 84ms) {#2013} [10Gbps wave]
[09:13:41] <volans>	 XioNoX: is this you? ^^^
[09:14:09] <elukey>	 volans: it might be maintenance
[09:14:16] <elukey>	 only one port down from L3
[09:14:46] <volans>	 elukey: I know it might, but better to SAL it if it is...  ;) 
[09:15:22] <XioNoX>	 elukey, volans, seems like all the 10G interfaces on that device are down. The DC hasn't confirmed they started the work, but they might have
[09:15:32] <elukey>	 okok :)
[09:15:38] <volans>	 great! :/
[09:15:44] <XioNoX>	 yeah, all are down, so I think they are unplugging the links to replace the card
[09:15:48] <icinga-wm>	 PROBLEM - Host cr2-esams is DOWN: CRITICAL - Network Unreachable (91.198.174.244)
[09:15:56] <volans>	 feel free to SAL the maintenance then XioNoX  :)
[09:16:10] <volans>	 (here with !log ... T12345)
[09:16:10] <stashbot>	 T12345: Create "annotation" namespace on Hebrew Wikisource - https://phabricator.wikimedia.org/T12345
[09:16:28] <gehel>	 !log rolling restart of maps2* cluster
[09:16:30] <volans>	 lol stashbot, 12345 was just an example :D
[09:16:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:09] <XioNoX>	 !log remote hands work started to replace the FPC on cr2-esams T162239
[09:17:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:16] <stashbot>	 T162239: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239
[09:17:20] <wikibugs__>	 Operations, ops-codfw, Performance-Team: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3167703 (fgiunchedi) a:fgiunchedi>Papaul thanks @robh! I've archived and copied `/var/lib/coal` to `/mnt/sde` and umounted the usb drive. @Papaul you  can plu...
[09:18:13] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2001.codfw.wmnet
[09:18:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:08] <icinga-wm>	 PROBLEM - Host cr2-esams IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:862:ffff::3)
[09:22:15] <volans>	 expected spike of 503s when the router went off: https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&orgId=1
[09:24:48] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK
[09:25:06] <wikibugs>	 Operations, HHVM, Patch-For-Review, Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3167727 (MoritzMuehlenhoff)
[09:27:59] <wikibugs__>	 Operations, HHVM, Upstream: HHVM segfault in memory cleanup - https://phabricator.wikimedia.org/T162586#3167730 (MoritzMuehlenhoff)
[09:29:36] <icinga-wm>	 PROBLEM - puppet last run on db1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:29:47] <wikibugs>	 Operations, Traffic: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#3167754 (Aklapper) Feel free to bring up any further discussion topics on the talk page of https://meta.wikimedia.org/wiki/Sustainability_Initiative which is the centralized place.
[09:29:55] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2001.codfw.wmnet
[09:30:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:16] <icinga-wm>	 PROBLEM - Check systemd state on elastic2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:31:26] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS on elastic2020 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2020.codfw.wmnet
[09:32:08] <gehel>	 elastic2020 is me... reimaged for testing, silencing... - T149006
[09:32:09] <stashbot>	 T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006
[09:33:18] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2002.codfw.wmnet
[09:33:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:26] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[09:39:36] <icinga-wm>	 RECOVERY - Host cr2-esams is UP: PING OK - Packet loss = 0%, RTA = 84.23 ms
[09:40:16] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0
[09:41:36] <elukey>	 welcome back cr2-esams :)
[09:41:46] <icinga-wm>	 RECOVERY - Host cr2-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.90 ms
[09:43:46] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[09:44:28] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2002.codfw.wmnet
[09:44:29] <XioNoX>	 !log all interfaces back up on cr2-esams, BGP sessions up as well T162239
[09:44:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:38] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2003.codfw.wmnet
[09:44:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:41] <stashbot>	 T162239: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239
[09:44:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:42] <wikibugs__>	 (CR) Alexandros Kosiaris: [C: 2] Remove role::backup::config [puppet] - https://gerrit.wikimedia.org/r/347184 (owner: Alexandros Kosiaris)
[09:48:46] <wikibugs>	 (PS2) Alexandros Kosiaris: Remove role::backup::config [puppet] - https://gerrit.wikimedia.org/r/347184
[09:48:52] <wikibugs__>	 (CR) Alexandros Kosiaris: [V: 2 C: 2] Remove role::backup::config [puppet] - https://gerrit.wikimedia.org/r/347184 (owner: Alexandros Kosiaris)
[09:52:37] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2003.codfw.wmnet
[09:52:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:50] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2004.codfw.wmnet
[09:52:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:04] <wikibugs__>	 (03CR) 10Alexandros Kosiaris: [C: 04-2] "Just to point out that Giuseppe's -2 is justified based on" [puppet] - 10https://gerrit.wikimedia.org/r/347188 (owner: 10Paladox)
[09:53:13] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-2] "Just to point out that Giuseppe's -2 is justified based on" [puppet] - 10https://gerrit.wikimedia.org/r/347189 (owner: 10Paladox)
[09:57:36] <icinga-wm>	 RECOVERY - puppet last run on db1049 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[09:57:58] <wikibugs__>	 (03CR) 10Alexandros Kosiaris: [C: 031] standardize "include ::base:*" [puppet] - 10https://gerrit.wikimedia.org/r/347024 (owner: 10Dzahn)
[09:59:48] <elukey>	 from https://grafana.wikimedia.org/dashboard/db/ocg it seems that the pending jobs are going down
[09:59:52] <elukey>	 as FYI
[10:01:09] <gehel>	 !log rolling restart of maps1* (eqiad) cluster
[10:01:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:12] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2004.codfw.wmnet
[10:02:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:25] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1001.eqiad.wmnet
[10:03:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:54] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] lists: convert to role/profile structure (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/346923 (owner: 10Dzahn)
[10:10:46] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[10:11:54] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1001.eqiad.wmnet
[10:11:56] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] changeprop: Add an ores_uris parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/345826 (https://phabricator.wikimedia.org/T159615) (owner: 10Alexandros Kosiaris)
[10:11:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:02] <wikibugs__>	 (03PS3) 10Alexandros Kosiaris: changeprop: Add an ores_uris parameter [puppet] - 10https://gerrit.wikimedia.org/r/345826 (https://phabricator.wikimedia.org/T159615)
[10:12:03] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1002.eqiad.wmnet
[10:12:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:57] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: changeprop: Add an ores_uris parameter [puppet] - 10https://gerrit.wikimedia.org/r/345826 (https://phabricator.wikimedia.org/T159615)
[10:16:59] <wikibugs__>	 (03PS2) 10Alexandros Kosiaris: changeprop: Remove the ores_uri parameter [puppet] - 10https://gerrit.wikimedia.org/r/345827 (https://phabricator.wikimedia.org/T159615)
[10:17:21] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] etherpad: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/346921 (owner: 10Dzahn)
[10:17:26] <wikibugs__>	 (03PS5) 10Alexandros Kosiaris: etherpad: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/346921 (owner: 10Dzahn)
[10:17:33] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] etherpad: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/346921 (owner: 10Dzahn)
[10:19:49] <_joe_>	  /win 19
[10:23:16] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1002.eqiad.wmnet
[10:23:23] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1003.eqiad.wmnet
[10:23:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:50] <wikibugs__>	 06Operations, 06Labs: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3167816 (10akosiaris) >>! In T162462#3164242, @faidon wrote: > My suggestion, which needs a little more time to be fully tested is: > - Take the latest 3.8 jessie-backport (from snaps...
[10:27:36] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:29:06] <icinga-wm>	 PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:31:59] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1003.eqiad.wmnet
[10:32:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:06] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1004.eqiad.wmnet
[10:32:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:12] <wikibugs__>	 (03CR) 10Alexandros Kosiaris: [C: 032] apertium-spa: New upstream release [debs/contenttranslation/apertium-spa] - 10https://gerrit.wikimedia.org/r/346748 (https://phabricator.wikimedia.org/T161511) (owner: 10KartikMistry)
[10:32:14] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] apertium-cat: New upstream release [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/346696 (https://phabricator.wikimedia.org/T161511) (owner: 10KartikMistry)
[10:32:26] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.003 second response time
[10:33:26] <elukey>	 https://grafana-admin.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=videoscaler&var-instance=mw1169&from=now-1h&to=now
[10:34:01] <elukey>	 so mw1169 is a videoscaler, the load spiked to 20 (max hhvm threads configured) and then it started queueing for a bit
[10:34:06] <wikibugs__>	 (03PS1) 10Giuseppe Lavagetto: Actually select parsoid hosts when doing a rolling restart [switchdc] - 10https://gerrit.wikimedia.org/r/347342
[10:34:08] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Send output of systemctl to /dev/null [switchdc] - 10https://gerrit.wikimedia.org/r/347343
[10:34:10] <wikibugs__>	 (03PS1) 10Giuseppe Lavagetto: Verify jobrunner is stopped on the videoscalers as well [switchdc] - 10https://gerrit.wikimedia.org/r/347344
[10:34:12] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Use pgrep -c as we don't care about the output [switchdc] - 10https://gerrit.wikimedia.org/r/347345
[10:37:21] <wikibugs__>	 (03CR) 10Paladox: "@Alexandros Kosiaris and @Giuseppe Lavagetto hi, the phabricator class has not been converted to profile yet. So this means that the new r" [puppet] - 10https://gerrit.wikimedia.org/r/347188 (owner: 10Paladox)
[10:38:47] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK
[10:40:46] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[10:41:36] <icinga-wm>	 PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[10:41:36] <wikibugs__>	 (03CR) 10Volans: [C: 032] Actually select parsoid hosts when doing a rolling restart [switchdc] - 10https://gerrit.wikimedia.org/r/347342 (owner: 10Giuseppe Lavagetto)
[10:43:18] <wikibugs>	 (03CR) 10Paladox: "@Giuseppe Lavagetto using just the profile class will break. Because backup::set is in the profile class. Even though it says != undef tha" [puppet] - 10https://gerrit.wikimedia.org/r/347189 (owner: 10Paladox)
[10:43:55] <paladox>	 _joe_ Hi, see ^^
[10:44:19] <paladox>	 I can't use the profile, as backup::set is used there, which will then lead to a "can't find that" error.
[10:45:23] <wikibugs__>	 (03CR) 10Volans: [C: 032] Send output of systemctl to /dev/null [switchdc] - 10https://gerrit.wikimedia.org/r/347343 (owner: 10Giuseppe Lavagetto)
[10:45:53] <_joe_>	 paladox: then we need to fix that profile to work when no backup is available, I guess
[10:46:09] <_joe_>	 paladox: but then your patch is wrong too 
[10:46:24] <paladox>	 Oh.
[10:46:44] <_joe_>	 paladox: meaning your patch won't work either
[10:46:51] <paladox>	 It works for me
[10:47:03] <paladox>	 tested on puppetmaster
[10:47:25] <_joe_>	 oh you also patched the profile
[10:47:36] <icinga-wm>	 RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:47:38] <_joe_>	 ok, so we just need your patch to fix the profile
[10:48:00] <paladox>	 Oh, I haven't patched the profile
[10:48:03] <_joe_>	 but I would not use the if $::realm != 'labs' there
[10:48:07] <paladox>	 I meant i tested ^^
[10:48:11] <wikibugs>	 (03CR) 10Volans: [C: 032] Verify jobrunner is stopped on the videoscalers as well [switchdc] - 10https://gerrit.wikimedia.org/r/347344 (owner: 10Giuseppe Lavagetto)
[10:48:12] <_joe_>	 https://gerrit.wikimedia.org/r/#/c/347189/5/modules/profile/manifests/gerrit/server.pp
[10:48:12] <paladox>	 oh
[10:48:43] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1004.eqiad.wmnet
[10:48:43] <paladox>	 _joe_ but as we don't set a default now
[10:48:44] <paladox>	 $bacula = hiera('gerrit::server::bacula'),
[10:48:46] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:48:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:56] <paladox>	 doing gerrit::server::bacula: undef
[10:49:02] <paladox>	 will not work in hiera
[10:49:02] <wikibugs>	 (03CR) 10Daniel Kinzler: [C: 031] "agree with intent" [puppet] - 10https://gerrit.wikimedia.org/r/347234 (https://phabricator.wikimedia.org/T155103) (owner: 10Hoo man)
[10:49:11] <paladox>	 I tried doing that
[10:49:36] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 2368.673321 Seconds
[10:49:46] <paladox>	 in https://wikitech.wikimedia.org/wiki/Hiera:Git
[10:49:59] <paladox>	 i had to resort to doing "gerrit::server::bacula": srv-gerrit-git
[10:50:14] <_joe_>	 paladox: gerrit::server::bacula: false maybe?
[10:50:22] <paladox>	 Hmm, let me try that
[10:51:36] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[10:52:08] <_joe_>	 so that conditional will evaluate to false
[10:54:34] <paladox>	 _joe_ that doesn't seem to work
[10:54:34] <paladox>	 Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Puppet::Parser::AST::Resource failed with error ArgumentError: No title provided and "backup::set" is not a valid resource reference at /etc/puppet/modules/profile/manifests/gerrit/server.pp:62 on node gerrit-test3.git.eqiad.wmflabs
[10:54:34] <paladox>	 Warning: Not using cache on failed catalog
[10:54:35] <paladox>	 Error: Could not retrieve catalog; skipping run
[10:54:43] <paladox>	 "gerrit::server::bacula": false
[10:54:46] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[10:55:43] <_joe_>	 paladox: sorry, I can't really work on this now, but my suggestion is to modify profile::gerrit::server to make it easy to include/exclude the backup
[10:56:02] <paladox>	 ok
[10:56:15] <paladox>	 I don't think it will be easy unless we can set defaults.
[10:56:28] <akosiaris>	 it's not easy
[10:56:48] <akosiaris>	 the entire backup thing where there are conflicting requirements and no sane defaults is not easy
[10:57:06] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[10:57:10] <paladox>	 The class worked before we moved it to profile
[10:57:18] <akosiaris>	 by conflicting requirements I mean, production wants to have it enabled, labs does not but no sane defaults exist
[10:57:27] <akosiaris>	 cause it was an ugly hack by me
[10:57:47] <akosiaris>	 it was pretending labs wanted backups
[10:57:52] <akosiaris>	 which was not true
[10:58:02] <paladox>	 oh
[10:58:06] <icinga-wm>	 RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[10:58:42] <gehel>	 !log starting load test on elastic2020 - T149006
[10:58:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:50] <stashbot>	 T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006
[10:59:09] <akosiaris>	 paladox: let me have a look at how to fix that, maybe I can think of something
[10:59:21] <paladox>	 Ok thanks :)
[11:00:28] <paladox>	 akosiaris changing != undef to != false seemed to work.
[11:00:36] <akosiaris>	 !log upload apertium-cat_2.0.0~r77286-1+wmf1, apertium-spa_1.0.0~r77293-1+wmf1 on apt.wikimedia.org/jessie-wikimedia
[11:00:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:18] <paladox>	 Puppet has bad caching
[11:01:19] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-spa-cat] - 10https://gerrit.wikimedia.org/r/346780 (https://phabricator.wikimedia.org/T161511) (owner: 10KartikMistry)
[11:01:36] <paladox>	 setting gerrit::server::bacula to true still passes puppet
[11:01:36] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 3088.644616 Seconds
[11:01:38] <wikibugs__>	 (03CR) 10Volans: [C: 032] Use pgrep -c as we don't care about the output [switchdc] - 10https://gerrit.wikimedia.org/r/347345 (owner: 10Giuseppe Lavagetto)
[11:01:56] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[11:02:37] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[11:03:40] <wikibugs>	 06Operations, 06Labs: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3167853 (10akosiaris) I 'll do the first task from above (get the 3.8 bpo and put it in jessie-wikimedia/main)  and possibly do the puppet upgrade on the jessie hosts.
[11:05:36] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 3328.674771 Seconds
[11:05:45] <akosiaris>	 paladox: yeah, I'll take a look later when my queue of TODOs is a bit emptier. I think some more architectural changes make sense
[11:06:02] <paladox>	 ok
[11:06:03] <paladox>	 thanks
[11:09:36] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[11:10:44] <wikibugs>	 06Operations, 10Traffic, 10Wikimedia-Logstash, 13Patch-For-Review: Investigate 502 errors from nginx when backend returns 302 - https://phabricator.wikimedia.org/T161819#3167863 (10fgiunchedi) p:05Triage>03Normal
[11:11:27] <wikibugs__>	 06Operations, 06Commons, 10Traffic, 10Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517#3167864 (10fgiunchedi) p:05Triage>03Normal
[11:11:46] <akosiaris>	 !log upload puppet_3.8.5-2~bpo8+1 on apt.wikimedia.org jessie-wikimedia/main
[11:11:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:43] <wikibugs>	 (03PS1) 10KartikMistry: Rename apertium-es-ca -> apertium-spa-cat [puppet] - 10https://gerrit.wikimedia.org/r/347351 (https://phabricator.wikimedia.org/T161511)
[11:17:32] <paladox>	 akosiaris hi, trying to upgrade puppet on the puppetmaster results in puppet : Breaks: facter (< 2.4.0~) but 2.2.0-1 is to be installed
[11:18:37] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 4108.550043 Seconds
[11:20:04] <paladox>	 Jessie backport has 2.4 https://packages.debian.org/jessie-backports/facter but not sure how to install that.
[11:21:36] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[11:24:28] <wikibugs__>	 (03CR) 10Faidon Liambotis: "Interesting! I wasn't aware of this old patch. Minor nit: shouldn't ruby code follow a 2-space rather than 4-space indentation?" [puppet] - 10https://gerrit.wikimedia.org/r/178810 (owner: 10Hashar)
[11:24:58] <paravoid>	 hashar: ^
[11:25:15] <hashar>	 paravoid: hello!
[11:25:30] <hashar>	 yeah, that is an old bit-rotting change I have kept in my Gerrit attic for a while
[11:25:37] <hashar>	 for indentation, I don't mind changing to 2 spaces
[11:25:57] <paravoid>	 dunno, I think that's what we follow elsewhere too
[11:26:03] <hashar>	 I guess it depends whether one applies the 4-space convention we have in puppet or the 2 spaces from ruby :D
[11:26:17] <hashar>	 let me adjust
[11:27:36] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 4648.532983 Seconds
[11:27:58] <wikibugs>	 (03PS13) 10Hashar: wmflib: basic spec for os_version() [puppet] - 10https://gerrit.wikimedia.org/r/178810
[11:28:29] <wikibugs__>	 (03CR) 10Hashar: "Changed indentation to two spaces to be consistent with other ruby files :]  Thanks Faidon!" [puppet] - 10https://gerrit.wikimedia.org/r/178810 (owner: 10Hashar)
[11:28:36] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[11:28:47] <hashar>	 the tests themselves are rather lame but should give a good enough coverage
[11:29:04] <hashar>	 then wmflib has some other specs failing  :/
[11:29:07] <wikibugs>	 (03CR) 10Faidon Liambotis: [C: 032] wmflib: basic spec for os_version() [puppet] - 10https://gerrit.wikimedia.org/r/178810 (owner: 10Hashar)
[11:29:41] <hashar>	 there is also a parent change that replaces the modules/wmflib/Rakefile to use puppetlabs_spec_helper
[11:30:00] <paravoid>	 I saw
[11:30:45] <paravoid>	 I'm not much of a spec/rake expert, akosiaris is the person most familiar with that I think
[11:31:06] <hashar>	 ah right, I have added him as a reviewer
[11:31:32] <wikibugs>	 (03PS1) 10Elukey: Add JVM options tunables for Yarn RM and Hadoop DN/NN [puppet/cdh] - 10https://gerrit.wikimedia.org/r/347353 (https://phabricator.wikimedia.org/T159219)
[11:31:36] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 4888.471615 Seconds
[11:34:36] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[11:37:36] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:40:36] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 5428.662417 Seconds
[11:41:36] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[11:41:37] <hashar>	 jouncebot: refresh
[11:41:40] <jouncebot>	 I refreshed my knowledge about deployments.
[11:41:45] <hashar>	 jouncebot: next
[11:41:45] <jouncebot>	 In 1 hour(s) and 18 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170410T1300)
[11:42:39] <wikibugs__>	 (03PS2) 10Hashar: Increase default thumb size to 250px at nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334787 (https://phabricator.wikimedia.org/T155892) (owner: 10Urbanecm)
[11:43:11] <wikibugs>	 (03CR) 10Hashar: [C: 031] "That has been cleared out:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334787 (https://phabricator.wikimedia.org/T155892) (owner: 10Urbanecm)
[11:43:26] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[11:43:44] <wikibugs__>	 (03CR) 10Hashar: [C: 031] Create editprotected right on ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347330 (https://phabricator.wikimedia.org/T162577) (owner: 10Urbanecm)
[11:43:46] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add a free-form 'any' type [software/conftool] - 10https://gerrit.wikimedia.org/r/347356 (https://phabricator.wikimedia.org/T156924)
[11:46:36] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 5788.605667 Seconds
[11:47:36] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[11:50:36] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 6028.629504 Seconds
[11:50:47] <_joe_>	 sigh
[11:53:56] <icinga-wm>	 PROBLEM - puppet last run on es2016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata]
[11:56:36] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[11:59:24] <Urbanecm_>	 hashar: may I ask for earlier start (if possible now) of SWAT? It
[11:59:27] <Urbanecm_>	 It
[11:59:30] <Urbanecm_>	 Ehh
[11:59:36] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 6568.621692 Seconds
[11:59:41] <Urbanecm_>	 It'll help me :)
[12:00:09] <hashar>	 Urbanecm_: yup certainly
[12:00:22] <Urbanecm_>	 Thanks!
[12:01:06] <hashar>	 we can surely do [config] 347330 Create editprotected right on ptwikinews
[12:01:24] <hashar>	 the one to raise thumb size I can handle it with ops during the regular window
[12:01:43] <wikibugs>	 (03PS2) 10BBlack: tlsproxy: double the response buffer size [puppet] - 10https://gerrit.wikimedia.org/r/345767 (https://phabricator.wikimedia.org/T161819)
[12:02:19] <hashar>	 Urbanecm_: and maybe we should move the swat window an hour earlier ( 2 PM CEST )
[12:02:43] <Urbanecm_>	 So you won't need me? I must go to the train and didn't expect it
[12:03:00] <hashar>	 Urbanecm_: yeah catch your train!!  it is more important :]
[12:03:00] <Urbanecm_>	 hashar: you mean permanent moving?
[12:03:15] <hashar>	 I will figure out how to verify the ptwikinews
[12:03:32] <hashar>	 and the thumb size bump looks easy as well
[12:03:54] <wikibugs__>	 (03CR) 10BBlack: [C: 032] tlsproxy: double the response buffer size [puppet] - 10https://gerrit.wikimedia.org/r/345767 (https://phabricator.wikimedia.org/T161819) (owner: 10BBlack)
[12:04:02] <hashar>	 I could reach out to people participating in the european swat and see whether an hour earlier would be better
[12:04:09] <Urbanecm_>	 Okay
[12:04:14] <hashar>	 currently it is at 3pm, right in the middle of the afternoon
[12:04:30] <Urbanecm_>	 Regarding ptwikinews: I think verifying Special:usergrouprights should do
[12:04:31] <hashar>	 now I guess, rush to your train!!! :]
[12:05:24] <Urbanecm_>	 Nope. I can't be here during the regular window but can be here now.
[12:06:41] <hashar>	 Urbanecm_: I will handle them both  dont worry :]
[12:06:56] <hashar>	 and will make sure to !log to the relevant tasks
[12:07:03] <hashar>	 thanks for all the patches !
[12:07:04] <Urbanecm_>	 The departure time is in an hour (so a few minutes after the window starts).
[12:07:19] <Urbanecm_>	 You're welcome
[12:07:36] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[12:07:50] <wikibugs__>	 (03PS2) 10BBlack: VCL: do not allow empty url when un-proxying [puppet] - 10https://gerrit.wikimedia.org/r/339648
[12:08:09] <wikibugs__>	 (03CR) 10BBlack: [V: 032 C: 032] VCL: do not allow empty url when un-proxying [puppet] - 10https://gerrit.wikimedia.org/r/339648 (owner: 10BBlack)
[12:10:37] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 7228.538775 Seconds
[12:10:55] <akosiaris>	 paladox: https://phabricator.wikimedia.org/T162462
[12:11:13] <paladox>	 ok thanks
[12:14:42] <wikibugs__>	 (03PS1) 10Giuseppe Lavagetto: conftool: add mwconfig object type, define the first couple variables [puppet] - 10https://gerrit.wikimedia.org/r/347360
[12:18:16] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 1811.999282 Seconds
[12:18:46] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 1842.711178 Seconds
[12:18:56] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 1856.25057 Seconds
[12:20:36] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[12:21:51] <wikibugs>	 06Operations, 06Labs: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3168039 (10akosiaris) The package has been uploaded on jessie-wikimedia/backports as of a while ago and some basic checks seem to be fine.   For what is worth puppet-master is a new p...
[12:22:06] <icinga-wm>	 RECOVERY - puppet last run on es2016 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[12:22:31] <wikibugs__>	 (03PS3) 10Gehel: postgresql - introduce the check-postgres package for postgres monitoring [puppet] - 10https://gerrit.wikimedia.org/r/346962 (https://phabricator.wikimedia.org/T162345)
[12:23:36] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 8008.488649 Seconds
[12:25:29] <wikibugs__>	 (03CR) 10Gehel: [C: 032] postgresql - introduce the check-postgres package for postgres monitoring [puppet] - 10https://gerrit.wikimedia.org/r/346962 (https://phabricator.wikimedia.org/T162345) (owner: 10Gehel)
[12:29:33] <wikibugs__>	 (03Abandoned) 10Ema: cp1008: override cache::route_table [puppet] - 10https://gerrit.wikimedia.org/r/346733 (owner: 10Ema)
[12:29:46] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[12:30:36] <icinga-wm>	 PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[12:33:46] <icinga-wm>	 PROBLEM - puppet last run on notebook1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tshark]
[12:34:04] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] apertium-spa-cat: New upstream release [debs/contenttranslation/apertium-spa-cat] - 10https://gerrit.wikimedia.org/r/346780 (https://phabricator.wikimedia.org/T161511) (owner: 10KartikMistry)
[12:36:21] <wikibugs__>	 (03PS9) 10BBlack: dnsrecursor: 4.x backport and edns-client-subnet [puppet] - 10https://gerrit.wikimedia.org/r/346937
[12:36:23] <wikibugs>	 (03PS6) 10BBlack: dnsrecursor: update to backports for transition [puppet] - 10https://gerrit.wikimedia.org/r/346980
[12:36:44] <ema>	 the 5xx spike above seems to be mostly due to phabricator ^
[12:37:16] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds
[12:37:57] <bblack>	 it's the source of the 500 on misc I guess?
[12:38:04] <ema>	 yep
[12:38:07] <bblack>	 there's also some other odd small bumps
[12:38:11] <bblack>	 e.g. 501 on upload
[12:38:36] <icinga-wm>	 PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[12:38:37] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: (OK - Rep Delay is:, 0.0, Seconds,
[12:38:43] <wikibugs__>	 (03PS1) 10Gehel: postgresql - replication lag needs to use --output=simple [puppet] - 10https://gerrit.wikimedia.org/r/347364
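(Editor's note, not part of the original channel traffic.) Around this point the replication lag alerts appear in two shapes: the earlier "Rep Delay is: 83745.669558 Seconds" and a tuple-like "(OK - Rep Delay is:, 0.0, Seconds,". The following is a minimal sketch, assuming only those two shapes as they appear in this log, of how a reader might pull the delay out of either one. The name `parse_rep_delay` is hypothetical and is not taken from the Icinga check or from the patch mentioned above.

```python
import re
from typing import Optional

# Accepts both message shapes seen in this log:
#   "CRITICAL - Rep Delay is: 83745.669558 Seconds"
#   "(OK - Rep Delay is:, 0.0, Seconds,"
_REP_DELAY = re.compile(r"Rep Delay is:,?\s*([0-9]+(?:\.[0-9]+)?),?\s*Seconds")


def parse_rep_delay(message: str) -> Optional[float]:
    """Return the replication delay in seconds, or None if the text has none."""
    match = _REP_DELAY.search(message)
    return float(match.group(1)) if match else None
```

Calling `parse_rep_delay` on an unrelated alert line returns `None`, so the helper can be mapped over a whole log without pre-filtering.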
[12:38:46] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[12:39:36] <icinga-wm>	 RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
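(Editor's note, not part of the original channel traffic.) The 5xx alerts above report the share of datapoints over a threshold: "11.11% of data above the critical threshold [1000.0]", recovering at "Less than 1.00% above the threshold [250.0]". A rough sketch of that style of check follows. The nine-sample window is an assumption chosen because one spike in nine samples gives 11.11%; the log does not state the real window size, and `percent_above` is a hypothetical helper, not the production check.

```python
from typing import Sequence


def percent_above(datapoints: Sequence[float], threshold: float) -> float:
    """Percentage of datapoints strictly above the threshold."""
    if not datapoints:
        return 0.0
    over = sum(1 for value in datapoints if value > threshold)
    return 100.0 * over / len(datapoints)


# Hypothetical window: nine samples of 5xx reqs/min with a single spike.
window = [120.0, 95.0, 110.0, 1450.0, 130.0, 90.0, 105.0, 100.0, 98.0]
print(f"{percent_above(window, 1000.0):.2f}% of data above the critical threshold")
```

Under this assumption a single bad sample is enough to trip the alert, which would fit the quick flapping between PROBLEM and RECOVERY seen here.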
[12:40:16] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 3132.134867 Seconds
[12:40:39] <akosiaris>	 !log upload apertium-spa-cat_2.0.0~r77288-2+wmf1 on apt.wikimedia.org jessie-wikimedia/main
[12:40:45] <akosiaris>	 kart_: all 3 uploaded 
[12:40:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:53] <ema>	 I've captured one of the phab 500s: https://phabricator.wikimedia.org/P5231
[12:41:01] <kart_>	 akosiaris: nice!
[12:41:31] <kart_>	 akosiaris: need to rename package, but that's not urgent. do let me know when it is merged.
[12:41:53] <kart_>	 akosiaris: we may need to restart apertium-apy service.
[12:43:02] <akosiaris>	 kart_: you mean https://gerrit.wikimedia.org/r/347351 ?
[12:43:08] <akosiaris>	 I can merge it now
[12:43:18] <kart_>	 Yes.
[12:43:22] <kart_>	 Thanks!
[12:43:28] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Rename apertium-es-ca -> apertium-spa-cat [puppet] - 10https://gerrit.wikimedia.org/r/347351 (https://phabricator.wikimedia.org/T161511) (owner: 10KartikMistry)
[12:43:34] <wikibugs__>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Rename apertium-es-ca -> apertium-spa-cat [puppet] - 10https://gerrit.wikimedia.org/r/347351 (https://phabricator.wikimedia.org/T161511) (owner: 10KartikMistry)
[12:45:01] <wikibugs>	 06Operations, 10Traffic, 10Wikimedia-Logstash, 13Patch-For-Review: Investigate 502 errors from nginx when backend returns 302 - https://phabricator.wikimedia.org/T161819#3168117 (10BBlack) 05Open>03Resolved a:03BBlack Merge above should fix this, at least for this case and any others on our cache ter...
[12:46:24] <bblack>	 looking at the phab exception in a browser, the response to URL: https://phabricator.wikimedia.org/source/RevisionSlider/history/wmf%252F1.29.0-wmf.13/jsduck.json
[12:46:30] <bblack>	 is a 500 error which says in the html text:
[12:46:36] <bblack>	 Unhandled Exception ("DiffusionRefNotFoundException")	
[12:46:36] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: (CRITICAL - Rep Delay is:, 9388.562921, Seconds,
[12:46:37] <bblack>	 Ref "wmf/1.29.0-wmf.13" does not exist in this repository.
[12:46:51] <ema>	 I've also just got this in my phab activity feed:
[12:46:53] <ema>	 PhabricatorClusterStrandedException: Unable to establish a connection to any database host (while trying "phabricator_file"). All masters and replicas are completely unreachable.
[12:47:30] <bblack>	 (it seems odd to me that an unfound thing throws an unhandled exception that becomes a 500, instead of some kind of caught exception that results in 4xx)
[12:50:36] <icinga-wm>	 PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[12:51:46] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: (OK - Rep Delay is:, 0.0, Seconds,
[12:52:36] <icinga-wm>	 PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[12:53:33] <phuedx>	 jouncebot next
[12:53:33] <jouncebot>	 In 0 hour(s) and 6 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170410T1300)
[12:54:24] <wikibugs>	 (CR) Alexandros Kosiaris: [C: 2] wmflib: switch to puppetlabs_spec_helper/rake_tasks [puppet] - https://gerrit.wikimedia.org/r/332475 (owner: Hashar)
[12:54:36] <icinga-wm>	 PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[12:54:46] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: (CRITICAL - Rep Delay is:, 4003.747921, Seconds,
[12:54:46] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[12:55:56] <marostegui>	 !log Run pt-table-checksum on s4 - T162593
[12:56:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:56:03] <stashbot>	 T162593: Run pt-table-checksum on s4 (commonswiki) - https://phabricator.wikimedia.org/T162593
[12:56:46] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: (OK - Rep Delay is:, 0.0, Seconds,
[12:59:46] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: (CRITICAL - Rep Delay is:, 4303.665043, Seconds,
[12:59:58] <wikibugs>	 Operations, Traffic, netops: knams equipment move - https://phabricator.wikimedia.org/T162601#3168183 (ayounsi)
[13:00:04] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170410T1300). Please do the needful.
[13:00:04] <jouncebot>	 hashar, Urbanecm, and phuedx: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:00:26] <wikibugs__>	 Operations, DBA, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3168199 (Marostegui)
[13:00:46] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[13:00:50] <hashar>	 o/
[13:00:56] <phuedx>	 o/
[13:00:58] <twentyafterfour>	 !log stopped search indexer on iridium to lighten load on m3 databases.
[13:01:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:33] <ema>	 twentyafterfour: hi! Looks like you're already aware of the phab issues
[13:01:39] <wikibugs__>	 Operations, Prometheus-metrics-monitoring, User-fgiunchedi: Effects on adjusting Prometheus retention - https://phabricator.wikimedia.org/T160677#3168202 (fgiunchedi) p:Triage>Normal
[13:01:41] <zeljkof>	 o/
[13:01:44] <hashar>	 phuedx: let's do yours :)
[13:01:53] <wikibugs__>	 (PS2) Hashar: pagePreviews: Enable NavPopups gadget detection [mediawiki-config] - https://gerrit.wikimedia.org/r/347291 (https://phabricator.wikimedia.org/T160081) (owner: Phuedx)
[13:01:58] <twentyafterfour>	 ema: yeah the phab database is pretty heavily loaded, I think the search indexer was at least part of the problem
[13:02:01] <hashar>	 though I am not entirely sure what it does :)
[13:02:11] <twentyafterfour>	 I've been backfilling the elasticsearch index for a while
[13:02:16] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: (OK - Rep Delay is:, 0.0, Seconds,
[13:02:36] <icinga-wm>	 PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[13:02:41] <ema>	 twentyafterfour: OK. If it helps debugging, we've had a few 500 spikes today: https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=2&fullscreen&orgId=1&from=1491818336930&to=1491829320586&var-site=All&var-cache_type=misc&var-status_type=5
[13:02:45] <wikibugs__>	 (PS2) Alexandros Kosiaris: wmflib: multiple os_version changes [puppet] - https://gerrit.wikimedia.org/r/346673 (owner: Faidon Liambotis)
[13:02:47] <icinga-wm>	 RECOVERY - puppet last run on notebook1001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[13:02:51] <wikibugs>	 (CR) Alexandros Kosiaris: [V: 2 C: 2] wmflib: multiple os_version changes [puppet] - https://gerrit.wikimedia.org/r/346673 (owner: Faidon Liambotis)
[13:02:57] <wikibugs__>	 Operations: videoscalers (mw1168, mw1169) - high load / overheating - https://phabricator.wikimedia.org/T161918#3168223 (fgiunchedi) p:Triage>Normal
[13:03:19] <phuedx>	 hashar: related task: https://phabricator.wikimedia.org/T160081
[13:03:46] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:03:47] <wikibugs>	 (CR) Gehel: [C: 2] postgresql - replication lag needs to use --output=simple [puppet] - https://gerrit.wikimedia.org/r/347364 (owner: Gehel)
[13:03:51] <wikibugs__>	 (CR) Hashar: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/347291 (https://phabricator.wikimedia.org/T160081) (owner: Phuedx)
[13:03:53] <wikibugs>	 (PS2) Gehel: postgresql - replication lag needs to use --output=simple [puppet] - https://gerrit.wikimedia.org/r/347364
[13:04:01] <ema>	 twentyafterfour: perhaps related to the recent upgrade?
[13:04:12] <wikibugs__>	 (CR) Gehel: [V: 2 C: 2] postgresql - replication lag needs to use --output=simple [puppet] - https://gerrit.wikimedia.org/r/347364 (owner: Gehel)
[13:04:36] <icinga-wm>	 RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:04:42] <twentyafterfour>	 ema: nope, I see what's up - it's search engines indexing us hardcore
[13:04:42] <wikibugs>	 (PS3) Hashar: Increase default thumb size to 250px at nowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/334787 (https://phabricator.wikimedia.org/T155892) (owner: Urbanecm)
[13:04:44] <wikibugs__>	 (PS2) Hashar: Create editprotected right on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/347330 (https://phabricator.wikimedia.org/T162577) (owner: Urbanecm)
[13:04:52] <wikibugs__>	 (PS5) Alexandros Kosiaris: wmflib: switch to puppetlabs_spec_helper/rake_tasks [puppet] - https://gerrit.wikimedia.org/r/332475 (owner: Hashar)
[13:04:56] <wikibugs>	 (CR) Alexandros Kosiaris: [V: 2 C: 2] wmflib: switch to puppetlabs_spec_helper/rake_tasks [puppet] - https://gerrit.wikimedia.org/r/332475 (owner: Hashar)
[13:05:16] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: (CRITICAL - Rep Delay is:, 4632.256525, Seconds,
[13:06:24] <hashar>	 phuedx: ahh so we have an extension for the popups written by wmf AND a gadget written by the community ? :D
[13:06:28] <wikibugs>	 (PS14) Alexandros Kosiaris: wmflib: basic spec for os_version() [puppet] - https://gerrit.wikimedia.org/r/178810 (owner: Hashar)
[13:06:36] <wikibugs__>	 (Merged) jenkins-bot: pagePreviews: Enable NavPopups gadget detection [mediawiki-config] - https://gerrit.wikimedia.org/r/347291 (https://phabricator.wikimedia.org/T160081) (owner: Phuedx)
[13:06:45] <wikibugs__>	 (CR) jenkins-bot: pagePreviews: Enable NavPopups gadget detection [mediawiki-config] - https://gerrit.wikimedia.org/r/347291 (https://phabricator.wikimedia.org/T160081) (owner: Phuedx)
[13:06:57] <phuedx>	 hashar: yup, the wmf was a rebuild of the community one, but to avoid any confusion... etc etc
[13:07:30] <paravoid>	 gehel: since you're doing postgres stuff, https://gerrit.wikimedia.org/r/#/c/345837/ (cc: akosiaris)
[13:07:32] <hashar>	 phuedx: I better understand the context now :D I guess with time the community ones will be integrated in Hovercards
[13:08:03] <hashar>	 phuedx: patch is on mwdebug1001 now
[13:08:06] <phuedx>	 ta
[13:08:08] <twentyafterfour>	 'Mozilla/5.0 (compatible; mbot/1.8; cust0002; +https://www.teorem.se/bot.html)'
[13:08:15] <gehel>	 paravoid: kool!
[13:08:18] <twentyafterfour>	 that's no search engine I know of
[13:08:40] <wikibugs__>	 (CR) Hashar: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/334787 (https://phabricator.wikimedia.org/T155892) (owner: Urbanecm)
[13:09:11] <wikibugs>	 (CR) Faidon Liambotis: [C: -1] "Agreed, let's hold until Apr 26th to merge this." [puppet] - https://gerrit.wikimedia.org/r/345838 (owner: Faidon Liambotis)
[13:09:30] <wikibugs__>	 (CR) Gehel: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/345837 (owner: Faidon Liambotis)
[13:09:33] <paladox>	 twentyafterfour their certificate is invalid
[13:09:56] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: (OK - Rep Delay is:, 1.705307, Seconds,
[13:10:07] <wikibugs>	 (Merged) jenkins-bot: Increase default thumb size to 250px at nowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/334787 (https://phabricator.wikimedia.org/T155892) (owner: Urbanecm)
[13:10:15] <wikibugs__>	 (CR) jenkins-bot: Increase default thumb size to 250px at nowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/334787 (https://phabricator.wikimedia.org/T155892) (owner: Urbanecm)
[13:10:17] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: (OK - Rep Delay is:, 17.643769, Seconds,
[13:10:20] <wikibugs__>	 Operations, Patch-For-Review, User-Elukey: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#3168247 (fgiunchedi)
[13:10:23] <wikibugs>	 Operations, Monitoring: certspotter on einsteinium has issues talking to external - https://phabricator.wikimedia.org/T162327#3168246 (fgiunchedi)
[13:10:46] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: (OK - Rep Delay is:, 50.151135, Seconds,
[13:10:47] <wikibugs__>	 Operations, Monitoring: certspotter on einsteinium has issues talking to external - https://phabricator.wikimedia.org/T162327#3159311 (fgiunchedi) p:Triage>Low
[13:12:37] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: (OK - Rep Delay is:, 0.0, Seconds,
[13:13:32] <twentyafterfour>	 paladox: yeah, it looks pretty sketchy
[13:14:01] <paladox>	 Yep. I didnt press continue just in case they downloaded a virus.
[13:14:12] <wikibugs>	 (PS2) Faidon Liambotis: postgresql: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345837
[13:14:14] <wikibugs__>	 (PS4) Faidon Liambotis: mediawiki: kill more precise references [puppet] - https://gerrit.wikimedia.org/r/345546
[13:14:17] <wikibugs>	 (PS4) Faidon Liambotis: apache: remove precise (and Apache 2.2) support [puppet] - https://gerrit.wikimedia.org/r/345548
[13:14:18] <wikibugs__>	 (PS2) Faidon Liambotis: typos: add (requires_os|os_version)\([^)]*[[:upper:]] [puppet] - https://gerrit.wikimedia.org/r/346677 (owner: Dzahn)
[13:14:21] <wikibugs>	 (PS5) Faidon Liambotis: releases: remove the precise suite [puppet] - https://gerrit.wikimedia.org/r/345838
[13:14:34] <ema>	 twentyafterfour: yeah I see the requests with User-Agent ~ teorem.se in esams
[13:15:06] <Zppix>	 twentyafterfour:  im on a vm right now want me to check it out?
[13:15:17] <phuedx>	 hashar: thumbs up
[13:15:27] <hashar>	 \^/
[13:15:36] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: (CRITICAL - Rep Delay is:, 11128.704343, Seconds,
[13:15:45] <wikibugs>	 (CR) Faidon Liambotis: [C: 2] postgresql: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345837 (owner: Faidon Liambotis)
[13:16:08] <Zppix>	 ema:  theres nothing there at teorem.se
[13:16:25] <logmsgbot>	 !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: pagePreviews: Enable NavPopups gadget detection - T160081 (duration: 00m 40s)
[13:16:28] <hashar>	 phuedx: it is live. Congratulations.
[13:16:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:33] <stashbot>	 T160081: Navigational popups and page previews appearing simultaneously  - https://phabricator.wikimedia.org/T160081
[13:16:50] <paravoid>	 gehel: ty, merged
[13:17:14] <hashar>	 nowiki default thumbsize is bumped to 250px (tested/works)  <-- gilles
[13:17:15] <gehel>	 nice to see this! At some point, the maps puppet code will look clean...
[13:17:22] <ema>	 Zppix: the User-Agent is 'Mozilla/5.0 (compatible; mbot/1.8; cust0002; +https://www.teorem.se/bot.html)', invalid cert as paladox mentioned
[13:17:36] <wikibugs>	 (CR) Hashar: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/347330 (https://phabricator.wikimedia.org/T162577) (owner: Urbanecm)
[13:17:36] <logmsgbot>	 !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Increase default thumb size to 250px at nowiki - T155892 (duration: 00m 45s)
[13:17:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:43] <stashbot>	 T155892: Increase default thumb size to 250px at nowiki - https://phabricator.wikimedia.org/T155892
[13:18:01] <paladox>	 I guess it is someone wanting to cause problems to wikimedias domain.
[13:18:02] <wikibugs>	 (CR) jenkins-bot: [V: -1] releases: remove the precise suite [puppet] - https://gerrit.wikimedia.org/r/345838 (owner: Faidon Liambotis)
[13:18:36] <Zppix>	 paladox:  i would stay off it i went to it and lets just say im still getting notifications from my vm (luckily it was my throw away nuclear crash course vm)
[13:18:44] <paravoid>	 oh my bad
[13:18:57] <wikibugs__>	 (Merged) jenkins-bot: Create editprotected right on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/347330 (https://phabricator.wikimedia.org/T162577) (owner: Urbanecm)
[13:19:06] <wikibugs__>	 (CR) jenkins-bot: Create editprotected right on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/347330 (https://phabricator.wikimedia.org/T162577) (owner: Urbanecm)
[13:19:09] <elukey>	 !log reboot analytics1040->1050 to pick up the new kernel
[13:19:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:24] <paladox>	 ok
[13:20:06] <wikibugs__>	 (PS6) Faidon Liambotis: releases: remove the precise suite [puppet] - https://gerrit.wikimedia.org/r/345838
[13:20:21] <paladox>	 Zppix what type of notifications are you getting?
[13:20:25] <paladox>	 Random ads
[13:20:38] <Zppix>	 paladox:  the antivirus i have on it is giving me malware trojan the whole 9
[13:20:46] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK
[13:20:48] <wikibugs__>	 (PS2) Volans: Logging: add multiple handlers to the logger [switchdc] - https://gerrit.wikimedia.org/r/347180 (https://phabricator.wikimedia.org/T160178)
[13:20:52] <wikibugs>	 (PS1) Volans: Logging: use the new handler, remove log_dry_run [switchdc] - https://gerrit.wikimedia.org/r/347369 (https://phabricator.wikimedia.org/T160178)
[13:20:52] <wikibugs__>	 (PS1) Volans: Logging: uniformed log levels and messages [switchdc] - https://gerrit.wikimedia.org/r/347370 (https://phabricator.wikimedia.org/T160178)
[13:20:55] <wikibugs__>	 (PS1) Volans: MySQL: better dry-run handling [switchdc] - https://gerrit.wikimedia.org/r/347371 (https://phabricator.wikimedia.org/T160178)
[13:20:56] <wikibugs__>	 (PS1) Volans: Tendril: fix an error in the exception raising [switchdc] - https://gerrit.wikimedia.org/r/347372 (https://phabricator.wikimedia.org/T160178)
[13:20:59] <wikibugs>	 (PS1) Volans: Disable puppet: fix title and docstring [switchdc] - https://gerrit.wikimedia.org/r/347373 (https://phabricator.wikimedia.org/T160178)
[13:21:01] <wikibugs>	 (PS2) Gehel: postgresql - use the check-postgres package for icinga checks [puppet] - https://gerrit.wikimedia.org/r/346963 (https://phabricator.wikimedia.org/T162345)
[13:21:02] <wikibugs__>	 (PS1) Volans: Formatting improvements [switchdc] - https://gerrit.wikimedia.org/r/347374 (https://phabricator.wikimedia.org/T160178)
[13:21:04] <paladox>	 Zppix lol, someone wants to infect phabricator.
[13:21:04] <wikibugs__>	 (PS1) Volans: MediaWiki: announce explicitly the read-only period [switchdc] - https://gerrit.wikimedia.org/r/347375 (https://phabricator.wikimedia.org/T160178)
[13:21:07] <wikibugs>	 (PS1) Volans: Menu: avoid double failing message [switchdc] - https://gerrit.wikimedia.org/r/347376 (https://phabricator.wikimedia.org/T160178)
[13:21:09] <wikibugs>	 (PS1) Jcrespo: Repool db1055 after maintenance with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/347377 (https://phabricator.wikimedia.org/T159319)
[13:21:11] <wikibugs__>	 (PS5) Faidon Liambotis: mediawiki: kill more precise references [puppet] - https://gerrit.wikimedia.org/r/345546
[13:21:15] <paladox>	 ^^ thats alot of changes
[13:21:34] <wikibugs__>	 (CR) Faidon Liambotis: [C: 2] mediawiki: kill more precise references [puppet] - https://gerrit.wikimedia.org/r/345546 (owner: Faidon Liambotis)
[13:21:42] <Zppix>	 paladox:  i mean we just have it backupped so they can try all they want, we can just switch to cofw's backup
[13:22:48] <paladox>	 Zppix they are somehow hitting the phabricator domain, not the backend, as it is inaccessible to the outside world. Though someone may be trying to exploit phab.
[13:23:10] <Zppix>	 paladox:  cant we just block their traffic?
[13:23:17] <paladox>	 Yes
[13:23:37] <Zppix>	 I say just hit em with a good ol rangeblock internally
[13:24:40] <wikibugs>	 (PS3) Gehel: postgresql - use the check-postgres package for icinga checks [puppet] - https://gerrit.wikimedia.org/r/346963 (https://phabricator.wikimedia.org/T162345)
[13:25:23] <wikibugs>	 (PS1) Muehlenhoff: Allow silencing a debconf query [puppet] - https://gerrit.wikimedia.org/r/347378
[13:25:58] <wikibugs>	 (PS2) Hashar: Set wgTranslateNumerals false on bhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/346965 (https://phabricator.wikimedia.org/T160098) (owner: Sfic)
[13:26:11] <wikibugs__>	 (CR) Hashar: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/346965 (https://phabricator.wikimedia.org/T160098) (owner: Sfic)
[13:26:20] <logmsgbot>	 !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Create editprotected right on ptwikinews - T162577 (duration: 00m 40s)
[13:26:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:27] <stashbot>	 T162577: Create "editprotected" right on Portuguese Wikinews - https://phabricator.wikimedia.org/T162577
[13:27:29] <wikibugs>	 (Merged) jenkins-bot: Set wgTranslateNumerals false on bhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/346965 (https://phabricator.wikimedia.org/T160098) (owner: Sfic)
[13:27:38] <wikibugs>	 (CR) jenkins-bot: Set wgTranslateNumerals false on bhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/346965 (https://phabricator.wikimedia.org/T160098) (owner: Sfic)
[13:30:07] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[13:30:10] <wikibugs>	 Operations, Traffic, media-storage, Patch-For-Review, User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3168304 (Aklapper) I fail to load https://upload.wikimedia.org/wikipe...
[13:30:23] <logmsgbot>	 !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Set wgTranslateNumerals false on bhwiki - T160098 (duration: 00m 40s)
[13:30:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:30] <stashbot>	 T160098: Change default numerals on Bhojpuri Wikipedia from Devanagari to Arabic numerals - https://phabricator.wikimedia.org/T160098
[13:31:18] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "Small error in Outputhandler, but lgtm otherwise." (1 comment) [switchdc] - https://gerrit.wikimedia.org/r/347180 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[13:32:02] <wikibugs>	 (CR) Alexandros Kosiaris: [C: 2] apache: remove precise (and Apache 2.2) support [puppet] - https://gerrit.wikimedia.org/r/345548 (owner: Faidon Liambotis)
[13:32:06] <wikibugs__>	 (PS5) Alexandros Kosiaris: apache: remove precise (and Apache 2.2) support [puppet] - https://gerrit.wikimedia.org/r/345548 (owner: Faidon Liambotis)
[13:32:09] <wikibugs__>	 (CR) Alexandros Kosiaris: [V: 2 C: 2] apache: remove precise (and Apache 2.2) support [puppet] - https://gerrit.wikimedia.org/r/345548 (owner: Faidon Liambotis)
[13:32:50] <wikibugs__>	 Operations, Commons, media-storage: Commons File:Assemblea_Costituente_1946_(2).svg missing after file move - https://phabricator.wikimedia.org/T161476#3132721 (fgiunchedi) I believe this is related to T111838 and similar bugs, I'm tentatively merging it there
[13:32:55] <wikibugs>	 (PS3) Faidon Liambotis: typos: add (requires_os|os_version)\([^)]*[[:upper:]] [puppet] - https://gerrit.wikimedia.org/r/346677 (owner: Dzahn)
[13:33:01] <wikibugs__>	 (CR) Faidon Liambotis: [C: 2] typos: add (requires_os|os_version)\([^)]*[[:upper:]] [puppet] - https://gerrit.wikimedia.org/r/346677 (owner: Dzahn)
[13:33:13] <wikibugs__>	 Operations, Commons, media-storage, MW-1.27-release (WMF-deploy-2015-11-03_(1.27.0-wmf.5)), MW-1.27-release-notes: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#1617529 (fgiunchedi)
[13:33:16] <wikibugs>	 Operations, Commons, media-storage: Commons File:Assemblea_Costituente_1946_(2).svg missing after file move - https://phabricator.wikimedia.org/T161476#3168333 (fgiunchedi)
[13:33:34] <wikibugs__>	 (PS3) Volans: Logging: add multiple handlers to the logger [switchdc] - https://gerrit.wikimedia.org/r/347180 (https://phabricator.wikimedia.org/T160178)
[13:33:43] <wikibugs>	 (CR) Volans: "Fixed" (1 comment) [switchdc] - https://gerrit.wikimedia.org/r/347180 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[13:35:14] <wikibugs__>	 Operations, Traffic, media-storage, Patch-For-Review, User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3168334 (ema) >>! In T162035#3168304, @Aklapper wrote: > I fail to lo...
[13:39:56] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: (OK - Rep Delay is:, 0.0, Seconds,
[13:40:06] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: (CRITICAL - Rep Delay is:, 1811.714751, Seconds,
[13:40:16] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: (CRITICAL - Rep Delay is:, 1817.467036, Seconds,
[13:40:59] <wikibugs>	 (PS2) Volans: Logging: use the new handler, remove log_dry_run [switchdc] - https://gerrit.wikimedia.org/r/347369 (https://phabricator.wikimedia.org/T160178)
[13:41:20] <wikibugs__>	 (CR) Giuseppe Lavagetto: [C: -1] "A couple of minor things to change, but LGTM otherwise." (2 comments) [switchdc] - https://gerrit.wikimedia.org/r/347369 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[13:41:47] <moritzm>	 !log upgrading wtp1002-wtp1005 to Linux 4.9
[13:41:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:47] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: (CRITICAL - Rep Delay is:, 1972.674695, Seconds,
[13:43:01] <wikibugs__>	 (CR) Volans: "Fixed" (2 comments) [switchdc] - https://gerrit.wikimedia.org/r/347369 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[13:44:06] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: (OK - Rep Delay is:, 0.0, Seconds,
[13:44:32] <wikibugs>	 (CR) Jcrespo: [C: 2] Repool db1055 after maintenance with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/347377 (https://phabricator.wikimedia.org/T159319) (owner: Jcrespo)
[13:44:52] <jynus>	 I am about to deploy gerrit:347377
[13:45:35] <marostegui>	 go for it!
[13:45:53] <jynus>	 Needs Verified Label
[13:46:47] <wikibugs>	 Operations, Patch-For-Review, User-Elukey, Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3168342 (elukey) Thanks a lot! The errors logged at the moment are clouded by https:...
[13:46:48] <wikibugs__>	 (Merged) jenkins-bot: Repool db1055 after maintenance with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/347377 (https://phabricator.wikimedia.org/T159319) (owner: Jcrespo)
[13:46:59] <wikibugs>	 (CR) jenkins-bot: Repool db1055 after maintenance with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/347377 (https://phabricator.wikimedia.org/T159319) (owner: Jcrespo)
[13:47:06] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: (CRITICAL - Rep Delay is:, 2231.640589, Seconds,
[13:49:16] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1055 after maintenance with low weight (duration: 00m 38s)
[13:49:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:46] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[13:50:07] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: (OK - Rep Delay is:, 0.0, Seconds,
[13:53:06] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps-test2004 is CRITICAL: (CRITICAL - Rep Delay is:, 1872.688149, Seconds,
[13:53:06] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps-test2003 is CRITICAL: (CRITICAL - Rep Delay is:, 1875.298713, Seconds,
[13:53:07] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: (CRITICAL - Rep Delay is:, 2591.594838, Seconds,
[13:53:16] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: (OK - Rep Delay is:, 0.0, Seconds,
[13:53:47] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: (OK - Rep Delay is:, 0.0, Seconds,
[13:53:58] <wikibugs>	 (CR) Gehel: "Data from Grafana (https://grafana-admin.wikimedia.org/dashboard/db/maps-performances?orgId=1&from=now-3h&to=now) indicates that a 1Mb war" [puppet] - https://gerrit.wikimedia.org/r/346963 (https://phabricator.wikimedia.org/T162345) (owner: Gehel)
[13:54:42] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "A few nitpicks, and a couple of details to fix." (4 comments) [switchdc] - https://gerrit.wikimedia.org/r/347370 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[13:55:06] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: (OK - Rep Delay is:, 0.0, Seconds,
[13:55:17] <wikibugs>	 Operations, Traffic: cache_upload: uncompressed images with Content-Encoding: gzip cause content decoding issues - https://phabricator.wikimedia.org/T148830#3168355 (ema) Resolved>Open Reopening, another instance of this bug has been reported in T162035#3168304.
[13:56:06] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps-test2004 is OK: (OK - Rep Delay is:, 0.0, Seconds,
[13:56:11] <wikibugs__>	 (CR) Giuseppe Lavagetto: [C: 1] "LGTM" [switchdc] - https://gerrit.wikimedia.org/r/347371 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[13:56:16] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: (CRITICAL - Rep Delay is:, 2777.50309, Seconds,
[13:56:46] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: (CRITICAL - Rep Delay is:, 2812.911793, Seconds,
[13:57:02] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 1] Tendril: fix an error in the exception raising [switchdc] - https://gerrit.wikimedia.org/r/347372 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[13:57:19] <bblack>	 gehel: I've seen a lot of those maps pg repl alerts flapping lately, what's up with it? (bad check?)
[13:57:26] <wikibugs__>	 (CR) Giuseppe Lavagetto: [C: 1] Disable puppet: fix title and docstring [switchdc] - https://gerrit.wikimedia.org/r/347373 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[13:57:52] <gehel>	 bblack: yep, I have a change of check coming up, testing it right now...
[13:58:04] <bblack>	 nice :)
[13:58:06] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps-test2003 is OK: (OK - Rep Delay is:, 0.0, Seconds,
[13:58:07] <Zppix>	 gotta love bots gehel 
[13:58:52] <gehel>	 bblack: since the maps servers are updated once a day, they see almost no write traffic. And when a vacuum takes place, suddenly there is a large change in lag for a short time.
[13:59:06] <wikibugs__>	 (CR) Giuseppe Lavagetto: [C: 1] Formatting improvements [switchdc] - https://gerrit.wikimedia.org/r/347374 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[13:59:06] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: (CRITICAL - Rep Delay is:, 2232.550606, Seconds,
[13:59:06] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps-test2004 is CRITICAL: (CRITICAL - Rep Delay is:, 2232.609279, Seconds,
[13:59:06] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[13:59:06] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: (CRITICAL - Rep Delay is:, 2951.668787, Seconds,
[13:59:11] <gehel>	 solution is to stop looking at lag in time, but start looking at lag in bytes
[13:59:38] <Zppix>	 gehel:  or not for lag during the times of the write?
[13:59:55] <wikibugs>	 (CR) Volans: "Replies inline" (4 comments) [switchdc] - https://gerrit.wikimedia.org/r/347370 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[14:00:03] <wikibugs__>	 (CR) Giuseppe Lavagetto: [C: 1] "great!" [switchdc] - https://gerrit.wikimedia.org/r/347375 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[14:00:06] <gehel>	 Zppix: not sure I understood...
[14:00:06] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps-test2002 is OK: (OK - Rep Delay is:, 0.0, Seconds,
[14:00:06] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps-test2004 is OK: (OK - Rep Delay is:, 0.0, Seconds,
[14:00:18] <Zppix>	 gehel:  you could also not check for lag while the write occurs
[14:00:51] <gehel>	 Zppix: you mean disable the check during write operations?
[14:00:57] <Zppix>	 yes gehel 
[14:00:59] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 1] Menu: avoid double failing message [switchdc] - https://gerrit.wikimedia.org/r/347376 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[14:01:24] <gehel>	 that seems more complex, especially since some writes (like the vacuum) are started more or less automagically by postgres itself
[14:02:08] <Zppix>	 gehel: they arent on a timer?
[14:02:37] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] Logging: add multiple handlers to the logger [switchdc] - https://gerrit.wikimedia.org/r/347180 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[14:02:39] <wikibugs__>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1028" [mediawiki-config] - https://gerrit.wikimedia.org/r/347382
[14:02:43] <wikibugs>	 (PS2) Marostegui: Revert "db-eqiad.php: Depool db1028" [mediawiki-config] - https://gerrit.wikimedia.org/r/347382
[14:03:16] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: (OK - Rep Delay is:, 0.0, Seconds,
[14:03:23] <wikibugs__>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Logging: use the new handler, remove log_dry_run [switchdc] - 10https://gerrit.wikimedia.org/r/347369 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:03:28] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: Logging: use the new handler, remove log_dry_run [switchdc] - 10https://gerrit.wikimedia.org/r/347369 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:03:34] <gehel>	 Zppix: nope, we use auto vacuum (https://www.postgresql.org/docs/9.4/static/runtime-config-autovacuum.html)
[14:03:35] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Logging: use the new handler, remove log_dry_run [switchdc] - 10https://gerrit.wikimedia.org/r/347369 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:04:20] <Zppix>	 gehel: oh...
[14:04:55] <gehel>	 in any case, I'd rather check a stable metric than disable checks in some conditions :)
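Editor's sketch of the "lag in bytes" idea discussed above. It assumes the usual textual form of a PostgreSQL WAL position (two hexadecimal numbers separated by a slash, the first being the high 32 bits) and computes how many bytes a replica is behind its master. The function names and the warning/critical thresholds are hypothetical and chosen only for illustration; this is not the check that was deployed.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a WAL position such as '16/B374D848' to an absolute byte offset."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(master_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica still has to replay (never negative)."""
    return max(0, lsn_to_int(master_lsn) - lsn_to_int(replica_lsn))


def classify(lag_bytes: int,
             warning: int = 1_048_576,      # hypothetical threshold: 1 MiB
             critical: int = 16_777_216) -> str:  # hypothetical threshold: 16 MiB
    """Map a byte lag to a Nagios-style state name."""
    if lag_bytes >= critical:
        return "CRITICAL"
    if lag_bytes >= warning:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    lag = replication_lag_bytes("0/3000060", "0/3000000")
    print(classify(lag), lag)
```

Unlike a time-based measure, this value does not depend on when the master last wrote: if no new WAL is produced, the two positions stay equal and the lag stays at zero.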
[14:05:09] <elukey>	 !log reimage analytics1001 to Debian Jessie
[14:05:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:37] <gehel>	 akosiaris: if you have a minute to check https://gerrit.wikimedia.org/r/#/c/346963/ (trying to fix those postgresql alerts...)
[14:06:56] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: (OK - Rep Delay is:, 0.0, Seconds,
[14:08:56] <wikibugs__>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] Logging: uniformed log levels and messages (033 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/347370 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:09:06] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: (OK - Rep Delay is:, 0.0, Seconds,
[14:09:13] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Logging: uniformed log levels and messages [switchdc] - 10https://gerrit.wikimedia.org/r/347370 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:09:18] <wikibugs__>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1028" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347382 (owner: 10Marostegui)
[14:09:20] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Logging: uniformed log levels and messages [switchdc] - 10https://gerrit.wikimedia.org/r/347370 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:10:16] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: (CRITICAL - Rep Delay is:, 3617.546789, Seconds,
[14:10:57] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: (CRITICAL - Rep Delay is:, 3662.675495, Seconds,
[14:11:24] <wikibugs__>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1028" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347382 (owner: 10Marostegui)
[14:11:33] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1028" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347382 (owner: 10Marostegui)
[14:12:15] <bblack>	 gehel: is maps pg using autovacuum or some manual thing? it's been a long time since I used pg in prod myself, but I was thinking if the data updates are once a day, you could ensure vacuums are manually executed daily (e.g. from cron) so that they're on opposite timing from the data updates, too.
[14:12:23] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1028 - T160390 (duration: 00m 38s)
[14:12:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:30] <stashbot>	 T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390
[14:12:45] <wikibugs__>	 (03PS2) 10Giuseppe Lavagetto: MySQL: better dry-run handling [switchdc] - 10https://gerrit.wikimedia.org/r/347371 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:12:49] <Zppix>	 bblack:  autov
[14:13:34] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] MySQL: better dry-run handling [switchdc] - 10https://gerrit.wikimedia.org/r/347371 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:13:46] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Tendril: fix an error in the exception raising [switchdc] - 10https://gerrit.wikimedia.org/r/347372 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:14:56] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Tendril: fix an error in the exception raising [switchdc] - 10https://gerrit.wikimedia.org/r/347372 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:14:58] <gehel>	 bblack: it is using autovacuum.
[14:15:16] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: (OK - Rep Delay is:, 0.0, Seconds,
[14:15:47] <gehel>	 bblack: yep, daily vacuum might make sense! But we are thinking of increasing the OSM replication frequency, so that's probably the way to go...
[14:16:26] <wikibugs__>	 (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Tendril: fix an error in the exception raising [switchdc] - 10https://gerrit.wikimedia.org/r/347372 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:16:29] <bblack>	 yeah if the replication frequency goes up and each replication is a smaller chunk of writes, eventually down that path you reach a point where you're better off with autovac I assume
[14:17:06] <gehel>	 bblack: totally unrelated, but wdqs codfw is now up to date, ready for active active...
[14:17:39] <bblack>	 https://gerrit.wikimedia.org/r/#/c/346543/ ?
[14:17:50] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Disable puppet: fix title and docstring [switchdc] - 10https://gerrit.wikimedia.org/r/347373 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:17:54] <wikibugs__>	 (03PS2) 10Giuseppe Lavagetto: Disable puppet: fix title and docstring [switchdc] - 10https://gerrit.wikimedia.org/r/347373 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:17:55] <bblack>	 whenever you're really ready basically.  I've already tested the a/a basics using noc and config-master services
[14:17:58] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Disable puppet: fix title and docstring [switchdc] - 10https://gerrit.wikimedia.org/r/347373 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:17:58] <gehel>	 bblack: I'm planning to add that to the deployment window this evening (CEST). Anything particular I should check before merging that one?
[14:18:05] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Formatting improvements [switchdc] - 10https://gerrit.wikimedia.org/r/347374 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:18:07] <bblack>	 no, nothing in particular
[14:18:34] <gehel>	 ok, then we'll see this evening... I'll ping you if things don't look good :P
[14:18:40] <bblack>	 don't even have to force that change or use any special timing
[14:18:49] <bblack>	 just merge it up and let it roll out on its own
[14:19:16] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: (CRITICAL - Rep Delay is:, 4157.537083, Seconds,
[14:20:43] <wikibugs>	 06Operations, 10Traffic, 10netops: knams equipment move - https://phabricator.wikimedia.org/T162601#3168403 (10ayounsi) After discussion with @BBlack  As knams going down will not impact connectivity between esams and eqiad, and esams has enough transit capacity to take over knams transits, the following pla...
[14:21:06] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: (CRITICAL - Rep Delay is:, 4271.782465, Seconds,
[14:21:34] <wikibugs__>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Formatting improvements [switchdc] - 10https://gerrit.wikimedia.org/r/347374 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:21:45] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Formatting improvements [switchdc] - 10https://gerrit.wikimedia.org/r/347374 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:22:07] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: MediaWiki: announce explicitly the read-only period [switchdc] - 10https://gerrit.wikimedia.org/r/347375 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:23:03] <wikibugs__>	 (03PS2) 10Urbanecm: Give sysops ability to promote users to eliminator at fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347166 (https://phabricator.wikimedia.org/T162396)
[14:23:18] <wikibugs>	 (03PS1) 1020after4: Phab: add 90.227.4.251 to ban list [puppet] - 10https://gerrit.wikimedia.org/r/347384
[14:23:23] <wikibugs__>	 (03PS4) 10Gehel: wdqs: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/346543 (https://phabricator.wikimedia.org/T162111)
[14:23:27] <wikibugs>	 (03PS3) 10Urbanecm: Give sysops ability to promote users to eliminator at fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347166 (https://phabricator.wikimedia.org/T162396)
[14:23:40] <wikibugs__>	 (03CR) 10Giuseppe Lavagetto: [C: 032] MediaWiki: announce explicitly the read-only period [switchdc] - 10https://gerrit.wikimedia.org/r/347375 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:24:10] <wikibugs>	 (03PS1) 10Jcrespo: Repool db1055 with full weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347385 (https://phabricator.wikimedia.org/T159319)
[14:24:27] <wikibugs__>	 (03CR) 10Alexandros Kosiaris: [C: 031] "LGTM, but keep in mind that there is a window during which there will be alerts as the old script is being absent but the icinga configura" [puppet] - 10https://gerrit.wikimedia.org/r/346963 (https://phabricator.wikimedia.org/T162345) (owner: 10Gehel)
[14:24:42] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Menu: avoid double failing message [switchdc] - 10https://gerrit.wikimedia.org/r/347376 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:25:14] <wikibugs__>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Phab: add 90.227.4.251 to ban list [puppet] - 10https://gerrit.wikimedia.org/r/347384 (owner: 1020after4)
[14:25:21] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] "Waiting for buffer pool to stabilize: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347385 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo)
[14:25:46] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: (CRITICAL - Rep Delay is:, 2135.104163, Seconds,
[14:25:48] <wikibugs__>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347386 (https://phabricator.wikimedia.org/T142725)
[14:25:56] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: (CRITICAL - Rep Delay is:, 2144.805083, Seconds,
[14:25:59] <marostegui>	 jynus: can I deploy ^ (after rebase?)
[14:26:15] <wikibugs__>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Menu: avoid double failing message [switchdc] - 10https://gerrit.wikimedia.org/r/347376 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:26:37] <wikibugs>	 (03PS4) 10Gehel: postgresql - use the check-postgres package for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/346963 (https://phabricator.wikimedia.org/T162345)
[14:26:41] <marostegui>	 Oh, I thought you committed yours, anyways, is it fine for you to deploy? I am asking because you are also playing with s1, so just in case
[14:26:46] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: (CRITICAL - Rep Delay is:, 2195.24401, Seconds,
[14:27:07] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: (OK - Rep Delay is:, 0.0, Seconds,
[14:27:33] <wikibugs>	 (03PS2) 10Marostegui: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347386 (https://phabricator.wikimedia.org/T132416)
[14:27:46] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: (OK - Rep Delay is:, 0.0, Seconds,
[14:29:06] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK
[14:29:26] <wikibugs>	 (03CR) 10Gehel: [C: 032] postgresql - use the check-postgres package for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/346963 (https://phabricator.wikimedia.org/T162345) (owner: 10Gehel)
[14:29:46] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1004 is OK: (OK - Rep Delay is:, 0.0, Seconds,
[14:31:12] <gehel>	 !log deploying new postgresql replication check, might generate a few icinga alerts - T162345
[14:31:18] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3016
[14:31:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:19] <stashbot>	 T162345: Reduce number of false positive alerts on postgresql lag for maps - https://phabricator.wikimedia.org/T162345
[14:31:56] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1528
[14:32:06] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 207216
[14:39:00] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Make backup::set effectively a virtual resource [puppet] - 10https://gerrit.wikimedia.org/r/347388
[14:39:07] <logmsgbot>	 !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t00_disable_puppet(eqiad, codfw) Disabling puppet on MediaWiki jobrunners and videoscalers
[14:39:12] <logmsgbot>	 !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t00_disable_puppet(eqiad, codfw) Successfully completed
[14:39:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:27] <wikibugs__>	 (03CR) 10Marostegui: [C: 032] "for heads up" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347386 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui)
[14:41:06] <icinga-wm>	 PROBLEM - puppet last run on db1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:42:10] <jynus>	 spike of high 500, phabricator?
[14:42:52] <wikibugs>	 06Operations, 10media-storage: Swift version and distro upgrade - https://phabricator.wikimedia.org/T162609#3168478 (10fgiunchedi)
[14:42:56] <paladox>	 jynus known
[14:43:03] <paladox>	 twentyafterfour is aware
[14:43:22] <wikibugs__>	 06Operations, 10media-storage: swift backend machines load spike: cause and remediation - https://phabricator.wikimedia.org/T84385#3168497 (10fgiunchedi)
[14:43:25] <marostegui>	 it just had another spike in queries so looks related
[14:43:27] <wikibugs>	 06Operations, 10media-storage, 13Patch-For-Review: swift upgrade plans: jessie and swift 2.x - https://phabricator.wikimedia.org/T117972#3168493 (10fgiunchedi) 05Open>03Invalid Superseded by T162609
[14:43:46] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[14:44:46] <icinga-wm>	 PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[14:45:02] <logmsgbot>	 !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t09_start_maintenance(codfw, eqiad) Start MediaWiki maintenance in the new master DC
[14:45:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:09] <logmsgbot>	 !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t09_start_maintenance(codfw, eqiad) Failed to execute
[14:45:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:33] <logmsgbot>	 !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t09_restore_ttl(codfw, eqiad) Restore the TTL of all the MediaWiki discovery records
[14:45:35] <ema>	 !log upgrade cache_maps to linux 4.9 T162029
[14:45:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:46] <stashbot>	 T162029: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029
[14:45:46] <logmsgbot>	 !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t09_restore_ttl(codfw, eqiad) Successfully completed
[14:45:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:54] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347386 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui)
[14:46:57] <logmsgbot>	 !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t09_start_maintenance(codfw, eqiad) Start MediaWiki maintenance in the new master DC
[14:47:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:03] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347386 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui)
[14:47:05] <logmsgbot>	 !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t09_start_maintenance(codfw, eqiad) Failed to execute
[14:47:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:04] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1073 - T132416 (duration: 00m 39s)
[14:48:08] <marostegui>	 !log Deploy alter table enwiki.revision db1073 - T132416
[14:48:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:11] <stashbot>	 T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416
[14:48:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:46] <icinga-wm>	 RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:51:01] <wikibugs>	 06Operations, 06Office-IT, 07LDAP: Make disabled accounts visible in the corp mirror LDAP replica - https://phabricator.wikimedia.org/T160158#3168517 (10MoritzMuehlenhoff) It seems my test set was a case of really bad luck. The two disabled users accounts I was testing against, in fact still show up under cn...
[14:52:22] <logmsgbot>	 !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t00_disable_puppet(eqiad, codfw) Disabling puppet on MediaWiki jobrunners and videoscalers
[14:52:27] <logmsgbot>	 !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t00_disable_puppet(eqiad, codfw) Successfully completed
[14:52:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:46] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:54:47] <wikibugs>	 (03PS1) 10Volans: Logging: fix and simplify the stderr logging [switchdc] - 10https://gerrit.wikimedia.org/r/347390 (https://phabricator.wikimedia.org/T160178)
[14:54:49] <wikibugs__>	 (03CR) 10Alexandros Kosiaris: Add 3d2png deploy repo to image scalers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur)
[14:55:58] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Logging: fix and simplify the stderr logging [switchdc] - 10https://gerrit.wikimedia.org/r/347390 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[14:57:45] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:58:16] <moritzm>	 !log upgrading wtp1006-wtp1009 to Linux 4.9
[14:58:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:05] <icinga-wm>	 PROBLEM - puppet last run on nihal is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:01:25] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1037 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:26] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:27] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1056 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:28] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1048 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:31] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:31] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:32] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:33] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:34] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:36] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1041 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:37] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1043 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:38] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:40] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:40] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:42] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:42] <andrewbogott>	 !log disabling puppet on labcontrol1001 to raise log levels
[15:01:43] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:44] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:45] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:47] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:48] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1036 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:49] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:50] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:51] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:02:02] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[15:02:12] <icinga-wm>	 PROBLEM - Disk space on restbase-dev1001 is CRITICAL: DISK CRITICAL - free space: / 450 MB (1% inode=95%)
[15:02:12] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:03:12] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.0.36:9042 on restbase-dev1001 is CRITICAL: connect to address 10.64.0.36 and port 9042: Connection refused
[15:03:12] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.0.37:9042 on restbase-dev1001 is CRITICAL: connect to address 10.64.0.37 and port 9042: Connection refused
[15:03:13] <icinga-wm>	 PROBLEM - Check systemd state on restbase-dev1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:03:13] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.0.37:7001 on restbase-dev1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[15:03:14] <wikibugs__>	 06Operations, 10media-storage: Swift version and distro upgrade - https://phabricator.wikimedia.org/T162609#3168565 (10faidon)
[15:03:22] <icinga-wm>	 PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:03:32] <icinga-wm>	 PROBLEM - cassandra-b service on restbase-dev1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[15:03:32] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.0.36:7001 on restbase-dev1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[15:03:33] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1045 is OK: OK: YARN NodeManager analytics1045.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:34] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1048 is OK: OK: YARN NodeManager analytics1048.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:35] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1046 is OK: OK: YARN NodeManager analytics1046.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:36] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1056 is OK: OK: YARN NodeManager analytics1056.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:37] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1037 is OK: OK: YARN NodeManager analytics1037.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:39] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1040 is OK: OK: YARN NodeManager analytics1040.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:40] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:41] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1042 is OK: OK: YARN NodeManager analytics1042.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:42] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1043 is OK: OK: YARN NodeManager analytics1043.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:43] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1044 is OK: OK: YARN NodeManager analytics1044.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:45] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1047 is OK: OK: YARN NodeManager analytics1047.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:46] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1054 is OK: OK: YARN NodeManager analytics1054.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:47] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:48] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1038 is OK: OK: YARN NodeManager analytics1038.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:49] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1041 is OK: OK: YARN NodeManager analytics1041.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:51] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:53] <elukey>	 we are checking
[15:03:57] <elukey>	 not sure what happened
[15:04:01] <icinga-wm>	 PROBLEM - cassandra-a service on restbase-dev1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[15:04:01] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1049 is OK: OK: YARN NodeManager analytics1049.eqiad.wmnet:8041 Node-State: RUNNING
[15:04:02] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1055 is OK: OK: YARN NodeManager analytics1055.eqiad.wmnet:8041 Node-State: RUNNING
[15:04:04] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1051 is OK: OK: YARN NodeManager analytics1051.eqiad.wmnet:8041 Node-State: RUNNING
[15:04:05] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1057 is OK: OK: YARN NodeManager analytics1057.eqiad.wmnet:8041 Node-State: RUNNING
[15:04:06] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1053 is OK: OK: YARN NodeManager analytics1053.eqiad.wmnet:8041 Node-State: RUNNING
[15:04:07] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1052 is OK: OK: YARN NodeManager analytics1052.eqiad.wmnet:8041 Node-State: RUNNING
[15:04:08] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1036 is OK: OK: YARN NodeManager analytics1036.eqiad.wmnet:8041 Node-State: RUNNING
[15:04:10] <elukey>	 analytics1002 was the master
[15:04:21] <icinga-wm>	 RECOVERY - Check systemd state on restbase-dev1001 is OK: OK - running: The system is fully operational
[15:04:22] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1050 is OK: OK: YARN NodeManager analytics1050.eqiad.wmnet:8041 Node-State: RUNNING
[15:04:28] <wikibugs__>	 (03CR) 10Alexandros Kosiaris: [C: 032] RESTBase: Migrate to Scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/346248 (https://phabricator.wikimedia.org/T116335) (owner: 10Mobrovac)
[15:04:31] <icinga-wm>	 RECOVERY - cassandra-b service on restbase-dev1001 is OK: OK - cassandra-b is active
[15:04:32] <wikibugs>	 06Operations, 10media-storage: Swift version and distro upgrade - https://phabricator.wikimedia.org/T162609#3168478 (10faidon) The plan above totally makes sense to me and sounds like the path of the least amount of work with the maximum amount of consistency.  I'd add a final step of upgrading the jessie syst...
[15:04:34] <wikibugs__>	 (03PS5) 10Alexandros Kosiaris: RESTBase: Migrate to Scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/346248 (https://phabricator.wikimedia.org/T116335) (owner: 10Mobrovac)
[15:04:42] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] RESTBase: Migrate to Scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/346248 (https://phabricator.wikimedia.org/T116335) (owner: 10Mobrovac)
[15:04:56] <ottomata>	 looks like just the icinga checks failed temporarily?
[15:05:01] <icinga-wm>	 RECOVERY - cassandra-a service on restbase-dev1001 is OK: OK - cassandra-a is active
[15:05:06] <mobrovac>	 !log restbase disabling puppet for upgrade to scap3 deploys
[15:05:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:21] <icinga-wm>	 RECOVERY - Disk space on restbase-dev1001 is OK: DISK OK
[15:06:21] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.0.37:9042 on restbase-dev1001 is OK: TCP OK - 0.001 second response time on 10.64.0.37 port 9042
[15:06:22] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.0.37:7001 on restbase-dev1001 is OK: SSL OK - Certificate restbase-dev1001-b valid until 2018-01-05 22:53:03 +0000 (expires in 270 days)
[15:06:36] <elukey>	 any networking issue causing this?
[15:06:41] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.0.36:7001 on restbase-dev1001 is OK: SSL OK - Certificate restbase-dev1001-a valid until 2018-01-05 22:53:02 +0000 (expires in 270 days)
[15:06:47] <wikibugs>	 (03Abandoned) 10Faidon Liambotis: Disable RESTBase config.yaml deploys in puppet [puppet] - 10https://gerrit.wikimedia.org/r/229306 (https://phabricator.wikimedia.org/T107532) (owner: 10GWicke)
[15:07:08] <paravoid>	 elukey: causing what exactly?
[15:07:11] <wikibugs>	 (PS1) Volans: Fix logging/dry-run setup orders [switchdc] - https://gerrit.wikimedia.org/r/347393 (https://phabricator.wikimedia.org/T160178)
[15:07:21] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.0.36:9042 on restbase-dev1001 is OK: TCP OK - 0.001 second response time on 10.64.0.36 port 9042
[15:07:29] <elukey>	 paravoid: the above alarms of yarn (hadoop workers)
[15:07:41] <wikibugs>	 (PS2) Jcrespo: Repool db1055 with full weight [mediawiki-config] - https://gerrit.wikimedia.org/r/347385 (https://phabricator.wikimedia.org/T159319)
[15:07:51] <icinga-wm>	 RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[15:07:56] <paravoid>	 yeah, but what was the underlying cause for those alerts?
[15:08:04] <elukey>	 well will double check, spot checking on hosts seems like the daemon stayed up
[15:08:10] <paravoid>	 communication lost and between which hosts for example
[15:08:22] <icinga-wm>	 PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 seconds ago with 1 failures. Failed resources (up to 3 shown): Scap_source[restbase/deploy]
[15:08:41] <icinga-wm>	 RECOVERY - puppet last run on db1048 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[15:09:00] <elukey>	 paravoid: einsteinium and all the hadoop workers (so analytics1049.eqiad.wmnet for example)
[15:09:25] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 1] Fix logging/dry-run setup orders [switchdc] - https://gerrit.wikimedia.org/r/347393 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[15:10:00] <wikibugs__>	 (CR) Jcrespo: [C: 2] Repool db1055 with full weight [mediawiki-config] - https://gerrit.wikimedia.org/r/347385 (https://phabricator.wikimedia.org/T159319) (owner: Jcrespo)
[15:10:10] <wikibugs__>	 (PS1) Ladsgroup: Use redis lock manager for dispatching in all production repo instances [mediawiki-config] - https://gerrit.wikimedia.org/r/347395 (https://phabricator.wikimedia.org/T159826)
[15:10:16] <elukey>	 or analytics1002 <-> analytics1049
[15:10:17] <wikibugs>	 (CR) Volans: [C: 2] Fix logging/dry-run setup orders [switchdc] - https://gerrit.wikimedia.org/r/347393 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[15:10:24] <elukey>	 (current master - worker)
[15:10:29] <elukey>	 checking logs in the meantime
[15:11:22] <icinga-wm>	 RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[15:11:45] <paravoid>	 analytics1067 & analytics1068 are flapping
[15:12:00] <paravoid>	 but that has been the case for hours
[15:12:04] <wikibugs__>	 (Merged) jenkins-bot: Repool db1055 with full weight [mediawiki-config] - https://gerrit.wikimedia.org/r/347385 (https://phabricator.wikimedia.org/T159319) (owner: Jcrespo)
[15:12:14] <wikibugs>	 (CR) jenkins-bot: Repool db1055 with full weight [mediawiki-config] - https://gerrit.wikimedia.org/r/347385 (https://phabricator.wikimedia.org/T159319) (owner: Jcrespo)
[15:12:22] <icinga-wm>	 PROBLEM - Disk space on restbase-dev1001 is CRITICAL: DISK CRITICAL - free space: / 127 MB (0% inode=95%)
[15:12:24] <elukey>	 yeah those are new ones... I think that somehow the current master (analytics1002) had a blip for a moment
[15:12:26] <wikibugs__>	 Operations, Ops-Access-Requests: Icinga contact/permissions for cwdent (cdentinger) - https://phabricator.wikimedia.org/T159564#3168602 (cwdent) Open>Resolved @dzahn thank you for the help, everything seems to be working now.
[15:12:39] <paravoid>	 other than that, I don't see anything relevant
[15:12:41] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.0.36:7001 on restbase-dev1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[15:12:59] <elukey>	 paravoid: thanks, will keep checking
[15:13:01] <icinga-wm>	 PROBLEM - cassandra-a service on restbase-dev1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[15:13:21] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.0.36:9042 on restbase-dev1001 is CRITICAL: connect to address 10.64.0.36 and port 9042: Connection refused
[15:13:21] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.0.37:9042 on restbase-dev1001 is CRITICAL: connect to address 10.64.0.37 and port 9042: Connection refused
[15:13:22] <icinga-wm>	 PROBLEM - Check systemd state on restbase-dev1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:13:22] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.0.37:7001 on restbase-dev1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[15:13:31] <icinga-wm>	 PROBLEM - cassandra-b service on restbase-dev1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[15:13:35] <jynus>	 I am going to deploy https://gerrit.wikimedia.org/r/347385
[15:14:16] <marostegui>	 +1
[15:15:28] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1055 after maintenance with full weight (duration: 00m 39s)
[15:15:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:20] <wikibugs>	 Operations: codfw hosts occasionally spend > 3 minutes starting networking.service with linux 4.9 - https://phabricator.wikimedia.org/T162612#3168608 (ema)
[15:17:01] <icinga-wm>	 RECOVERY - cassandra-a service on restbase-dev1001 is OK: OK - cassandra-a is active
[15:17:31] <icinga-wm>	 RECOVERY - cassandra-b service on restbase-dev1001 is OK: OK - cassandra-b is active
[15:18:06] <urandom>	 ^^ on that (also not production)
[15:18:08] <logmsgbot>	 !log mobrovac@tin Started deploy [restbase/deploy@a8d4d02]: (no justification provided)
[15:18:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:21] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.0.37:9042 on restbase-dev1001 is OK: TCP OK - 0.000 second response time on 10.64.0.37 port 9042
[15:18:21] <icinga-wm>	 RECOVERY - Disk space on restbase-dev1001 is OK: DISK OK
[15:18:22] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.0.37:7001 on restbase-dev1001 is OK: SSL OK - Certificate restbase-dev1001-b valid until 2018-01-05 22:53:03 +0000 (expires in 270 days)
[15:18:33] <_joe_>	 urandom: might be related to the deploy of restbase via scap to the test cluster?
[15:18:38] <_joe_>	 oh, no
[15:18:40] <urandom>	 _joe_: nope
[15:19:31] <logmsgbot>	 !log mobrovac@tin Finished deploy [restbase/deploy@a8d4d02]: (no justification provided) (duration: 01m 22s)
[15:19:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:39] <urandom>	 elukey: ping?
[15:20:01] <icinga-wm>	 PROBLEM - cassandra-a service on restbase-dev1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[15:20:11] <wikibugs>	 (Abandoned) Gehel: postgresql - clean up of python / bash code after volans review [puppet] - https://gerrit.wikimedia.org/r/346952 (owner: Gehel)
[15:20:19] <elukey>	 urandom: o/
[15:20:31] <urandom>	 elukey: hi! :)
[15:20:43] <urandom>	 elukey: remember when you reimaged restbase-dev1001 for us?
[15:20:44] <wikibugs>	 (CR) Volans: [C: 1] "LGTM, but let's wait to understand the MW side of it?" [software/conftool] - https://gerrit.wikimedia.org/r/347356 (https://phabricator.wikimedia.org/T156924) (owner: Giuseppe Lavagetto)
[15:21:41] <icinga-wm>	 PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[restbase/deploy]
[15:22:00] <urandom>	 elukey: i've no idea what that process looks like on your end, but somehow that resulted in /srv/ not being the raid0 volume
[15:22:30] <elukey>	 urandom: :(
[15:22:41] <wikibugs__>	 (PS2) Gehel: postgresql - cleanup dead code after migration to check-postgres package [puppet] - https://gerrit.wikimedia.org/r/346964 (https://phabricator.wikimedia.org/T162345)
[15:22:51] <wikibugs>	 (CR) Gehel: "rebased" [puppet] - https://gerrit.wikimedia.org/r/346964 (https://phabricator.wikimedia.org/T162345) (owner: Gehel)
[15:22:56] <elukey>	 urandom: I assumed that the partman config was taking care of everything, but probably I missed something
[15:23:01] <icinga-wm>	 RECOVERY - cassandra-a service on restbase-dev1001 is OK: OK - cassandra-a is active
[15:23:08] <elukey>	 urandom: can you open a phab task? I'll try to fix it
[15:23:15] <urandom>	 elukey: yeah
[15:23:21] <icinga-wm>	 RECOVERY - Check systemd state on restbase-dev1001 is OK: OK - running: The system is fully operational
[15:23:39] <wikibugs__>	 Operations: codfw hosts occasionally spend > 3 minutes starting networking.service with linux 4.9 - https://phabricator.wikimedia.org/T162612#3168635 (MoritzMuehlenhoff) These are tracked under the perf section (Performance monitoring) in Kconfig: "Include support for Intel uncore performance events. These a...
[15:24:21] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.0.36:9042 on restbase-dev1001 is OK: TCP OK - 0.000 second response time on 10.64.0.36 port 9042
[15:24:37] <wikibugs>	 Operations: codfw hosts occasionally spend > 3 minutes starting networking.service with linux 4.9 - https://phabricator.wikimedia.org/T162612#3168636 (MoritzMuehlenhoff) JFTR, this code was only made modular in between Linux 4.4 and 4.9; in Linux 4.4 the code is built-in.
[15:24:41] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.0.36:7001 on restbase-dev1001 is OK: SSL OK - Certificate restbase-dev1001-a valid until 2018-01-05 22:53:02 +0000 (expires in 270 days)
[15:26:16] <wikibugs>	 (CR) Gehel: [C: 2] postgresql - cleanup dead code after migration to check-postgres package [puppet] - https://gerrit.wikimedia.org/r/346964 (https://phabricator.wikimedia.org/T162345) (owner: Gehel)
[15:27:36] <wikibugs>	 (Abandoned) Gehel: WIP - postgresql::user should be idempotent [puppet] - https://gerrit.wikimedia.org/r/298960 (https://phabricator.wikimedia.org/T138092) (owner: Gehel)
[15:28:06] <logmsgbot>	 !log mobrovac@tin Started deploy [restbase/deploy@a8d4d02]: Initial deployment with Scap3 on staging
[15:28:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:36] <icinga-wm>	 RECOVERY - puppet last run on nihal is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:29:58] <wikibugs__>	 (CR) Ottomata: [C: 2] Add python3 packages to hadoop workers for ORES in hadoop [puppet] - https://gerrit.wikimedia.org/r/346812 (owner: Ottomata)
[15:30:04] <wikibugs>	 (PS2) Ottomata: Add python3 packages to hadoop workers for ORES in hadoop [puppet] - https://gerrit.wikimedia.org/r/346812
[15:30:10] <wikibugs__>	 (CR) Ottomata: [V: 2 C: 2] Add python3 packages to hadoop workers for ORES in hadoop [puppet] - https://gerrit.wikimedia.org/r/346812 (owner: Ottomata)
[15:31:37] <logmsgbot>	 !log mobrovac@tin Finished deploy [restbase/deploy@a8d4d02]: Initial deployment with Scap3 on staging (duration: 03m 31s)
[15:31:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:46] <icinga-wm>	 RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[15:32:50] <wikibugs>	 Operations, ops-codfw, Performance-Team: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3168657 (Papaul) a:Papaul>fgiunchedi @fgiunchedi the flash drive is not in graphite2001
[15:33:07] <wikibugs__>	 (PS1) Cmjohnson: Adding mac address that was missing for analytics1068 T162216 [puppet] - https://gerrit.wikimedia.org/r/347398
[15:33:29] <mobrovac>	 !log restbase enabling back puppet in prod
[15:33:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:36] <icinga-wm>	 PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:33:56] <icinga-wm>	 PROBLEM - puppet last run on analytics1042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:34:00] <wikibugs__>	 Operations, Cassandra, Services: RAID-0 volume not mounted on restbase-dev1001.eqiad.wmnet - https://phabricator.wikimedia.org/T162614#3168663 (Eevans)
[15:34:06] <urandom>	 elukey: ^^
[15:34:07] <wikibugs__>	 (PS1) Volans: Logging: use NodeSet for more compact outputs [switchdc] - https://gerrit.wikimedia.org/r/347399 (https://phabricator.wikimedia.org/T160178)
[15:34:21] <elukey>	 thanks urandom, will check tomorrow and try to fix
[15:34:25] <wikibugs__>	 (Abandoned) Cmjohnson: Adding mac address that was missing for analytics1068 T162216 [puppet] - https://gerrit.wikimedia.org/r/347398 (owner: Cmjohnson)
[15:35:06] <logmsgbot>	 !log mobrovac@tin Started deploy [restbase/deploy@a8d4d02]: Initial deployment with Scap3
[15:35:12] <wikibugs>	 (PS4) Rush: admin: add a group for cloud services roots [puppet] - https://gerrit.wikimedia.org/r/346838 (https://phabricator.wikimedia.org/T162404)
[15:35:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:16] <logmsgbot>	 !log mobrovac@tin Finished deploy [restbase/deploy@a8d4d02]: Initial deployment with Scap3 (duration: 00m 10s)
[15:35:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:36] <icinga-wm>	 PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:35:48] <wikibugs__>	 (PS5) Rush: admin: add a group for cloud services roots [puppet] - https://gerrit.wikimedia.org/r/346838 (https://phabricator.wikimedia.org/T162404)
[15:36:09] <wikibugs__>	 Operations, Labs: create a 'root' group with bdavis strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3168682 (chasemp)
[15:36:33] <wikibugs>	 Operations, Ops-Access-Requests, Labs: create a 'root' group with bdavis strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3161730 (chasemp)
[15:38:46] <wikibugs__>	 (PS1) Rush: admin: add bd808 to cs-roots [puppet] - https://gerrit.wikimedia.org/r/347400 (https://phabricator.wikimedia.org/T162404)
[15:39:06] <wikibugs>	 (PS2) Rush: admin: add bd808 to cs-roots [puppet] - https://gerrit.wikimedia.org/r/347400 (https://phabricator.wikimedia.org/T162404)
[15:39:45] <wikibugs>	 Operations, Ops-Access-Requests, Labs: create a 'root' group with bdavis strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3168687 (chasemp) Adding the group with initial roles:  https://gerrit.wikimedia.org/r/#/c/346838/  Adding bd808 if that all makes sense...
[15:42:43] <wikibugs>	 (PS1) Giuseppe Lavagetto: Make reason for disabling puppet fixed [switchdc] - https://gerrit.wikimedia.org/r/347401
[15:42:45] <wikibugs__>	 (PS1) Giuseppe Lavagetto: Filter out verbose output from cumin in dry-run mode [switchdc] - https://gerrit.wikimedia.org/r/347402
[15:43:26] <icinga-wm>	 PROBLEM - puppet last run on analytics1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:43:36] <icinga-wm>	 PROBLEM - puppet last run on analytics1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:44:36] <icinga-wm>	 PROBLEM - puppet last run on analytics1057 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:44:42] <wikibugs__>	 (PS1) Cmjohnson: Adding mac address for analytis1068 [puppet] - https://gerrit.wikimedia.org/r/347403
[15:46:09] <wikibugs>	 (CR) Muehlenhoff: [C: 1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/346838 (https://phabricator.wikimedia.org/T162404) (owner: Rush)
[15:46:36] <icinga-wm>	 PROBLEM - puppet last run on analytics1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:47:25] <cmjohnson1>	 !log troubleshooting link cr2-eqiad:xe-3/0/1 {#2014 to asw-b-eqiad:xe-1/1/2 per T162199
[15:47:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:32] <stashbot>	 T162199: Faulty optics on asw-b-eqiad:xe-1/1/2 - https://phabricator.wikimedia.org/T162199
[15:47:39] <urandom>	 i gather /etc/fstab isn't normally managed by puppet in our fleet, is that correct?
[15:48:16] <icinga-wm>	 PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:48:23] <urandom>	 is it generated by the installer, and then managed by-hand afterward?
[15:49:46] <icinga-wm>	 PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:50:26] <icinga-wm>	 PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:50:35] <wikibugs>	 (CR) Cmjohnson: [C: 2] Adding mac address for analytis1068 [puppet] - https://gerrit.wikimedia.org/r/347403 (owner: Cmjohnson)
[15:50:46] <icinga-wm>	 PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:51:26] <icinga-wm>	 PROBLEM - puppet last run on analytics1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:51:46] <godog>	 urandom: yeah fstab isn't generally touched by puppet, with some exceptions
[15:52:15] <urandom>	 kk
[15:52:16] <icinga-wm>	 PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:52:36] <icinga-wm>	 PROBLEM - puppet last run on analytics1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:53:24] <godog>	 elukey ottomata I take it the puppet failures there are legit?
[15:54:04] <elukey>	 godog: I am still not sure why all the yarn daemons complained at the same time when the new master came up (still without any daemons after the reimage, really weird)
[15:54:11] <elukey>	 I am going to check puppet now
[15:54:32] <godog>	 papaul: re: T161538 yeah the drive isn't currently in graphite2001 but in wmf6406, it needs to be plugged in into graphite2001 now though
[15:54:33] <stashbot>	 T161538: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538
[15:54:46] <icinga-wm>	 PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:55:04] <logmsgbot>	 !log mobrovac@tin Started deploy [restbase/deploy@2c70843]: Initial deployment with Scap3
[15:55:08] <wikibugs>	 Operations, Cassandra, Services: RAID-0 volume not mounted on restbase-dev1001.eqiad.wmnet - https://phabricator.wikimedia.org/T162614#3168712 (Eevans) p:Triage>Normal a:Eevans
[15:55:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:16] <icinga-wm>	 PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:55:26] <icinga-wm>	 PROBLEM - puppet last run on analytics1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:55:28] <elukey>	 godog: Error 400 on SERVER: undefined method `function_create_resources' for nil:NilClass at /etc/puppet/modules/role/manifests/analytics_cluster/hadoop/worker.pp:120 on node analytics1032.eqiad.wmnet 
[15:55:36] <icinga-wm>	 PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:55:52] <godog>	 lolz
[15:57:40] <wikibugs>	 Operations, Cassandra, Services (blocked): RAID-0 volume not mounted on restbase-dev1001.eqiad.wmnet - https://phabricator.wikimedia.org/T162614#3168715 (mobrovac)
[15:58:14] <wikibugs__>	 Operations, Cassandra, Services (doing): RAID-0 volume not mounted on restbase-dev1001.eqiad.wmnet - https://phabricator.wikimedia.org/T162614#3168717 (Eevans)
[15:58:27] <logmsgbot>	 !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t00_disable_puppet(eqiad, codfw) Disabling puppet on MediaWiki jobrunners and videoscalers
[15:58:31] <logmsgbot>	 !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t00_disable_puppet(eqiad, codfw) Successfully completed
[15:58:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:10] <wikibugs__>	 Operations, Discovery, Maps, Interactive-Sprint, and 2 others: Reduce number of false positive alerts on postgresql lag for maps - https://phabricator.wikimedia.org/T162345#3168718 (Gehel) No new false positive since changes have been deployed. Things are looking good!
[15:59:26] <icinga-wm>	 PROBLEM - puppet last run on analytics1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:01:16] <icinga-wm>	 PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:01:36] <icinga-wm>	 PROBLEM - puppet last run on analytics1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:02:36] <icinga-wm>	 PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:02:56] <logmsgbot>	 !log mobrovac@tin Finished deploy [restbase/deploy@2c70843]: Initial deployment with Scap3 (duration: 07m 52s)
[16:03:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:49] <papaul>	 godog: it is in graphite2001
[16:06:57] <godog>	 papaul: thanks! I'll let you know when it is safe to remove from there
[16:07:46] <wikibugs__>	 Operations, ops-codfw, Performance-Team: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3168732 (Papaul) @fgiunchedi  the drive is now in graphite2001
[16:07:57] <elukey>	 godog: I think it is related to https://gerrit.wikimedia.org/r/346812
[16:08:25] <wikibugs__>	 (CR) Daniel Kinzler: [C: 1] "Yes, we want this. This has been extensively tested now by Ladsgroop, Hoo, and myself." [mediawiki-config] - https://gerrit.wikimedia.org/r/347395 (https://phabricator.wikimedia.org/T159826) (owner: Ladsgroup)
[16:08:44] <wikibugs>	 (PS2) Giuseppe Lavagetto: Filter out verbose output from cumin from stdout [switchdc] - https://gerrit.wikimedia.org/r/347402
[16:09:30] <wikibugs>	 (PS2) Daniel Kinzler: Use redis lock manager for dispatching in all production repo instances [mediawiki-config] - https://gerrit.wikimedia.org/r/347395 (https://phabricator.wikimedia.org/T159826) (owner: Ladsgroup)
[16:12:51] <wikibugs__>	 Operations, Wikimedia-Mailing-lists: Reset admin password for wikimania-program mailing list - https://phabricator.wikimedia.org/T162080#3168740 (Dzahn) @eyoung @aklapper This can be closed i assume, right?
[16:13:16] <wikibugs>	 (CR) Volans: [C: 2] Make reason for disabling puppet fixed [switchdc] - https://gerrit.wikimedia.org/r/347401 (owner: Giuseppe Lavagetto)
[16:19:42] <wikibugs__>	 Operations, Ops-Access-Requests: Icinga contact/permissions for cwdent (cdentinger) - https://phabricator.wikimedia.org/T159564#3168765 (Dzahn) @cwdent great! thanks for confirming
[16:20:33] <wikibugs>	 (CR) Giuseppe Lavagetto: "Yes, you should coordinate with ops before trying this." [mediawiki-config] - https://gerrit.wikimedia.org/r/347395 (https://phabricator.wikimedia.org/T159826) (owner: Ladsgroup)
[16:20:43] <wikibugs__>	 (PS1) Bartosz Dziewoński: Remove defunct $wgForeignUploadTestEnabled for cross-wiki upload A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/347412
[16:20:48] <MatmaRex>	 anyone wants to deploy a no-op? https://gerrit.wikimedia.org/r/347412
[16:21:33] <Dereckson>	 MatmaRex: it's not a labs only, so it's better to include it to SWAT
[16:21:49] <MatmaRex>	 ok
[16:22:08] <Dereckson>	 jouncebot: next
[16:22:08] <jouncebot>	 In 0 hour(s) and 37 minute(s): Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170410T1700)
[16:22:26] <Dereckson>	 jouncebot: in 1 hours 37 minutes the next one I think
[16:27:40] <wikibugs__>	 Operations, Labs: Initial OpenStack Neutron PoC deployment in Labtest - https://phabricator.wikimedia.org/T153099#3168775 (chasemp) We have neutron deployed initially but not functioning as a control plane for anything yet.  The first part of this has been working through the keystone/ldap/neutron integr...
[16:31:08] <wikibugs>	 (PS1) Elukey: Revert "Add python3 packages to hadoop workers for ORES in hadoop" [puppet] - https://gerrit.wikimedia.org/r/347414
[16:31:22] <wikibugs__>	 (PS2) Elukey: Revert "Add python3 packages to hadoop workers for ORES in hadoop" [puppet] - https://gerrit.wikimedia.org/r/347414
[16:34:24] <wikibugs__>	 (CR) Elukey: [C: 2] Revert "Add python3 packages to hadoop workers for ORES in hadoop" [puppet] - https://gerrit.wikimedia.org/r/347414 (owner: Elukey)
[16:35:25] <wikibugs>	 (CR) Volans: [C: 2] Filter out verbose output from cumin from stdout [switchdc] - https://gerrit.wikimedia.org/r/347402 (owner: Giuseppe Lavagetto)
[16:36:17] <elukey>	 godog: puppet runs should be fixed
[16:36:26] <icinga-wm>	 RECOVERY - puppet last run on analytics1040 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[16:36:31] <elukey>	 just did --^
[16:36:56] <godog>	 elukey: neat, thanks! I think it might be the version specification in one of the packages?
[16:37:15] <elukey>	 godog: could be yes, not sure, I'll ping Andrew! 
[16:37:34] <elukey>	 there are multiple packages with two versions available on apt-cache
[16:37:39] <elukey>	 so not sure if this could be another issue
[16:40:44] <wikibugs>	 (PS2) Andrew Bogott: fullstack:  Switch to the normal jessie image [puppet] - https://gerrit.wikimedia.org/r/346984
[16:40:49] <wikibugs__>	 (PS1) Andrew Bogott: nova-scheduler:  Temporarily depool labvirt1002 [puppet] - https://gerrit.wikimedia.org/r/347415 (https://phabricator.wikimedia.org/T162529)
[16:41:16] <wikibugs>	 (CR) Rush: [C: 1] nova-scheduler:  Temporarily depool labvirt1002 [puppet] - https://gerrit.wikimedia.org/r/347415 (https://phabricator.wikimedia.org/T162529) (owner: Andrew Bogott)
[16:42:16] <wikibugs__>	 Operations, Ops-Access-Requests, Labs: create a 'root' group with bdavis strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3161730 (RobH) >>! In T162404#3168687, @chasemp wrote: > Adding the group with initial roles: >  > https://gerrit.wikimedia.org/r/#/c/34...
[16:42:35] <icinga-wm>	 RECOVERY - puppet last run on analytics1043 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[16:42:35] <icinga-wm>	 RECOVERY - puppet last run on analytics1033 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[16:42:39] <wikibugs__>	 (CR) Andrew Bogott: [C: 2] nova-scheduler:  Temporarily depool labvirt1002 [puppet] - https://gerrit.wikimedia.org/r/347415 (https://phabricator.wikimedia.org/T162529) (owner: Andrew Bogott)
[16:43:35] <icinga-wm>	 RECOVERY - puppet last run on analytics1057 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[16:44:35] <icinga-wm>	 RECOVERY - puppet last run on analytics1035 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[16:45:48] <logmsgbot>	 !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t09_restart_parsoid(codfw, eqiad) Rolling restart parsoid in eqiad and codfw
[16:45:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:15] <icinga-wm>	 RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[16:48:20] <_joe_>	 !log not really restarting parsoid, still testing switchdc
[16:48:25] <icinga-wm>	 RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[16:48:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:45] <icinga-wm>	 RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[16:48:45] <icinga-wm>	 RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[16:50:25] <icinga-wm>	 RECOVERY - puppet last run on analytics1052 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[16:50:35] <icinga-wm>	 RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[16:51:25] <icinga-wm>	 RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[16:53:15] <icinga-wm>	 RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[16:53:45] <icinga-wm>	 RECOVERY - puppet last run on analytics1041 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[16:54:35] <icinga-wm>	 RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[16:56:31] <wikibugs>	 (Abandoned) Volans: Logging: use NodeSet for more compact outputs [switchdc] - https://gerrit.wikimedia.org/r/347399 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[16:56:38] <wikibugs__>	 (CR) Dereckson: "I'm not sure about * Bureaucrats can demote sysops and bureaucrats." [mediawiki-config] - https://gerrit.wikimedia.org/r/347214 (https://phabricator.wikimedia.org/T162510) (owner: Urbanecm)
[16:57:25] <icinga-wm>	 RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[17:00:04] <jouncebot>	 gehel: Respected human, time to deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170410T1700). Please do the needful.
[17:00:04] <jouncebot>	 gehel: A patch you scheduled for Weekly Wikidata query service deployment window is about to be deployed. Please be available during the process.
[17:00:15] <icinga-wm>	 RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[17:00:35] <icinga-wm>	 RECOVERY - puppet last run on analytics1037 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[17:00:35] <icinga-wm>	 RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[17:01:05] <icinga-wm>	 RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[17:01:05] <wikibugs__>	 Operations, Ops-Access-Requests, Labs: create a 'root' group with bdavis strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3161730 (faidon) Looks good to me and there were no objections in the ops meeting either.  Bikeshedding a little bit: the "cs-roots" nam...
[17:01:26] <wikibugs>	 Operations, Ops-Access-Requests, Labs: create a 'root' group with bdavis strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3161730 (Dzahn) I would vote for **cloud-roots**.  "cloud" seems the most obvious term for the team and of 17 root groups, 16 end in -ro...
[17:01:30] <wikibugs__>	 Operations, RESTBase, Patch-For-Review, Scap (Scap3-Adoption-Phase1), and 2 others: Deploy RESTBase with scap3 - https://phabricator.wikimedia.org/T116335#3168878 (mobrovac)
[17:01:45] <icinga-wm>	 RECOVERY - puppet last run on analytics1042 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[17:04:25] <icinga-wm>	 RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[17:04:26] <wikibugs>	 06Operations, 10Ops-Access-Requests, 06Labs: create a 'root' group with bdavis strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3168881 (10chasemp) I'm open to what seems best :)  `wmcs-roots` seems most popular, but `cloud-roots` is really clear.  I'm punting to @b...
[17:04:49] <logmsgbot>	 !log gehel@tin Started deploy [wdqs/wdqs@1cfbd8d]: (no justification provided)
[17:04:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:05] <wikibugs>	 06Operations, 10Ops-Access-Requests, 06Labs: create a 'root' group with bdavis strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3161730 (10fgiunchedi) from the "naming is hard" bandwagon: I'd suggest an easily grep-able and unambiguous name like `wmcs` to be used ev...
[17:05:55] <icinga-wm>	 RECOVERY - Host lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 36.59 ms
[17:06:11] <logmsgbot>	 !log gehel@tin Finished deploy [wdqs/wdqs@1cfbd8d]: (no justification provided) (duration: 01m 22s)
[17:06:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:27] <gehel>	 SMalyshev: deployment done and looking good
[17:06:56] <wikibugs__>	 (03PS5) 10Gehel: wdqs: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/346543 (https://phabricator.wikimedia.org/T162111)
[17:08:05] <icinga-wm>	 PROBLEM - pybal on lvs2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal
[17:08:05] <icinga-wm>	 PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:08:05] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090
[17:08:12] <wikibugs>	 06Operations, 10Ops-Access-Requests, 06Labs: create a 'root' group with bdavis strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3168886 (10chasemp) Good point @fgiunchedi -- in purely searchable terms using `cs` is probably too diminutive and `wmcs` will be more se...
[17:09:07] <wikibugs__>	 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup 22 DB servers - https://phabricator.wikimedia.org/T162159#3168887 (10Papaul)
[17:09:10] <wikibugs>	 06Operations, 10Ops-Access-Requests, 06Labs: create a 'root' group with bdavis strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3168888 (10bd808) Let's do `wmcs-roots`. WMCS is our official short form written name for Wikimedia Cloud Services. Cookie licking `cloud`...
[17:09:14] <wikibugs>	 (03CR) 10Gehel: [C: 032] wdqs: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/346543 (https://phabricator.wikimedia.org/T162111) (owner: 10Gehel)
[17:10:10] <gehel>	 SMalyshev, bblack: I merged https://gerrit.wikimedia.org/r/#/c/346543/ (WDQS going active / active). I'll keep an eye on logs / graphs, but ping me if you see anything strange...
[17:10:54] <wikibugs__>	 (03CR) 10Faidon Liambotis: [C: 04-1] "If we're going for a separate file, /etc/ferm/conntrack-sysctl.conf as I this patch has done, the applying this via a ferm post hook ('@ho" [puppet] - 10https://gerrit.wikimedia.org/r/320197 (https://phabricator.wikimedia.org/T136094) (owner: 10Muehlenhoff)
[17:11:27] <wikibugs>	 06Operations, 10Wikimedia-Mailing-lists: Reset admin password for wikimania-program mailing list - https://phabricator.wikimedia.org/T162080#3168894 (10eyoung) yes thank you!!! I'm swamped with scholarship wikimania stuff or I would have responded sooner.  thanks for your help!
[17:12:08] <wikibugs__>	 06Operations, 10Wikimedia-Mailing-lists: Reset admin password for wikimania-program mailing list - https://phabricator.wikimedia.org/T162080#3168896 (10Dzahn) 05Open>03Resolved a:03Dzahn @eyoung ok, great :) closing ticket
[17:12:49] <SMalyshev>	 gehel: ok
[17:14:55] <icinga-wm>	 PROBLEM - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100%
[17:15:25] <elukey>	 wow
[17:15:41] <papaul>	 elukey: working on it
[17:15:45] <elukey>	 ahh okok :)
[17:16:04] <elukey>	 papaul: can you log it in the sal? 
[17:16:20] <papaul>	 sure 
[17:16:45] <elukey>	 thanks, just to have it logged somewhere for other people :)
[17:16:45] <papaul>	 !log testing lvs2002 after mainboard replacement
[17:16:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:05] <icinga-wm>	 RECOVERY - pybal on lvs2002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal
[17:17:15] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2002 is OK: PYBAL OK - All pools are healthy
[17:17:15] <icinga-wm>	 RECOVERY - Host lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms
[17:17:16] <wikibugs__>	 (03CR) 10Muehlenhoff: "The ferm post hook didn't work, see my earlier test results at https://gerrit.wikimedia.org/r/#/c/319071/" [puppet] - 10https://gerrit.wikimedia.org/r/320197 (https://phabricator.wikimedia.org/T136094) (owner: 10Muehlenhoff)
[17:18:02] <wikibugs>	 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3168914 (10Gehel) @Papaul has put in place new SSD in that server.  I've been running the same kind of load test as before for most of the day...
[17:18:04] <wikibugs__>	 (03CR) 10Muehlenhoff: "The sysctl config file is populated via https://gerrit.wikimedia.org/r/#/c/319071/" [puppet] - 10https://gerrit.wikimedia.org/r/320197 (https://phabricator.wikimedia.org/T136094) (owner: 10Muehlenhoff)
[17:18:05] <icinga-wm>	 RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[17:19:25] <wikibugs__>	 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3168916 (10Papaul) main board replacement complete on lvs2002, System is back up. @elukey please check everything is okay while I am on site.  Thanks.
[17:19:31] <papaul>	 elukey: https://phabricator.wikimedia.org/T162099
[17:23:09] <wikibugs>	 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3144779 (10greg) >>! In T161836#3146624, @fgiunchedi wrote: > Since some files linked here seem to 200 now (instead of 40...
[17:29:14] <wikibugs__>	 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3168958 (10Papaul) a:05Papaul>03elukey
[17:31:55] <wikibugs__>	 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3168964 (10elukey) Summary of the various issues as I understand them:  1) the descrip...
[17:31:56] <wikibugs>	 (03PS1) 10Catrope: Enable Flow beta feature on frwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347419 (https://phabricator.wikimedia.org/T162022)
[17:32:45] <wikibugs__>	 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3168972 (10BBlack) a:05elukey>03BBlack Switching this to me
[17:38:29] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Fix output typo in the redis task [switchdc] - 10https://gerrit.wikimedia.org/r/347420
[17:38:36] <_joe_>	 volans: ^^
[17:40:05] <wikibugs>	 (03CR) 10Volans: [C: 031] "LGTM" [switchdc] - 10https://gerrit.wikimedia.org/r/347420 (owner: 10Giuseppe Lavagetto)
[17:41:52] <wikibugs__>	 (03CR) 10Jforrester: [C: 031] "Cleanup noop, good to go whenever." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347412 (owner: 10Bartosz Dziewoński)
[17:44:10] <wikibugs>	 (03PS1) 10Ottomata: Revert "Revert "Add python3 packages to hadoop workers for ORES in hadoop"" [puppet] - 10https://gerrit.wikimedia.org/r/347421
[17:44:25] <wikibugs__>	 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3168987 (10Papaul) @Gehel Thank you.
[17:44:42] <wikibugs>	 (03PS1) 10Jforrester: Wikitech: Don't try to do cross-wiki uploads to Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347422 (https://phabricator.wikimedia.org/T162374)
[17:44:53] <wikibugs__>	 (03PS2) 10Ottomata: Revert "Revert "Add python3 packages to hadoop workers for ORES in hadoop"" [puppet] - 10https://gerrit.wikimedia.org/r/347421
[17:47:59] <wikibugs__>	 (03CR) 10Ottomata: [C: 032] Revert "Revert "Add python3 packages to hadoop workers for ORES in hadoop"" [puppet] - 10https://gerrit.wikimedia.org/r/347421 (owner: 10Ottomata)
[17:50:07] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Correct offset of the main task to be 0 [switchdc] - 10https://gerrit.wikimedia.org/r/347423
[17:50:47] <ottomata>	 i think i'm about to get some puppet errors ...
[17:50:55] <wikibugs__>	 (03PS1) 10Ottomata: Separating out require_package calls in attempt to figure out weird error message [puppet] - 10https://gerrit.wikimedia.org/r/347424
[17:51:17] <wikibugs>	 (03CR) 10Ottomata: [V: 032 C: 032] Separating out require_package calls in attempt to figure out weird error message [puppet] - 10https://gerrit.wikimedia.org/r/347424 (owner: 10Ottomata)
[17:51:25] <icinga-wm>	 PROBLEM - puppet last run on analytics1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:51:26] <icinga-wm>	 PROBLEM - puppet last run on analytics1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:51:29] <wikibugs>	 (03PS4) 10Catrope: Enable RCFilters beta feature on fawiki, ruwiki, trwiki, and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343438
[17:51:30] <ottomata>	 yayayay
[17:51:31] <ottomata>	 i know
[17:51:34] <elukey>	 ahhahah
[17:51:44] <wikibugs__>	 (03PS5) 10Catrope: Enable RCFilters beta feature on fawiki, ruwiki, trwiki, and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343438 (https://phabricator.wikimedia.org/T144458)
[17:51:44] <elukey>	 !log restore Hadoop masters to analytics1001
[17:51:45] <icinga-wm>	 PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:51:45] <icinga-wm>	 PROBLEM - puppet last run on analytics1042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:51:50] <ottomata>	 i'm shooting in the dark at it
[17:51:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:59] <ottomata>	 wut the crap crackers is this
[17:52:19] <wikibugs__>	 (03PS4) 10Catrope: Enable RCFilters beta feature on all wikis except wikidatawiki, nlwiki, cswiki and etwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343439
[17:52:25] <icinga-wm>	 PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:52:32] <wikibugs>	 (03PS5) 10Catrope: Enable RCFilters beta feature on all wikis except wikidatawiki, nlwiki, cswiki and etwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343439 (https://phabricator.wikimedia.org/T144458)
[17:52:35] <icinga-wm>	 PROBLEM - puppet last run on analytics1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:52:55] <wikibugs>	 (03PS1) 10Ottomata: Using array argument for require_package [puppet] - 10https://gerrit.wikimedia.org/r/347425
[17:53:09] <wikibugs>	 (03CR) 10Catrope: "For deployment 2017-04-11 13:00 UTC" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343438 (https://phabricator.wikimedia.org/T144458) (owner: 10Catrope)
[17:53:13] <wikibugs__>	 (03CR) 10Ottomata: [V: 032 C: 032] Using array argument for require_package [puppet] - 10https://gerrit.wikimedia.org/r/347425 (owner: 10Ottomata)
[17:54:26] <wikibugs__>	 (03PS2) 10Catrope: Enable RCFilters beta feature on all remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347045
[17:54:37] <wikibugs>	 (03PS1) 10Ottomata: Remove python3-numpy=version from list of installed packages to test fix require_package error [puppet] - 10https://gerrit.wikimedia.org/r/347426
[17:54:38] <wikibugs__>	 (03PS3) 10Catrope: Enable RCFilters beta feature on all remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347045 (https://phabricator.wikimedia.org/T144458)
[17:54:58] <wikibugs>	 (03CR) 10Ottomata: [V: 032 C: 032] Remove python3-numpy=version from list of installed packages to test fix require_package error [puppet] - 10https://gerrit.wikimedia.org/r/347426 (owner: 10Ottomata)
[17:55:15] <icinga-wm>	 PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:55:35] <icinga-wm>	 PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:57:54] <wikibugs__>	 (03PS1) 10Ottomata: Ensuring python and python3 versions of numpy are the same on hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/347427
[17:58:17] <wikibugs>	 (03CR) 10Ottomata: [V: 032 C: 032] Ensuring python and python3 versions of numpy are the same on hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/347427 (owner: 10Ottomata)
[17:58:45] <icinga-wm>	 RECOVERY - puppet last run on analytics1041 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[17:59:46] <wikibugs__>	 06Operations, 10ops-eqiad, 10netops: Faulty optics on asw-b-eqiad:xe-1/1/2 - https://phabricator.wikimedia.org/T162199#3169051 (10ayounsi) 05Open>03Resolved Interface has been stable. Everything looks good. Thanks!
[17:59:51] <wikibugs>	 (03PS1) 10Ottomata: Fix missing comma in package list [puppet] - 10https://gerrit.wikimedia.org/r/347428
[18:00:04] <wikibugs__>	 (03CR) 10Ottomata: [V: 032 C: 032] Fix missing comma in package list [puppet] - 10https://gerrit.wikimedia.org/r/347428 (owner: 10Ottomata)
[18:00:04] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170410T1800). Please do the needful.
[18:00:04] <jouncebot>	 Amir1: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[18:00:17] <Amir1>	 o/
[18:00:25] <icinga-wm>	 PROBLEM - puppet last run on analytics1038 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python3-sklearn]
[18:01:15] <icinga-wm>	 PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:01:35] <icinga-wm>	 PROBLEM - puppet last run on analytics1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:01:45] <icinga-wm>	 PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:02:04] <wikibugs__>	 06Operations, 10netops: Slight packet loss observed on the network starting Nov 2016 - https://phabricator.wikimedia.org/T154507#3169060 (10ayounsi) 05Open>03Resolved a:03ayounsi > XioNoX> I'm secretly hoping that T154507 was caused by T162199, it's on the path, and the LACP hashing algorithm would expla...
[18:02:20] <wikibugs>	 (03PS1) 10Ottomata: Use package resource instead of require_package for ensuring specific python-numpy versions [puppet] - 10https://gerrit.wikimedia.org/r/347429
[18:02:35] <icinga-wm>	 PROBLEM - puppet last run on analytics1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:02:35] <icinga-wm>	 PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:03:05] <thcipriani>	 I can SWAT today
[18:03:15] <wikibugs>	 (03PS2) 10Thcipriani: Enable ORES review tool in hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345619 (https://phabricator.wikimedia.org/T161621) (owner: 10Ladsgroup)
[18:03:30] <wikibugs__>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345619 (https://phabricator.wikimedia.org/T161621) (owner: 10Ladsgroup)
[18:03:46] <wikibugs__>	 (03CR) 10Ottomata: [C: 032] Use package resource instead of require_package for ensuring specific python-numpy versions [puppet] - 10https://gerrit.wikimedia.org/r/347429 (owner: 10Ottomata)
[18:04:05] <icinga-wm>	 PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:04:39] <wikibugs>	 (03Merged) 10jenkins-bot: Enable ORES review tool in hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345619 (https://phabricator.wikimedia.org/T161621) (owner: 10Ladsgroup)
[18:05:25] <icinga-wm>	 PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:06:57] <thcipriani>	 !log create ores tables on hewiki
[18:07:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:48] <wikibugs__>	 (03CR) 10jenkins-bot: Enable ORES review tool in hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345619 (https://phabricator.wikimedia.org/T161621) (owner: 10Ladsgroup)
[18:07:50] <wikibugs>	 (03PS1) 10Ottomata: Install python3-sklearn-lib and make sure python3-numpy is installed first [puppet] - 10https://gerrit.wikimedia.org/r/347430
[18:08:46] <thcipriani>	 Amir1: change is live on mwdebug1002, check please
[18:08:53] <Amir1>	 Thanks
[18:09:52] <wikibugs>	 (03CR) 10Ottomata: [C: 032] Install python3-sklearn-lib and make sure python3-numpy is installed first [puppet] - 10https://gerrit.wikimedia.org/r/347430 (owner: 10Ottomata)
[18:10:13] <Amir1>	 thcipriani: basic parts look fine, can you run the maintenance script to check out the recent changes?
[18:10:32] <Amir1>	 (you need to pull the change to terbium too, last time it took some of my time :D)
[18:11:53] <wikibugs>	 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3169146 (10BBlack) a:05BBlack>03ayounsi @papaul Everything looks good with lvs2002 (checked icinga, interfaces on correct vlans, etc).  @ayounsi Let's let it burn in with no traffic until tomorrow s...
[18:12:09] <thcipriani>	 Amir1: yup, running now
[18:12:35] <icinga-wm>	 RECOVERY - puppet last run on analytics1043 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[18:12:48] <thcipriani>	 !log mwscript extensions/ORES/maintenance/CheckModelVersions.php hewiki && mwscript extensions/ORES/maintenance/PopulateDatabase.php hewiki
[18:12:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:48] <wikibugs__>	 (03PS1) 10Ottomata: Reorder more python sklearn deps to work around require_package issue [puppet] - 10https://gerrit.wikimedia.org/r/347431
[18:15:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Reorder more python sklearn deps to work around require_package issue [puppet] - 10https://gerrit.wikimedia.org/r/347431 (owner: 10Ottomata)
[18:17:00] <wikibugs__>	 (03PS2) 10Ottomata: Reorder more python sklearn deps to work around require_package issue [puppet] - 10https://gerrit.wikimedia.org/r/347431
[18:19:11] <Amir1>	 thcipriani: It looks okay
[18:19:39] <wikibugs__>	 (03CR) 10Ottomata: [C: 032] Reorder more python sklearn deps to work around require_package issue [puppet] - 10https://gerrit.wikimedia.org/r/347431 (owner: 10Ottomata)
[18:19:50] <thcipriani>	 Amir1: cool, is it fine to sync out the change while populateDatabase is finishing up?
[18:20:09] <Amir1>	 With the progress it made until now, I think so
[18:20:20] <thcipriani>	 ok, doing
[18:20:25] <icinga-wm>	 RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[18:20:27] <icinga-wm>	 RECOVERY - puppet last run on analytics1052 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[18:20:35] <icinga-wm>	 RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[18:22:26] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:345619|Enable ORES review tool in hewiki]] T161621 (duration: 00m 39s)
[18:22:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:33] <stashbot>	 T161621: Deploy ORES Review Tool for hewiki - https://phabricator.wikimedia.org/T161621
[18:22:34] <thcipriani>	 ^ Amir1 live everywhere now
[18:22:49] <thcipriani>	 I'll ping you when populateDatabase finishes
[18:22:54] <Amir1>	 thanks
[18:23:45] <icinga-wm>	 RECOVERY - puppet last run on analytics1041 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[18:25:15] <icinga-wm>	 RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[18:25:22] <thcipriani>	 Amir1: populateDatabase just finished
[18:25:26] <icinga-wm>	 RECOVERY - puppet last run on analytics1040 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[18:25:35] <icinga-wm>	 RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[18:25:50] <Amir1>	 Thanks!
[18:25:54] <wikibugs>	 (03CR) 10Volans: "It looks more a workaround than a fix, probably at this point it would be better to change the menu.append() to properly save the items in" [switchdc] - 10https://gerrit.wikimedia.org/r/347423 (owner: 10Giuseppe Lavagetto)
[18:26:17] <Amir1>	 I think we might need to change the default precision but it's outside the scope of this patch
[18:28:15] <wikibugs__>	 (03PS1) 10Ottomata: Add new hadoop worker nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/347432 (https://phabricator.wikimedia.org/T155065)
[18:28:25] <icinga-wm>	 RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[18:30:15] <icinga-wm>	 RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[18:30:23] <wikibugs>	 (03CR) 10Ottomata: [C: 032] Add new hadoop worker nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/347432 (https://phabricator.wikimedia.org/T155065) (owner: 10Ottomata)
[18:32:35] <icinga-wm>	 RECOVERY - puppet last run on analytics1037 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[18:32:35] <icinga-wm>	 RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[18:32:45] <icinga-wm>	 RECOVERY - puppet last run on analytics1042 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[18:32:49] <ottomata>	 hey cmjohnson1 i'm starting to puppetize the new hadoop workers
[18:32:54] <ottomata>	 is 1068 still a little wonky?
[18:33:10] <cmjohnson1>	 no, it's good
[18:33:19] <ottomata>	 it looks like it's in a busybox console or something?
[18:34:05] <icinga-wm>	 RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[18:35:12] <ottomata>	 cmjohnson1:  ^^
[18:35:25] <icinga-wm>	 RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[18:35:40] <cmjohnson1>	 ottomata: looking 
[18:43:26] <wikibugs__>	 06Operations, 10Cassandra, 06Services (done): RAID-0 volume not mounted on restbase-dev1001.eqiad.wmnet - https://phabricator.wikimedia.org/T162614#3169273 (10Eevans) 05Open>03Resolved This should be done; I did the following:  - Brought down Cassandra, and masked the systemd units - Reformatted `/dev/re...
[18:44:01] <wikibugs__>	 (03PS6) 10Rush: admin: add a group for cloud services roots (wmcs-roots) [puppet] - 10https://gerrit.wikimedia.org/r/346838 (https://phabricator.wikimedia.org/T162404)
[18:44:04] <wikibugs>	 (03PS3) 10Rush: admin: add bd808 to wmcs-roots [puppet] - 10https://gerrit.wikimedia.org/r/347400 (https://phabricator.wikimedia.org/T162404)
[18:44:53] <wikibugs>	 (03PS7) 10Rush: admin: add a group for cloud services roots (wmcs-roots) [puppet] - 10https://gerrit.wikimedia.org/r/346838 (https://phabricator.wikimedia.org/T162404)
[18:45:08] <wikibugs__>	 (03PS4) 10Rush: admin: add bd808 to wmcs-roots [puppet] - 10https://gerrit.wikimedia.org/r/347400 (https://phabricator.wikimedia.org/T162404)
[18:47:21] <wikibugs__>	 (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/6073/" [puppet] - 10https://gerrit.wikimedia.org/r/347024 (owner: 10Dzahn)
[18:49:36] <wikibugs>	 (03CR) 10Rush: [C: 032] admin: add a group for cloud services roots (wmcs-roots) [puppet] - 10https://gerrit.wikimedia.org/r/346838 (https://phabricator.wikimedia.org/T162404) (owner: 10Rush)
[18:50:18] <wikibugs__>	 (03CR) 10Rush: [C: 032] admin: add bd808 to wmcs-roots [puppet] - 10https://gerrit.wikimedia.org/r/347400 (https://phabricator.wikimedia.org/T162404) (owner: 10Rush)
[18:56:25] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 368.00 seconds
[18:56:46] <cmjohnson1>	 ottomata: it's failing in the partitioner
[18:56:51] <cmjohnson1>	 don't know why 
[18:56:55] <icinga-wm>	 PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:57:00] <wikibugs__>	 06Operations, 10Ops-Access-Requests, 06Labs: create a 'root' group with bdavis strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3169334 (10chasemp) 05Open>03Resolved a:03chasemp >>! In T162404#3168888, @bd808 wrote: > Let's do `wmcs-roots`. WMCS is our officia...
[18:59:31] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 21 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[19:00:34] <wikibugs>	 (03PS2) 10Dzahn: standardize "include ::base:*" [puppet] - 10https://gerrit.wikimedia.org/r/347024
[19:02:56] <wikibugs__>	 (03CR) 10Dzahn: [C: 032] standardize "include ::base:*" [puppet] - 10https://gerrit.wikimedia.org/r/347024 (owner: 10Dzahn)
[19:03:04] <wikibugs>	 (03PS3) 10Dzahn: standardize "include ::base:*" [puppet] - 10https://gerrit.wikimedia.org/r/347024
[19:04:31] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[19:05:21] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 46.00 seconds
[19:08:08] <Amir1>	 Ops, Who should I talk to if I'm going to deploy this? https://gerrit.wikimedia.org/r/#/c/347395/
[19:08:23] <Amir1>	 checking redis instances
[19:11:43] <bd808>	 Amir1: elukey has been looking at redis stuff a lot recently. That might be a good person to start with
[19:11:56] <Amir1>	 Thanks
[19:13:21] <icinga-wm>	 PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:17:05] <wikibugs>	 06Operations, 10ops-ulsfo, 10fundraising-tech-ops: rack/setup frbackup2001 - https://phabricator.wikimedia.org/T162469#3169388 (10Papaul)
[19:20:09] <wikibugs>	 (03PS1) 10Hashar: jenkins: tweak log permissions [puppet] - 10https://gerrit.wikimedia.org/r/347435
[19:20:43] <ottomata>	 hmm, ok weird i'll poke at it then cmjohnson1
[19:20:53] <wikibugs>	 (03PS2) 10Hashar: jenkins: tweak log permissions [puppet] - 10https://gerrit.wikimedia.org/r/347435
[19:20:57] <ottomata>	 cmjohnson1:  also, something seems weird with analytics1064 too
[19:20:59] <ottomata>	 it installed fine
[19:21:11] <ottomata>	 but it gets Error: Could not request certificate: getaddrinfo: Name or service not known when attempting to talk to puppetmaster
[19:21:29] <wikibugs__>	 (03PS1) 10Urbanecm: Increase default image thumbnail size on Finnish Wikipedia to 250px [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347436 (https://phabricator.wikimedia.org/T162376)
[19:21:49] <wikibugs>	 (03CR) 10Hashar: "That one should be better. I have dropped the dupe File[/var/log/jenkins/access.log]. It is already created by systemd::syslog." [puppet] - 10https://gerrit.wikimedia.org/r/347435 (owner: 10Hashar)
[19:22:59] <ottomata>	 cmjohnson1:  can you leave mgmt console com2 session on 1068
[19:23:05] <ottomata>	 i will powercycle and see what i can find
[19:23:23] <cmjohnson1>	 out
[19:23:51] <icinga-wm>	 RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[19:24:32] <wikibugs>	 (03CR) 10Ottomata: [C: 031] Add JVM options tunables for Yarn RM and Hadoop DN/NN [puppet/cdh] - 10https://gerrit.wikimedia.org/r/347353 (https://phabricator.wikimedia.org/T159219) (owner: 10Elukey)
[19:27:49] <wikibugs>	 (03PS14) 10EBernhardson: Upgrade logstash to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473)
[19:27:51] <icinga-wm>	 PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0]
[19:28:09] <wikibugs>	 (03PS2) 10Dzahn: typos: make "include profile::backup" a typo [puppet] - 10https://gerrit.wikimedia.org/r/347064
[19:28:42] <ottomata>	 cmjohnson1:  was it just stuck on partitioning this once?
[19:28:47] <ottomata>	 or did you try installing multiple times?
[19:28:51] <icinga-wm>	 PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 646 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3080666 keys, up 18 days 3 hours - replication_delay is 646
[19:29:31] <cmjohnson1>	 ottomata, no, every time and I went back and cleared it and did it again
[19:29:35] <cmjohnson1>	 the raid cfg
[19:29:39] <cmjohnson1>	 please feel free to look at it
[19:29:39] <wikibugs>	 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3169490 (10Papaul) @BBlack  Thanks.
[19:29:51] <icinga-wm>	 RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3057719 keys, up 18 days 3 hours - replication_delay is 0
[19:30:12] <ottomata>	 ok
[19:30:34] <ottomata>	 ok i'll look at 1068, if you have a sec could you see if you can figure out why 1064 isn't talking to puppetmaster?
[19:30:51] <icinga-wm>	 PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 3.33% of data above the critical threshold [1000.0]
[19:31:09] <wikibugs>	 (03CR) 10Dzahn: [C: 032] typos: make "include profile::backup" a typo [puppet] - 10https://gerrit.wikimedia.org/r/347064 (owner: 10Dzahn)
[19:33:44] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "no new "if $realm" checks please" [puppet] - 10https://gerrit.wikimedia.org/r/347189 (owner: 10Paladox)
[19:34:39] <wikibugs>	 (03CR) 10Dzahn: "no new "if $realm"-checks please. we can convert the phab class to profile soon" [puppet] - 10https://gerrit.wikimedia.org/r/347188 (owner: 10Paladox)
[19:34:45] <wikibugs__>	 (03CR) 10Dzahn: [C: 04-1] Phabricator: Make backup optional [puppet] - 10https://gerrit.wikimedia.org/r/347188 (owner: 10Paladox)
[19:35:24] <wikibugs__>	 (03PS3) 10Dzahn: jenkins: tweak log permissions [puppet] - 10https://gerrit.wikimedia.org/r/347435 (owner: 10Hashar)
[19:35:46] <hashar>	 mutante: I am not sure what happened with that jenkins log patch :/
[19:36:01] <hashar>	 the spec test should have caught it, but then I am too lazy to do the full investigation :D
[19:36:43] <mutante>	 hashar: yea, don't worry about it :) not worth it, it was just some dependency issue. i also just reverted because i didn't want to investigate it on Friday 
[19:36:55] <hashar>	 yeah makes sense :]
[19:37:17] <wikibugs__>	 (03CR) 10Dzahn: [C: 032] jenkins: tweak log permissions [puppet] - 10https://gerrit.wikimedia.org/r/347435 (owner: 10Hashar)
[19:38:10] <mutante>	 Notice: /Stage[main]/Jenkins/Systemd::Syslog[jenkins]/File[/var/log/jenkins/jenkins.log]/group: group changed 'jenkins' to 'wikidev'
[19:38:18] <mutante>	 Notice: Finished catalog run in 14.10 seconds
[19:38:21] <mutante>	 hashar: all done
[19:38:26] <gehel>	 !log starting logstash upgrade - some log messages will be lost! - T161908
[19:38:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:35] <stashbot>	 T161908: ELK 5.x deployment plan - https://phabricator.wikimedia.org/T161908
[19:38:48] <gehel>	 !log disabling puppet on logstash1* - T161908
[19:38:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:57] <hashar>	 mutante: hurrah!
[19:39:14] <wikibugs__>	 (03CR) 10Dzahn: "yep, all done. changed permissions and no problem this time." [puppet] - 10https://gerrit.wikimedia.org/r/347435 (owner: 10Hashar)
[19:42:17] <wikibugs__>	 (03CR) 10Dzahn: [C: 031] "@Giuseppe what do you think? .erb templates here for Apache config? per "so we can add beta suffixes later"? seems good to me" [puppet] - 10https://gerrit.wikimedia.org/r/322602 (https://phabricator.wikimedia.org/T1256) (owner: 10Alex Monk)
[19:44:15] <hashar>	 mutante: I found out the issue :]  and it is a bit more complicated bah
[19:44:36] <mutante>	 hashar: couldn't resist?:)  of course it is, hehe
[19:46:22] <ottomata>	 cmjohnson1:  ah!
[19:46:35] <ottomata>	 i think the partman is failing because the flex bay is showing up as sdb instead of sda
[19:46:36] <ottomata>	 on 1068
[19:46:45] <ottomata>	 i just manually partitioned, and that's what happened when it came up
[19:46:47] <ottomata>	 not good really
[19:47:14] <cmjohnson1>	 that makes sense but very odd that only that one would have that issue
[19:47:22] <ottomata>	 ya
[19:47:42] <ottomata>	 wait
[19:47:42] <ottomata>	 hang on
[19:48:09] <ottomata>	 i take that back
[19:48:10] <wikibugs__>	 (03PS1) 10Hashar: jenkins: also fix permissions for access.log [puppet] - 10https://gerrit.wikimedia.org/r/347437
[19:48:16] <ottomata>	 i was looking at the wrong terminal.  i'm looking at 1064
[19:48:32] <ottomata>	 it partitioned, but has sdb as the flex bay?!
[19:49:35] <hashar>	 mutante: yeah could not resist. I am running the new one via the puppet compiler
[19:50:08] <icinga-wm>	 PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[19:51:18] <andrewbogott>	 !log upgrading qemu and oslo packages on labvirt1002
[19:51:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:08] <icinga-wm>	 RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[19:54:11] <ottomata>	 i'm going to reboot analytics1064
[19:55:44] <wikibugs>	 (03PS8) 10Dzahn: lists: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/346923
[19:56:57] <mutante>	 hashar: base = /var/log    filename = access.log     but it's /var/log/jenkins/access.log  is the jenkins part really implied there?
[19:58:04] <wikibugs__>	 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and set up analyics1058-1069 - https://phabricator.wikimedia.org/T162216#3169539 (10Ottomata) Ok!  All but 2 of the nodes are up and running as Hadoop worker nodes.   analytics1064 doesn't seem to be able to contact puppetmaster1001:  ``` puppet agent -t Er...
[19:58:25] <ottomata>	 cmjohnson1:  phew, ok, after reboot sda is now the flex bays
[19:58:29] <ottomata>	 on 1064
[19:58:32] <ottomata>	 it still can't talk to puppet though
[19:59:32] <wikibugs__>	 (03CR) 10Dzahn: [C: 031] lists: convert to role/profile structure (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/346923 (owner: 10Dzahn)
[20:00:05] <jouncebot>	 gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170410T2000). Please do the needful.
[20:01:55] <subbu>	 no parsoid deploy today
[20:02:11] <wikibugs__>	 (03PS9) 10Dzahn: lists: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/346923
[20:02:23] <wikibugs>	 (03CR) 10Dzahn: [C: 031] lists: convert to role/profile structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346923 (owner: 10Dzahn)
[20:03:44] <Amir1>	 none for ores
[20:04:06] <halfak>	 :) 
[20:04:12] <wikibugs>	 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and set up analyics1058-1069 - https://phabricator.wikimedia.org/T162216#3169543 (10Ottomata)
[20:04:33] <wikibugs__>	 (03PS1) 10Legoktm: Deploy Linter to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347439 (https://phabricator.wikimedia.org/T148609)
[20:04:41] <ottomata>	 cmjohnson1:  can I assume you installed these in the racks in the order you listed?
[20:04:53] <ottomata>	 1058 in A1, 1059 and a1060 in A2, etc.
[20:04:53] <ottomata>	 ?
[20:05:12] <cmjohnson1>	 yes
[20:05:29] <cmjohnson1>	 they're in racktables
[20:07:34] <SMalyshev>	 question: we have production proxy e.g. webproxy.eqiad.wmnet. Do we log requests to these proxies?
[20:07:41] <wikibugs>	 (03PS15) 10Gehel: Upgrade logstash to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) (owner: 10EBernhardson)
[20:07:48] <SMalyshev>	 i.e. requests going *through* these proxies
[20:08:02] <wikibugs>	 (03PS10) 10Dzahn: lists: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/346923
[20:08:10] <wikibugs__>	 (03CR) 10Jforrester: [C: 031] "\o/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347439 (https://phabricator.wikimedia.org/T148609) (owner: 10Legoktm)
[20:08:12] <mutante>	 SMalyshev: yes, i think we do
[20:08:30] <SMalyshev>	 mutante: do you know where these logs are and in which form?
[20:08:52] <mutante>	 SMalyshev: install1002:/var/log/squid3/access.log  
[20:08:58] <mutante>	 for eqiad
[20:09:03] <wikibugs>	 (03CR) 10Gehel: [C: 032] Upgrade logstash to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) (owner: 10EBernhardson)
[20:09:12] <mutante>	 SMalyshev: what are you looking for?
[20:09:14] <SMalyshev>	 mutante: aha, thanks. any analysis done on them currently?
[20:09:28] <mutante>	 SMalyshev: i don't think there is any analysis, no
[20:09:42] <SMalyshev>	 mutante: I'd like to see (probably not now, but in the future) how sparql federation is being used
[20:10:46] <mutante>	 SMalyshev: yea, so it's on the install servers, 1002 and 2002 respectively, and access logs are there and rotated. looks like we have 10 days
[20:10:49] <logmsgbot>	 !log bsitzmann@tin Started deploy [mobileapps/deploy@9bc8c07]: Update mobileapps to 1695900
[20:10:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:59] <mutante>	 should be no problem i guess
[20:11:42] <SMalyshev>	 mutante: 10 days is kinda short but I wonder if it'll be possible to run some kind of script maybe... I'll think about it.
[20:12:10] <SMalyshev>	 for now I wanted to know whether we have any logs at all, now that I know they're there I'll think how to use them
[20:12:30] <mutante>	 SMalyshev: alright, *nod*
[20:12:41] <SMalyshev>	 mutante: thanks for the info :)
[20:12:51] <mutante>	 yw
[20:15:52] <wikibugs__>	 (03CR) 10Dzahn: [C: 04-1] "this will create /var/log/jenkins_access_log/access.log next to existing /var/log/jenkins.access.log. That is not intended is it?  Apparen" [puppet] - 10https://gerrit.wikimedia.org/r/347437 (owner: 10Hashar)
[20:16:17] <logmsgbot>	 !log bsitzmann@tin Finished deploy [mobileapps/deploy@9bc8c07]: Update mobileapps to 1695900 (duration: 05m 27s)
[20:16:19] <wikibugs>	 (03PS2) 10Hashar: jenkins: also fix permissions for access.log [puppet] - 10https://gerrit.wikimedia.org/r/347437
[20:16:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:16:39] <wikibugs>	 (03PS1) 10Ottomata: Add new hadoop workers to hadoop/net-topology.py.erb Bug: T152713 [puppet] - 10https://gerrit.wikimedia.org/r/347447 (https://phabricator.wikimedia.org/T152713)
[20:17:06] <hashar>	 mutante: yeah it is plain wrong :D
[20:17:42] <wikibugs__>	 (03CR) 10EBernhardson: [V: 032 C: 032] "deploying logstash 5.x" [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/344704 (owner: 10EBernhardson)
[20:19:10] <mutante>	 ok, and in my comment itself it also is :)
[20:19:22] <mutante>	  /var/log/jenkins/access.log is it :p
[20:19:45] <mutante>	 just use "jenkins" as resource name i guess
[20:20:06] <hashar>	 https://puppet-compiler.wmflabs.org/6080/contint1001.wikimedia.org/ !
[20:20:13] <wikibugs__>	 (03CR) 10Ottomata: [C: 032] Add new hadoop workers to hadoop/net-topology.py.erb Bug: T152713 [puppet] - 10https://gerrit.wikimedia.org/r/347447 (https://phabricator.wikimedia.org/T152713) (owner: 10Ottomata)
[20:20:17] <wikibugs>	 (03PS2) 10Ottomata: Add new hadoop workers to hadoop/net-topology.py.erb Bug: T152713 [puppet] - 10https://gerrit.wikimedia.org/r/347447 (https://phabricator.wikimedia.org/T152713)
[20:20:22] <wikibugs__>	 (03CR) 10Ottomata: [V: 032 C: 032] Add new hadoop workers to hadoop/net-topology.py.erb Bug: T152713 [puppet] - 10https://gerrit.wikimedia.org/r/347447 (https://phabricator.wikimedia.org/T152713) (owner: 10Ottomata)
[20:20:28] <wikibugs>	 (03CR) 10Hashar: "Ran it via the compiler https://puppet-compiler.wmflabs.org/6080/contint1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/347437 (owner: 10Hashar)
[20:20:31] <wikibugs__>	 (03PS1) 10Mobrovac: [WIP] RESTBase: Switch to Scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/347452
[20:20:53] <Zppix>	 I'm correct in saying there's no MediaWiki train the week of April 17th, right?
[20:21:05] <hashar>	 mutante: i think the last iteration would work :]
[20:21:22] <hashar>	 Zppix: https://wikitech.wikimedia.org/wiki/Deployments is the reference
[20:21:38] <hashar>	 Zppix: and yeah that week is frozen
[20:21:54] <Zppix>	 hashar:  i know about deployments page i was just confirming, and thanks
[20:24:15] <wikibugs__>	 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Make WDQS active / active - https://phabricator.wikimedia.org/T162111#3169608 (10Gehel) Using the following curl to test, I don't see an entry in the nginx access log: ``` curl 'https://query.wikidata.org/bigdata/namespace/wdq/s...
[20:25:15] <wikibugs__>	 (03CR) 10Dzahn: [C: 032] "yep, looks good now" [puppet] - 10https://gerrit.wikimedia.org/r/347437 (owner: 10Hashar)
[20:25:32] <wikibugs__>	 (03PS3) 10Dzahn: jenkins: also fix permissions for access.log [puppet] - 10https://gerrit.wikimedia.org/r/347437 (owner: 10Hashar)
[20:25:46] <hashar>	 mutante: the reason is that I had a before systemd::syslog, which happens to generate a File resource for the log directory
[20:26:07] <hashar>	 which led to a cycle because I defined two files in that same dir but with conflicting dependencies
[20:26:08] <hashar>	 bah
[20:26:10] <mutante>	 hashar: yea, i noticed it uses the resource name as part of the path
[20:26:22] <mutante>	 aha, yea, duplicate def
[20:26:36] <mutante>	 gotcha
[20:27:45] <ebernhardson>	 !log deployed new logstash plugins to logstash100[123]
[20:27:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:33] <wikibugs>	 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Make WDQS active / active - https://phabricator.wikimedia.org/T162111#3169636 (10Gehel) I'm not sure the change is effective. While I do see a few requests (outside of pyball / icinga) in the nginx logs on the wdqs codwf servers, I don't see a...
[20:28:53] <ottomata>	 hm, cmjohnson1 what about analytics1069
[20:28:58] <ottomata>	 is it around somewhere?
[20:29:36] <cmjohnson1>	 sort of
[20:29:44] <ottomata>	 oh?  hah
[20:29:53] <cmjohnson1>	 long story
[20:29:59] <gehel>	 !log upgrading logstash on logstash1001 - T161908
[20:30:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:06] <stashbot>	 T161908: ELK 5.x deployment plan - https://phabricator.wikimedia.org/T161908
[20:31:33] <hashar>	 mutante: submit ?  ( https://gerrit.wikimedia.org/r/347437 ) :D
[20:31:58] <mutante>	 i did once i could
[20:32:04] <mutante>	 Notice: /Stage[main]/Jenkins/File[/var/log/jenkins/access.log]/group: group changed 'jenkins' to 'wikidev'
[20:32:15] <wikibugs__>	 (03CR) 10Dzahn: "Notice: /Stage[main]/Jenkins/File[/var/log/jenkins/access.log]/group: group changed 'jenkins' to 'wikidev'" [puppet] - 10https://gerrit.wikimedia.org/r/347437 (owner: 10Hashar)
[20:32:32] <hashar>	 hurrah
[20:32:48] <icinga-wm>	 RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0]
[20:33:30] <wikibugs>	 (03CR) 10Dzahn: [C: 031] "http://puppet-compiler.wmflabs.org/6079/fermium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/346923 (owner: 10Dzahn)
[20:33:35] <mutante>	 hashar: :)
[20:36:37] <wikibugs__>	 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Make WDQS active / active - https://phabricator.wikimedia.org/T162111#3169658 (10Smalyshev) @Gehel: you can check x-served-by headers in the responses - half of those should have codfw hosts there now.
[20:42:08] <icinga-wm>	 PROBLEM - logstash process on logstash1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash
[20:45:27] <wikibugs__>	 (03PS1) 10Kaldari: Adding pageassessments.dblist for maintanence script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347468
[20:45:56] <wikibugs__>	 (03PS2) 10Kaldari: Adding pageassessments.dblist for maintanence script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347468
[20:46:55] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Adding pageassessments.dblist for maintanence script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347468 (owner: 10Kaldari)
[20:47:10] <Revent>	 godog: Ping?
[20:47:25] <Revent>	 Serious weirdness, and you are on call.
[20:47:32] <Revent>	 https://commons.wikimedia.org/w/index.php?title=Commons:Village_pump&diff=240380222&oldid=240367865
[20:48:09] <icinga-wm>	 RECOVERY - logstash process on logstash1001 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash
[20:48:12] <Revent>	 Logged as “Undo revision 239878186 by Revent (talk (A/OS)) WTF?”, but that revision was not by me.
[20:49:06] <Revent>	 Oh wait, nevermind…
[20:49:53] <wikibugs__>	 (03PS2) 10Mobrovac: [WIP] RESTBase: Switch to Scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/347452
[20:51:41] <wikibugs__>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] RESTBase: Switch to Scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/347452 (owner: 10Mobrovac)
[20:54:27] <wikibugs>	 (03PS3) 10Mobrovac: [WIP] RESTBase: Switch to Scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/347452
[20:55:55] <wikibugs__>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] RESTBase: Switch to Scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/347452 (owner: 10Mobrovac)
[20:56:07] <wikibugs>	 (03PS3) 10Kaldari: Adding pageassessments.dblist for maintanence script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347468 (https://phabricator.wikimedia.org/T159438)
[20:56:55] <wikibugs__>	 (03PS4) 10Kaldari: Adding pageassessments.dblist for maintanence script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347468 (https://phabricator.wikimedia.org/T159438)
[20:58:35] <wikibugs__>	 (03PS1) 10EBernhardson: Update de_dot plugin for logstash 5.x [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/347488
[20:59:01] <wikibugs>	 (03PS2) 10EBernhardson: Update de_dot plugin for logstash 5.x [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/347488
[21:00:05] <jouncebot>	 dapatrick, bawolff, and Reedy: Dear anthropoid, the time has come. Please deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170410T2100).
[21:00:09] <wikibugs>	 (03PS4) 10Mobrovac: [WIP] RESTBase: Switch to Scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/347452
[21:01:04] <wikibugs>	 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Make WDQS active / active - https://phabricator.wikimedia.org/T162111#3169714 (10Gehel) grafana dashboard was wrongly filtering on eqiad only (that's why I did not see any traffic there). More tests and checking x-cache and x-served-by headers...
[21:01:40] <wikibugs__>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] RESTBase: Switch to Scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/347452 (owner: 10Mobrovac)
[21:01:48] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[21:06:02] <wikibugs__>	 (03PS3) 10EBernhardson: Update de_dot plugin for logstash 5.x [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/347488
[21:07:36] <wikibugs__>	 (03CR) 10BryanDavis: [C: 032] Update de_dot plugin for logstash 5.x [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/347488 (owner: 10EBernhardson)
[21:08:54] <wikibugs__>	 (03CR) 10BryanDavis: [V: 032 C: 032] Update de_dot plugin for logstash 5.x [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/347488 (owner: 10EBernhardson)
[21:13:51] <gehel>	 !log running puppet on logstash1001 to deploy new logstash plugins - T161908
[21:13:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:59] <stashbot>	 T161908: ELK 5.x deployment plan - https://phabricator.wikimedia.org/T161908
[21:15:49] <wikibugs__>	 (03CR) 10Krinkle: [C: 031] Wikitech: Don't try to do cross-wiki uploads to Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347422 (https://phabricator.wikimedia.org/T162374) (owner: 10Jforrester)
[21:16:20] <wikibugs>	 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3169755 (10RobH) It seems that these were in the initial Dasher orders where Intel disks were Dasher supported, not HP.  @Papaul: Can you provi...
[21:17:40] <gehel>	 !log logstash upgrade on logstash1001 completed - T161908
[21:17:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:15] <gehel>	 !log upgrading logstash on logstash1002 - T161908
[21:22:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:22] <stashbot>	 T161908: ELK 5.x deployment plan - https://phabricator.wikimedia.org/T161908
[21:22:37] <wikibugs__>	 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3169778 (10Papaul) Drive Model ATA INTEL SSDSC2BB80
[21:26:08] <icinga-wm>	 PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[install-logstash-plugins]
[21:27:08] <icinga-wm>	 RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[21:29:11] <wikibugs__>	 (03CR) 10Dzahn: [C: 031] mw_rc_irc: convert to profile/role structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346926 (owner: 10Dzahn)
[21:29:14] <wikibugs>	 06Operations, 10Wikimedia-Site-requests, 07CSS, 07Russian-Sites: Add unique class for 'migration warningbox' on Watchlist page in Russian Wikipedia - https://phabricator.wikimedia.org/T162639#3169785 (10Iniquity)
[21:29:21] <wikibugs__>	 (03PS2) 10Dzahn: mw_rc_irc: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/346926
[21:31:26] <gehel>	 !log upgrading logstash on logstash1003 - T161908
[21:31:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:34] <stashbot>	 T161908: ELK 5.x deployment plan - https://phabricator.wikimedia.org/T161908
[21:31:50] <wikibugs__>	 (03CR) 10Dzahn: [C: 031] "http://puppet-compiler.wmflabs.org/6086/" [puppet] - 10https://gerrit.wikimedia.org/r/346926 (owner: 10Dzahn)
[21:32:28] <icinga-wm>	 PROBLEM - logstash process on logstash1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash
[21:33:28] <icinga-wm>	 RECOVERY - logstash process on logstash1003 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash
[21:33:44] <gehel>	 !log logstash upgrade on all logstash1* nodes completed - T161908
[21:33:48] <icinga-wm>	 PROBLEM - puppet last run on logstash1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Service[logstash]
[21:33:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:48] <icinga-wm>	 RECOVERY - puppet last run on logstash1003 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[21:36:02] <wikibugs>	 (03PS5) 10Mobrovac: [WIP] RESTBase: Switch to Scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/347452
[21:36:05] <wikibugs__>	 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3169816 (10Papaul) @RobH I was able to pull the information from the HW diagnostic i did last week please see below for information   Disk 1 SC...
[21:39:00] <wikibugs__>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] RESTBase: Switch to Scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/347452 (owner: 10Mobrovac)
[21:42:32] <wikibugs>	 (03PS1) 10Volans: Disable puppet: add videoscalers [switchdc] - 10https://gerrit.wikimedia.org/r/347511 (https://phabricator.wikimedia.org/T160178)
[21:45:29] <wikibugs>	 (03CR) 10Volans: "I know they are already included by the previous selection, but I think it's more explicit and in case we'll split them in the future." [switchdc] - 10https://gerrit.wikimedia.org/r/347511 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[21:47:07] <wikibugs__>	 (03PS3) 10Andrew Bogott: fullstack:  Switch to the normal jessie image [puppet] - 10https://gerrit.wikimedia.org/r/346984
[21:47:09] <wikibugs>	 (03PS1) 10Andrew Bogott: Remove the striker-users admin group [puppet] - 10https://gerrit.wikimedia.org/r/347513
[21:49:12] <wikibugs__>	 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3169847 (10RobH) This lists two SSDs, which one is the failed one?
[21:49:28] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] Remove the striker-users admin group [puppet] - 10https://gerrit.wikimedia.org/r/347513 (owner: 10Andrew Bogott)
[21:50:18] <icinga-wm>	 RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[21:51:50] <wikibugs__>	 (03PS13) 10Andrew Bogott: wmfkeystonehooks:  Create project page on wikitech on project creation [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091)
[21:51:52] <wikibugs>	 (03PS4) 10Andrew Bogott: Keystonehooks:  Add two more ldap ous for sudo handling. [puppet] - 10https://gerrit.wikimedia.org/r/346880
[21:52:19] <wikibugs__>	 (03CR) 10Andrew Bogott: [C: 032] fullstack:  Switch to the normal jessie image [puppet] - 10https://gerrit.wikimedia.org/r/346984 (owner: 10Andrew Bogott)
[21:52:47] <wikibugs__>	 (03CR) 10jerkins-bot: [V: 04-1] Keystonehooks:  Add two more ldap ous for sudo handling. [puppet] - 10https://gerrit.wikimedia.org/r/346880 (owner: 10Andrew Bogott)
[21:53:48] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK
[21:56:23] <wikibugs__>	 (03PS6) 10Mobrovac: [WIP] RESTBase: Switch to Scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/347452
[21:56:39] <wikibugs>	 (03PS5) 10Andrew Bogott: Keystonehooks:  Add two more ldap ous for sudo handling. [puppet] - 10https://gerrit.wikimedia.org/r/346880
[21:58:32] <wikibugs__>	 (03CR) 10Andrew Bogott: [C: 032] Keystonehooks:  Add two more ldap ous for sudo handling. [puppet] - 10https://gerrit.wikimedia.org/r/346880 (owner: 10Andrew Bogott)
[22:09:25] <wikibugs__>	 (03PS14) 10Andrew Bogott: wmfkeystonehooks:  Create project page on wikitech on project creation [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091)
[22:10:10] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wmfkeystonehooks:  Create project page on wikitech on project creation [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) (owner: 10Andrew Bogott)
[22:11:35] <wikibugs>	 (03PS7) 10Mobrovac: RESTBase: Switch to Scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/347452 (https://phabricator.wikimedia.org/T116335)
[22:12:54] <wikibugs>	 (03PS2) 10Thcipriani: Scap: update version to 3.5.5-1 [puppet] - 10https://gerrit.wikimedia.org/r/346579 (https://phabricator.wikimedia.org/T127762)
[22:15:05] <wikibugs__>	 (03PS15) 10Andrew Bogott: wmfkeystonehooks:  Create project page on wikitech on project creation [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091)
[22:16:25] <wikibugs__>	 (03CR) 10Dzahn: deployment::server: convert to profile/role (pt. 1) (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/344728 (owner: 10Dzahn)
[22:18:31] <wikibugs__>	 06Operations, 10Wikimedia-Site-requests, 07CSS, 07Russian-Sites: Add unique class for 'migration warningbox' on Watchlist page in Russian Wikipedia - https://phabricator.wikimedia.org/T162639#3169914 (10Iniquity) 05Open>03Invalid Sorry. Figured out.
[22:20:13] <wikibugs>	 (03PS1) 10Jdlrobson: Remove deprecated config option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347521
[22:20:44] <wikibugs>	 (03PS1) 10Andrew Bogott: keystone.conf:  Whitespace cleanups [puppet] - 10https://gerrit.wikimedia.org/r/347522
[22:22:26] <wikibugs__>	 (03CR) 10Andrew Bogott: [C: 032] keystone.conf:  Whitespace cleanups [puppet] - 10https://gerrit.wikimedia.org/r/347522 (owner: 10Andrew Bogott)
[22:24:46] <wikibugs>	 (03PS16) 10Andrew Bogott: wmfkeystonehooks:  Create project page on wikitech on project creation [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091)
[22:28:57] <wikibugs>	 (03PS10) 10BBlack: dnsrecursor: 4.x backport and edns-client-subnet [puppet] - 10https://gerrit.wikimedia.org/r/346937
[22:28:58] <wikibugs__>	 (03PS7) 10BBlack: dnsrecursor: update to backports for transition [puppet] - 10https://gerrit.wikimedia.org/r/346980
[22:31:10] <wikibugs__>	 (03PS4) 10Dzahn: deployment::server: convert to profile/role [puppet] - 10https://gerrit.wikimedia.org/r/344728
[22:31:42] <wikibugs__>	 (03CR) 10BBlack: [C: 031] "For PS10, I re-worked the config diff to net zero lines for trusty dnsrecursor installs, avoiding unnecessary service restarts on those ho" [puppet] - 10https://gerrit.wikimedia.org/r/346937 (owner: 10BBlack)
[22:33:00] <wikibugs__>	 (03CR) 10jerkins-bot: [V: 04-1] deployment::server: convert to profile/role [puppet] - 10https://gerrit.wikimedia.org/r/344728 (owner: 10Dzahn)
[22:41:14] <wikibugs__>	 (03PS5) 10Dzahn: deployment::server: convert to profile/role [puppet] - 10https://gerrit.wikimedia.org/r/344728
[22:42:56] <wikibugs>	 (03Draft1) 10Paladox: irc echo: Convert to base::service class to maintain the script [puppet] - 10https://gerrit.wikimedia.org/r/347518
[22:42:59] <wikibugs__>	 (03PS2) 10Paladox: irc echo: Convert to base::service class to maintain the script [puppet] - 10https://gerrit.wikimedia.org/r/347518
[22:43:10] <paladox>	 mutante ^^ :)
[22:43:12] <wikibugs>	 (03PS3) 10Paladox: irc echo: Convert to base::service class to maintain the script [puppet] - 10https://gerrit.wikimedia.org/r/347518
[22:45:09] <dapatrick>	 !log Deployed patch for T162621 to wmf18 and wmf19
[22:45:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:45:32] <wikibugs__>	 (03PS1) 10Catrope: Add ORES thresholds for fawiki, ruwiki, trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347526
[22:47:03] <wikibugs__>	 (03PS6) 10Dzahn: deployment::server: convert to profile/role [puppet] - 10https://gerrit.wikimedia.org/r/344728
[22:58:00] <mutante>	 paladox: compiling 
[22:58:17] <paladox>	 ok
[22:58:18] <paladox>	 thanks
[23:00:05] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170410T2300). Please do the needful.
[23:00:05] <jouncebot>	 RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:32] <RoanKattouw>	 I think I'm the only one, so I'll deploy my patch myself
[23:00:45] <wikibugs>	 (CR) Catrope: [C: 2] Add ORES thresholds for fawiki, ruwiki, trwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/347526 (owner: Catrope)
[23:03:32] <wikibugs>	 (CR) BBlack: [C: 2] dnsrecursor: 4.x backport and edns-client-subnet [puppet] - https://gerrit.wikimedia.org/r/346937 (owner: BBlack)
[23:03:38] <wikibugs__>	 (CR) BBlack: [C: 2] dnsrecursor: update to backports for transition [puppet] - https://gerrit.wikimedia.org/r/346980 (owner: BBlack)
[23:04:03] <wikibugs__>	 (Merged) jenkins-bot: Add ORES thresholds for fawiki, ruwiki, trwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/347526 (owner: Catrope)
[23:04:11] <bblack>	 !log puppet disabled on jessie recdns (maerlant, nescio, hydrogen, chromium) for complex upgrade process ( https://gerrit.wikimedia.org/r/#/c/346937/ )
[23:04:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:04:17] <wikibugs>	 (CR) jenkins-bot: Add ORES thresholds for fawiki, ruwiki, trwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/347526 (owner: Catrope)
[23:09:54] <wikibugs__>	 (PS1) BBlack: ulsfo lvs: prefer codfw dns [puppet] - https://gerrit.wikimedia.org/r/347531
[23:09:56] <wikibugs>	 (PS1) BBlack: eqiad lvs: do not directly use hydrogen temporarily [puppet] - https://gerrit.wikimedia.org/r/347532
[23:10:17] <wikibugs__>	 (CR) BBlack: [V: 2 C: 2] ulsfo lvs: prefer codfw dns [puppet] - https://gerrit.wikimedia.org/r/347531 (owner: BBlack)
[23:14:31] <wikibugs__>	 (CR) BBlack: [C: 2] eqiad lvs: do not directly use hydrogen temporarily [puppet] - https://gerrit.wikimedia.org/r/347532 (owner: BBlack)
[23:17:23] <wikibugs>	 (PS1) Volans: Logging: filter out all cumin's messages from stderr [switchdc] - https://gerrit.wikimedia.org/r/347534 (https://phabricator.wikimedia.org/T160178)
[23:18:53] <logmsgbot>	 !log bblack@neodymium conftool action : set/pooled=no; selector: name=hydrogen.wikimedia.org,service=pdns_recursor
[23:19:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:21:18] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:18] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:28] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:28] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:28] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:28] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:38] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:38] <icinga-wm>	 PROBLEM - eventlogging-service-eventbus endpoints health on kafka1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:38] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:38] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:38] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:52] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on eventbus.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:21:58] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - eventbus_8085 - Could not depool server kafka1002.eqiad.wmnet because of too many down!
[23:21:58] <icinga-wm>	 PROBLEM - eventlogging-service-eventbus endpoints health on kafka1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:58] <icinga-wm>	 PROBLEM - eventlogging-service-eventbus endpoints health on kafka1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:58] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - eventbus_8085 - Could not depool server kafka1002.eqiad.wmnet because of too many down!
[23:22:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:22:51] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on eventbus.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2247 bytes in 4.661 second response time
[23:22:52] <bblack>	 I don't think that's me
[23:22:56] <robh>	 hrmm
[23:23:09] <bblack>	 the timing is suspicious though, so I've paused what I'm working on
[23:23:18] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy
[23:23:20] <robh>	 over aggressive monitoring perhaps?
[23:23:28] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy
[23:23:29] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy
[23:23:38] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[23:23:38] <icinga-wm>	 RECOVERY - eventlogging-service-eventbus endpoints health on kafka1001 is OK: All endpoints are healthy
[23:23:38] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy
[23:23:38] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[23:23:38] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
[23:23:48] <icinga-wm>	 RECOVERY - eventlogging-service-eventbus endpoints health on kafka1003 is OK: All endpoints are healthy
[23:23:48] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[23:23:48] <icinga-wm>	 RECOVERY - eventlogging-service-eventbus endpoints health on kafka1002 is OK: All endpoints are healthy
[23:23:53] <bblack>	 I haven't stopped anything yet
[23:24:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[23:24:12] <bblack>	 (also the things I'm doing definitely couldn't impact codfw along with eqiad all at the same time)
[23:24:18] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[23:24:18] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[23:24:24] <wikibugs>	 Operations, ops-codfw, DC-Ops, Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3170063 (RobH) Update from IRC:  Papaul wasn't sure which SSD failed, he just pulled both.  He'll place one of the two back in and run the di...
[23:24:28] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[23:24:28] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy
[23:24:28] <icinga-wm>	 PROBLEM - puppet last run on db1070 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:25:11] <chasemp>	 What happened?  Eventbus paged flapping
[23:25:16] <ottomata>	 got it too
[23:25:17] <ottomata>	 dunno
[23:25:28] <ottomata>	 looks like maybe a more general scb problem based on other pages too?
[23:25:28] <chasemp>	 ah, just saw backscroll
[23:25:30] <ottomata>	 eventbus looks ok
[23:25:34] <bblack>	 I think eventbus just flapped on its own, but it's also in the midst of me working on recdns
[23:25:40] <logmsgbot>	 !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Set ORES thresholds for fawiki, ruwiki, trwiki (duration: 00m 39s)
[23:25:45] <bblack>	 I haven't stopped any recdns service though, just doing the prerequisite prep work
[23:25:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:59] <ottomata>	 don't have backscroll, just logged on
[23:26:09] <bblack>	 the only way they could be related, I think, is if when I depooled hydrogen, chromium can't handle normal eqiad recdns load
[23:26:16] <bblack>	 but it doesn't look overloaded, either
[23:26:58] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:27:16] <ottomata>	 did all scb services flap?
[23:28:24] <ottomata>	 hm, dunno, everything looks ok.  i'll check back later
[23:28:33] <bblack>	 I see eventlogging, mobileapps, restbase, 
[23:28:38] <bblack>	 I don't know if that's all or not
[23:28:48] <bblack>	 it's odd that it's scb in both datacenters at the same time, though
[23:29:04] <ottomata>	 that is odd
[23:29:15] <ottomata>	 eventbus processes are still running fine
[23:29:25] <ottomata>	 at least on one box
[23:32:29] <bblack>	 in any case, they all recovered, and I'm continuing in my steps
[23:32:46] <bblack>	 (I still don't think there's any relation between my recdns and whatever happened there with eventbus, afaics)
[23:33:24] <bblack>	 !log upgrading hydrogen to pdns-recursor 4.x
[23:33:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:34:31] <logmsgbot>	 !log bblack@neodymium conftool action : set/pooled=yes; selector: name=hydrogen.wikimedia.org,service=pdns_recursor
[23:34:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:52] <mutante>	 paladox: i cant compile changes on einsteinium:  http://puppet-compiler.wmflabs.org/6093/einsteinium.wikimedia.org/change.einsteinium.wikimedia.org.err
[23:36:19] <wikibugs__>	 Operations, HHVM, Patch-For-Review, Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3170099 (Krinkle)
[23:36:19] <paladox>	 oh
[23:36:21] <mutante>	 notices "Warning: Scope(Class[Base::Standard_packages]): os_version(): obsolete distribution check in ubuntu >= precise"  too, unrelatedly
[23:36:21] <wikibugs>	 Operations, MediaWiki-Internationalization, HHVM, MW-1.29-release (WMF-deploy-2017-03-21_(1.29.0-wmf.17)), and 3 others: Uninitialized string offset warnings with HHVM 3.18 in LanguageAz.php and LanguageKk.php - https://phabricator.wikimedia.org/T161095#3170096 (Krinkle) Open>Resolved a:...
[23:36:33] <wikibugs>	 (PS1) BBlack: Revert "eqiad lvs: do not directly use hydrogen temporarily" [puppet] - https://gerrit.wikimedia.org/r/347536
[23:36:34] <paladox>	 mutante those failures look unrelated to my patch.
[23:36:45] <wikibugs__>	 (CR) BBlack: [V: 2 C: 2] Revert "eqiad lvs: do not directly use hydrogen temporarily" [puppet] - https://gerrit.wikimedia.org/r/347536 (owner: BBlack)
[23:36:57] <mutante>	 paladox: i know, they are unrelated. but it means i cant test your change
[23:37:10] <paladox>	 ok
[23:37:19] <mutante>	 i am just telling you that compiling it failed (for other reasons)
[23:37:44] <paladox>	 ok
[23:43:22] <mutante>	 hrhr, of course i get "Error: Could not find class ::role::backup::host"
[23:43:59] <bblack>	 !log jessie-recdns: upgrade to pdns-recursor 4.x paused - hydrogen updated and in-service; chromium/nescio/maerlant still puppet-disabled.  Going to leave things in this state for a while.  If something seems amiss, hydrogen can be re-depooled via confctl: confctl select name=hydrogen.wikimedia.org,service=pdns_recursor set/pooled=no
[23:44:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:46:11] <wikibugs__>	 (PS7) Dzahn: deployment::server: convert to profile/role [puppet] - https://gerrit.wikimedia.org/r/344728
[23:46:56] <wikibugs>	 (PS4) Dzahn: ircecho: Convert to base::service class to maintain the script [puppet] - https://gerrit.wikimedia.org/r/347518 (owner: Paladox)
[23:48:21] <wikibugs>	 (CR) Dzahn: [C: -1] "http://puppet-compiler.wmflabs.org/6095/" [puppet] - https://gerrit.wikimedia.org/r/342777 (owner: Dzahn)
[23:52:29] <icinga-wm>	 RECOVERY - puppet last run on db1070 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[23:55:32] <wikibugs__>	 (PS4) Dzahn: mediawiki::maintenance: convert to profile/role (WIP) [puppet] - https://gerrit.wikimedia.org/r/342777
[23:55:46] <wikibugs>	 (CR) jerkins-bot: [V: -1] mediawiki::maintenance: convert to profile/role (WIP) [puppet] - https://gerrit.wikimedia.org/r/342777 (owner: Dzahn)
[23:57:23] <wikibugs__>	 (PS5) Dzahn: mediawiki::maintenance: convert to profile/role (WIP) [puppet] - https://gerrit.wikimedia.org/r/342777
[23:59:47] <wikibugs__>	 (PS1) Tim Starling: Use EtcdConfig [mediawiki-config] - https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924)