[01:29:48] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 83745.669558 Seconds [01:29:48] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 83747.662735 Seconds [01:29:58] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 83753.859079 Seconds [01:30:28] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 83708.853332 Seconds [01:30:28] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 83708.900511 Seconds [01:30:28] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 83713.735369 Seconds [01:39:28] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [01:42:28] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 84433.751575 Seconds [01:51:48] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds [01:51:48] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 38.904991 Seconds [01:51:58] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 44.959266 Seconds [01:52:28] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [01:52:28] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [01:52:28] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 4.929204 Seconds [02:01:48] PROBLEM - puppet last run on mw1183 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [02:23:54] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.18) (duration: 08m 17s) [02:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:29:48] RECOVERY - puppet last run on mw1183 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [02:41:58] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 1807.696802 Seconds [02:42:48] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 1858.927911 Seconds [02:42:58] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [02:43:26] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.19) (duration: 07m 32s) [02:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:45:48] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds [02:45:58] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 2047.673058 Seconds [02:47:28] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 2010.687699 Seconds [02:48:48] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 2218.95296 Seconds [02:48:48] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 2221.06471 Seconds [02:49:07] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Apr 10 02:49:06 UTC 2017 (duration 5m 40s) [02:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:49:58] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [02:50:48] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [02:51:29] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [02:51:48] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep 
Delay is: 0.0 Seconds [03:17:28] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:24:59] (03PS4) 10Mobrovac: RESTBase: Migrate to Scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/346248 (https://phabricator.wikimedia.org/T116335) [03:37:06] (03CR) 10Chad: [C: 04-1] "I'm not a fan of this approach." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/347189 (owner: 10Paladox) [03:40:24] (03CR) 10Chad: [C: 04-1] "Identical to my comments on I22f902372db79abec006b01f5590a087b67641d7" [puppet] - 10https://gerrit.wikimedia.org/r/347188 (owner: 10Paladox) [03:46:28] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [03:56:28] PROBLEM - puppet last run on wtp1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:08:18] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1758.20 Read Requests/Sec=5308.10 Write Requests/Sec=293.40 KBytes Read/Sec=22951.20 KBytes_Written/Sec=4159.20 [04:18:18] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=65.00 Read Requests/Sec=5.30 Write Requests/Sec=10.60 KBytes Read/Sec=86.40 KBytes_Written/Sec=302.00 [04:23:38] PROBLEM - puppet last run on aqs1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:24:48] RECOVERY - puppet last run on wtp1016 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [04:33:48] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 639 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3075262 keys, up 17 days 12 hours - replication_delay is 639 [04:51:38] RECOVERY - puppet last run on aqs1005 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [04:52:48] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3059919 keys, up 17 days 12 hours - replication_delay is 4 [05:03:08] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 617 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3059919 keys, up 17 days 12 hours - replication_delay is 617 [05:17:08] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3060018 keys, up 17 days 13 hours - replication_delay is 0 [05:57:21] (03PS1) 10Marostegui: db-eqiad.php: Depool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347316 (https://phabricator.wikimedia.org/T160390) [05:58:07] (03PS2) 10Marostegui: db-codfw.php: Add tempdb2001 to x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346948 (https://phabricator.wikimedia.org/T162290) [05:58:39] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347316 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [05:59:57] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347316 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [06:00:08] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1034 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/347316 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [06:00:58] (03CR) 10Marostegui: [C: 032] db-codfw.php: Add tempdb2001 to x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346948 (https://phabricator.wikimedia.org/T162290) (owner: 10Marostegui) [06:01:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1034 - T160390 (duration: 00m 39s) [06:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:12] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [06:01:21] (03PS7) 10Marostegui: mariadb: Disable trx, sync_binlog for tempdb2001 [puppet] - 10https://gerrit.wikimedia.org/r/346773 (https://phabricator.wikimedia.org/T162290) [06:02:15] (03Merged) 10jenkins-bot: db-codfw.php: Add tempdb2001 to x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346948 (https://phabricator.wikimedia.org/T162290) (owner: 10Marostegui) [06:02:24] (03CR) 10jenkins-bot: db-codfw.php: Add tempdb2001 to x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346948 (https://phabricator.wikimedia.org/T162290) (owner: 10Marostegui) [06:03:30] (03CR) 10Marostegui: [C: 032] mariadb: Disable trx, sync_binlog for tempdb2001 [puppet] - 10https://gerrit.wikimedia.org/r/346773 (https://phabricator.wikimedia.org/T162290) (owner: 10Marostegui) [06:03:48] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add tempdb2001 to x1 as a slave - T162290 (duration: 00m 38s) [06:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:55] T162290: setup tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290 [06:05:39] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3167374 (10Marostegui) [06:05:41] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: setup tempdb2001(WMF6407) 
- https://phabricator.wikimedia.org/T162290#3167372 (10Marostegui) 05Open>03Resolved Hi, I have merged the patch to add it as a slave in codfw and the puppet temporary change to get it with sync_binlog=0 and innodb_flush... [06:07:19] !log Deploy schema change db1034 (s7) - T160390 [06:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:27] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [06:08:43] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347317 [06:13:33] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347317 (owner: 10Marostegui) [06:14:38] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347317 (owner: 10Marostegui) [06:14:51] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1034" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347317 (owner: 10Marostegui) [06:15:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1034 - T160390 (duration: 00m 38s) [06:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:15:35] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [06:18:14] (03PS1) 10Marostegui: db-eqiad.php: Depool db1028 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347318 (https://phabricator.wikimedia.org/T160390) [06:22:37] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1028 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347318 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [06:23:42] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1028 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347318 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [06:23:55] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1028 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/347318 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [06:24:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1028 - T160390 (duration: 00m 39s) [06:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:43] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [06:24:48] !log Deploy schema change db1028 (s7) - T160390 [06:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:27] !log Deploy schema change labsdb1001 (s7) - T160390 [06:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:48] PROBLEM - puppet last run on labvirt1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:28:38] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:44:10] (03Abandoned) 10Giuseppe Lavagetto: Initial stub of a dry_run functionality [switchdc] - 10https://gerrit.wikimedia.org/r/346953 (owner: 10Giuseppe Lavagetto) [06:46:34] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "Use profile::gerrit::server in labs instead of the role." [puppet] - 10https://gerrit.wikimedia.org/r/347189 (owner: 10Paladox) [06:47:00] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "If so, we should create two separate roles, one including the backup profile and one that doesn't, or define the correct hiera data." [puppet] - 10https://gerrit.wikimedia.org/r/347188 (owner: 10Paladox) [06:49:48] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:51:04] (03PS1) 10Marostegui: x1.hosts: Swap positions db2033 and tempdb2001 [software] - 10https://gerrit.wikimedia.org/r/347319 [06:54:03] (03CR) 10Marostegui: [C: 032] x1.hosts: Swap positions db2033 and tempdb2001 [software] - 10https://gerrit.wikimedia.org/r/347319 (owner: 10Marostegui) [06:54:48] RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:55:03] (03Merged) 10jenkins-bot: x1.hosts: Swap positions db2033 and tempdb2001 [software] - 10https://gerrit.wikimedia.org/r/347319 (owner: 10Marostegui) [06:57:38] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:59:46] (03PS1) 10Marostegui: x1.hosts: Remove dbstore2002 from x1 [software] - 10https://gerrit.wikimedia.org/r/347321 [07:01:03] (03CR) 10Marostegui: [C: 032] x1.hosts: Remove dbstore2002 from x1 [software] - 10https://gerrit.wikimedia.org/r/347321 (owner: 10Marostegui) [07:01:51] (03Merged) 10jenkins-bot: x1.hosts: Remove dbstore2002 from x1 [software] - 10https://gerrit.wikimedia.org/r/347321 (owner: 10Marostegui) [07:02:05] !log installing pam updates from jessie point update [07:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:37] (03CR) 10Giuseppe Lavagetto: [C: 031] Remove role::backup::config [puppet] - 10https://gerrit.wikimedia.org/r/347184 (owner: 10Alexandros Kosiaris) [07:10:13] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Good job, a small detail missing" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346926 (owner: 10Dzahn) [07:10:59] (03CR) 10Giuseppe Lavagetto: [C: 031] mediawiki::maintenance: convert to profile/role (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/342777 (owner: 10Dzahn) [07:14:48] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Very good work but let's try to avoid if $realm conditionals" (033 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/344728 (owner: 10Dzahn) [07:18:48] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [07:20:49] (03PS11) 10Matthias Mullie: Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [07:28:48] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [07:37:38] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:37:50] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "I didn't finish reading the patch, but what I saw this far is enough to make me not like it:" (0310 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/347180 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [07:45:48] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [07:46:38] PROBLEM - puppet last run on mw1294 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:48:06] 06Operations, 06Office-IT, 07LDAP: Make disabled accounts visible in the corp mirror LDAP replica - https://phabricator.wikimedia.org/T160158#3167544 (10MoritzMuehlenhoff) > I do not think this is the issue because GCDS does not generate LDAP data and it does not change our LDAP data. The GCDS just queries o... 
[07:48:18] <_joe_> !log testing a dry-run of the switchdc software on sarin [07:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:29] (03PS4) 10Hashar: wmflib: switch to puppetlabs_spec_helper/rake_tasks [puppet] - 10https://gerrit.wikimedia.org/r/332475 [07:50:31] (03PS12) 10Hashar: wmflib: basic spec for os_version() [puppet] - 10https://gerrit.wikimedia.org/r/178810 [07:50:48] PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [07:52:42] (03CR) 10Hashar: [C: 031] "Thank you to have integrated my older change to raise an exception whenever lsb variables are missing. That saves a lot of headhaches :)" [puppet] - 10https://gerrit.wikimedia.org/r/346673 (owner: 10Faidon Liambotis) [07:53:48] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [08:04:37] (03PS1) 10Urbanecm: Create editprotected right on ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347330 (https://phabricator.wikimedia.org/T162577) [08:06:38] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [08:08:07] (03PS1) 10Giuseppe Lavagetto: Fix redis directory lookup [switchdc] - 10https://gerrit.wikimedia.org/r/347331 [08:08:48] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [08:10:02] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix redis directory lookup [switchdc] - 10https://gerrit.wikimedia.org/r/347331 (owner: 10Giuseppe Lavagetto) [08:16:38] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [08:17:20] (03CR) 10Volans: "General reply. 
With the current very incoherent logging it's impossible to run in dry-run mode and be able to print a human-comprehensible" (032 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/347180 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [08:21:58] PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:27:38] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:37:48] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [08:39:36] !log manual failover of Hadoop master daemons from analyitics1001 to analytics1002 (T160333) [08:39:38] PROBLEM - puppet last run on thumbor1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:44] T160333: Reimage the Hadoop Cluster to Debian Jessie - https://phabricator.wikimedia.org/T160333 [08:43:19] !log rolling restart of maps-test cluster [08:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:40] !log reimage elastic2020 - T149006 [08:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:47] T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006 [08:49:39] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3167656 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2020.codfw.wmnet'... 
[08:51:05] !log swift codfw-prod: bump ms-be2028 ms-be2039 object weight to 4000 - T158337 [08:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:12] T158337: codfw: ms-be2028-ms-be2039 rack/setup - https://phabricator.wikimedia.org/T158337 [08:51:58] RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [08:52:34] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2739503 (10Marostegui) This looks similar to: https://phabricator.wikimedia.org/T149553 Which took us quite some time to debug, but in the end... [08:52:49] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=wdqs [08:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:38] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 878389 msg (=800000 warning): ocg_render_job_queue 3027 msg (=3000 critical) [08:54:48] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 878440 msg (=800000 warning): ocg_render_job_queue 3037 msg (=3000 critical) [08:54:48] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 878442 msg (=800000 warning): ocg_render_job_queue 3034 msg (=3000 critical) [08:55:38] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [08:59:38] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 879692 msg (=800000 warning): ocg_render_job_queue 3017 msg (=3000 critical) [08:59:48] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 879746 msg (=800000 warning): ocg_render_job_queue 3033 msg (=3000 critical) [08:59:48] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 879754 msg (=800000 warning): ocg_render_job_queue 3040 msg (=3000 critical) 
[09:02:37] tons of jobs submitted? https://grafana.wikimedia.org/dashboard/db/ocg?orgId=1 [09:02:49] not even sure what is the status of OCG nowadays [09:06:21] <_joe_> ahahahh [09:06:28] <_joe_> elukey: it's active! [09:06:43] <_joe_> elukey: also, we're the only ones who should care, apparently [09:07:17] <_joe_> elukey: tbh, I think something is wrong [09:07:38] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 881849 msg (=800000 warning): ocg_render_job_queue 3015 msg (=3000 critical) [09:07:38] RECOVERY - puppet last run on thumbor1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [09:07:48] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 881892 msg (=800000 warning): ocg_render_job_queue 3017 msg (=3000 critical) [09:07:48] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 881896 msg (=800000 warning): ocg_render_job_queue 3018 msg (=3000 critical) [09:08:05] <_joe_> ok, 2k jobs/minute seems serious enough [09:12:18] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 49, down: 5, dormant: 0, excluded: 0, unused: 0BRxe-0/1/0: down - Transit: TeliaEU (IC-316335, donated) {#A0010239} [10Gbps]BRxe-0/1/3: down - Core: cr2-eqiad:xe-4/1/3 (Level3, BDFS2448, 84ms) {#A0010621} [10Gbps wave]BRxe-0/1/1: down - Peering: AMS-IX (EvoSwitch SMF.2-9/ab) {#SMF3836} [10Gbps DF]BRae2: down - AMS-IXBRxe-0/1/2: d [09:13:08] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/1/3: down - Core: cr2-esams:xe-0/1/3 (Level3, BDFS2448, 84ms) {#2013} [10Gbps wave]BR [09:13:41] XioNoX: is this you? ^^^ [09:14:09] volans: it might be maintenance [09:14:16] only one port down from L3 [09:14:46] elukey: I know it might, but better to SAL it if it is... ;) [09:15:22] elukey, volans, seems like all the 10G interface on that device are down. 
The DC haven't confirmed they started the work, but they might be [09:15:32] okok :) [09:15:38] great! :/ [09:15:44] yeah, all are down, so I think they are unplugging the links to replace the card [09:15:48] PROBLEM - Host cr2-esams is DOWN: CRITICAL - Network Unreachable (91.198.174.244) [09:15:56] feel free to SAL the maintenance then XioNoX :) [09:16:10] (here with !log ... T12345) [09:16:10] T12345: Create "annotation" namespace on Hebrew Wikisource - https://phabricator.wikimedia.org/T12345 [09:16:28] !log rolling restart of maps2* cluster [09:16:30] lol stashbot, 12345 was just an example :D [09:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:09] !log remote hands work started to replace the FPC on cr2-esams T162239 [09:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:16] T162239: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239 [09:17:20] 06Operations, 10ops-codfw, 06Performance-Team: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3167703 (10fgiunchedi) a:05fgiunchedi>03Papaul thanks @robh! I've archived and copied `/var/lib/coal` to `/mnt/sde` and umounted the usb drive. @Papaul you can plu... 
[09:18:13] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2001.codfw.wmnet [09:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:08] PROBLEM - Host cr2-esams IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:862:ffff::3) [09:22:15] expected spike of 503s when the router went off: https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&orgId=1 [09:24:48] RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK [09:25:06] 06Operations, 07HHVM, 13Patch-For-Review, 07Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3167727 (10MoritzMuehlenhoff) [09:27:59] 06Operations, 07HHVM, 07Upstream: HHVM segfault in memory cleanup - https://phabricator.wikimedia.org/T162586#3167730 (10MoritzMuehlenhoff) [09:29:36] PROBLEM - puppet last run on db1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:29:47] 06Operations, 10Traffic: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#3167754 (10Aklapper) Feel free to bring up any further discussion topics on the talk page of https://meta.wikimedia.org/wiki/Sustainability_Initiative which is the centralized place. [09:29:55] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2001.codfw.wmnet [09:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:16] PROBLEM - Check systemd state on elastic2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:31:26] PROBLEM - Elasticsearch HTTPS on elastic2020 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2020.codfw.wmnet [09:32:08] elastic2020 is me... reimaged for testing, silencing... 
- T149006 [09:32:09] T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006 [09:33:18] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2002.codfw.wmnet [09:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:26] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [09:39:36] RECOVERY - Host cr2-esams is UP: PING OK - Packet loss = 0%, RTA = 84.23 ms [09:40:16] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0 [09:41:36] welcome back cr2-esams :) [09:41:46] RECOVERY - Host cr2-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.90 ms [09:43:46] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [09:44:28] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2002.codfw.wmnet [09:44:29] !log all interfaces back up on cr2-esams, BGP sessions up as well T162239 [09:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:38] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2003.codfw.wmnet [09:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:41] T162239: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239 [09:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:42] (03CR) 10Alexandros Kosiaris: [C: 032] Remove role::backup::config [puppet] - 10https://gerrit.wikimedia.org/r/347184 (owner: 10Alexandros Kosiaris) [09:48:46] (03PS2) 10Alexandros Kosiaris: Remove role::backup::config [puppet] - 10https://gerrit.wikimedia.org/r/347184 [09:48:52] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Remove 
role::backup::config [puppet] - 10https://gerrit.wikimedia.org/r/347184 (owner: 10Alexandros Kosiaris) [09:52:37] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2003.codfw.wmnet [09:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:50] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2004.codfw.wmnet [09:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:04] (03CR) 10Alexandros Kosiaris: [C: 04-2] "Just to point out that Giuseppe's -2 is justified based on" [puppet] - 10https://gerrit.wikimedia.org/r/347188 (owner: 10Paladox) [09:53:13] (03CR) 10Alexandros Kosiaris: [C: 04-2] "Just to point out that Giuseppe's -2 is justified based on" [puppet] - 10https://gerrit.wikimedia.org/r/347189 (owner: 10Paladox) [09:57:36] RECOVERY - puppet last run on db1049 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [09:57:58] (03CR) 10Alexandros Kosiaris: [C: 031] standardize "include ::base:*" [puppet] - 10https://gerrit.wikimedia.org/r/347024 (owner: 10Dzahn) [09:59:48] from https://grafana.wikimedia.org/dashboard/db/ocg it seems that the pending jobs are going down [09:59:52] as FYI [10:01:09] !log rolling restart of maps1* (eqiad) cluster [10:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:12] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2004.codfw.wmnet [10:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:25] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1001.eqiad.wmnet [10:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:54] (03CR) 10Alexandros Kosiaris: [C: 04-1] lists: convert to role/profile structure (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/346923 (owner: 10Dzahn) [10:10:46] PROBLEM - Check HHVM 
threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [10:11:54] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1001.eqiad.wmnet [10:11:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] changeprop: Add an ores_uris parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/345826 (https://phabricator.wikimedia.org/T159615) (owner: 10Alexandros Kosiaris) [10:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:02] (03PS3) 10Alexandros Kosiaris: changeprop: Add an ores_uris parameter [puppet] - 10https://gerrit.wikimedia.org/r/345826 (https://phabricator.wikimedia.org/T159615) [10:12:03] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1002.eqiad.wmnet [10:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:57] (03PS4) 10Alexandros Kosiaris: changeprop: Add an ores_uris parameter [puppet] - 10https://gerrit.wikimedia.org/r/345826 (https://phabricator.wikimedia.org/T159615) [10:16:59] (03PS2) 10Alexandros Kosiaris: changeprop: Remove the ores_uri parameter [puppet] - 10https://gerrit.wikimedia.org/r/345827 (https://phabricator.wikimedia.org/T159615) [10:17:21] (03CR) 10Alexandros Kosiaris: [C: 032] etherpad: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/346921 (owner: 10Dzahn) [10:17:26] (03PS5) 10Alexandros Kosiaris: etherpad: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/346921 (owner: 10Dzahn) [10:17:33] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] etherpad: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/346921 (owner: 10Dzahn) [10:19:49] <_joe_> /win 19 [10:23:16] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1002.eqiad.wmnet [10:23:23] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: 
name=maps1003.eqiad.wmnet [10:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:50] Operations, Labs: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3167816 (akosiaris) >>! In T162462#3164242, @faidon wrote: > My suggestion, which needs a little more time to be fully tested is: > - Take the latest 3.8 jessie-backport (from snaps... [10:27:36] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:29:06] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:31:59] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1003.eqiad.wmnet [10:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:06] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1004.eqiad.wmnet [10:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:12] (CR) Alexandros Kosiaris: [C: 2] apertium-spa: New upstream release [debs/contenttranslation/apertium-spa] - https://gerrit.wikimedia.org/r/346748 (https://phabricator.wikimedia.org/T161511) (owner: KartikMistry) [10:32:14] (CR) Alexandros Kosiaris: [C: 2] apertium-cat: New upstream release [debs/contenttranslation/apertium-cat] - https://gerrit.wikimedia.org/r/346696 (https://phabricator.wikimedia.org/T161511) (owner: KartikMistry) [10:32:26] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.003 second response time [10:33:26] https://grafana-admin.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=videoscaler&var-instance=mw1169&from=now-1h&to=now [10:34:01] so mw1169 is a videoscaler, the load
spiked to 20 (max hhvm threads configured) and then it started queueing for a bit [10:34:06] (PS1) Giuseppe Lavagetto: Actually select parsoid hosts when doing a rolling restart [switchdc] - https://gerrit.wikimedia.org/r/347342 [10:34:08] (PS1) Giuseppe Lavagetto: Send output of systemctl to /dev/null [switchdc] - https://gerrit.wikimedia.org/r/347343 [10:34:10] (PS1) Giuseppe Lavagetto: Verify jobrunner is stopped on the videoscalers as well [switchdc] - https://gerrit.wikimedia.org/r/347344 [10:34:12] (PS1) Giuseppe Lavagetto: Use pgrep -c as we don't care about the output [switchdc] - https://gerrit.wikimedia.org/r/347345 [10:37:21] (CR) Paladox: "@Alexandros Kosiaris and @Giuseppe Lavagetto hi, the phabricator class has not been converted to profile yet. So this means that the new r" [puppet] - https://gerrit.wikimedia.org/r/347188 (owner: Paladox) [10:38:47] RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK [10:40:46] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [10:41:36] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [10:41:36] (CR) Volans: [C: 2] Actually select parsoid hosts when doing a rolling restart [switchdc] - https://gerrit.wikimedia.org/r/347342 (owner: Giuseppe Lavagetto) [10:43:18] (CR) Paladox: "@Giuseppe Lavagetto using just the profile class will break. Because backup::set is in the profile class. Even though it says != undef tha" [puppet] - https://gerrit.wikimedia.org/r/347189 (owner: Paladox) [10:43:55] _joe_ Hi, see ^^ [10:44:19] I carn't use profile as backup::set is used there which will then lead to a carn't find that error.
[10:45:23] (CR) Volans: [C: 2] Send output of systemctl to /dev/null [switchdc] - https://gerrit.wikimedia.org/r/347343 (owner: Giuseppe Lavagetto) [10:45:53] <_joe_> paladox: then we need to fix that profile to work when no backup is available, I guess [10:46:09] <_joe_> paladox: but then your patch is wrong too [10:46:24] Oh. [10:46:44] <_joe_> paladox: meaning your patch won't work either [10:46:51] It works for me [10:47:03] tested on puppetmaster [10:47:25] <_joe_> oh you also patched the profile [10:47:36] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:47:38] <_joe_> ok, so we just need your patch to fix the profile [10:48:00] Oh i havent patched profile [10:48:03] <_joe_> but I would not use the if $::realm != 'labs' there [10:48:07] I meant i tested ^^ [10:48:11] (CR) Volans: [C: 2] Verify jobrunner is stopped on the videoscalers as well [switchdc] - https://gerrit.wikimedia.org/r/347344 (owner: Giuseppe Lavagetto) [10:48:12] <_joe_> https://gerrit.wikimedia.org/r/#/c/347189/5/modules/profile/manifests/gerrit/server.pp [10:48:12] oh [10:48:43] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1004.eqiad.wmnet [10:48:43] _joe_ but as we doint set default now [10:48:44] $bacula = hiera('gerrit::server::bacula'), [10:48:46] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:56] doing gerrit::server::bacula: undef [10:49:02] will not work in hiera [10:49:02] (CR) Daniel Kinzler: [C: 1] "agree with intent" [puppet] - https://gerrit.wikimedia.org/r/347234 (https://phabricator.wikimedia.org/T155103) (owner: Hoo man) [10:49:11] i tryed doing that [10:49:36] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 2368.673321 Seconds [10:49:46] in
https://wikitech.wikimedia.org/wiki/Hiera:Git [10:49:59] i had to resort to doing "gerrit::server::bacula": srv-gerrit-git [10:50:14] <_joe_> paladox: gerrit::server::bacula: false maybe? [10:50:22] Hmm, let me try that [10:51:36] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [10:52:08] <_joe_> so that conditional will evaluate to false [10:54:34] _joe_ that dosent seem to work [10:54:34] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Puppet::Parser::AST::Resource failed with error ArgumentError: No title provided and "backup::set" is not a valid resource reference at /etc/puppet/modules/profile/manifests/gerrit/server.pp:62 on node gerrit-test3.git.eqiad.wmflabs [10:54:34] Warning: Not using cache on failed catalog [10:54:35] Error: Could not retrieve catalog; skipping run [10:54:43] "gerrit::server::bacula": false [10:54:46] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [10:55:43] <_joe_> paladox: sorry, I can't really work on this now, but my suggestion is to modify profile::gerrit::server to make it easy to include/exclude the backup [10:56:02] ok [10:56:15] i doint think it will be easy unless we can set defaults. 
[10:56:28] it's not easy [10:56:48] the entire backup thing where there are conflicting requirements and no sane defaults is not easy [10:57:06] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [10:57:10] The class worked before we moved it to profile [10:57:18] by conflicting requirements I mean, production wants to have it enabled, labs does not but no sane defaults exist [10:57:27] cause it was an ugly hack by me [10:57:47] it was pretending labs wanted backups [10:57:52] which was not true [10:58:02] oh [10:58:06] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [10:58:42] !log starting load test on elstic2020 - T149006 [10:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:50] T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006 [10:59:09] paladox: let me have a look at how to fix that, maybe I can think of something [10:59:21] Ok thanks :) [11:00:28] akosiaris changing != undef to != false seemed to work. 
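Editorial note on the exchange above: the 10:54:34 error and the 11:00:28 fix are consistent with a guard that compares the Hiera value against the wrong sentinel. The following is a toy Python model of that logic, not the real Puppet manifest. The function names are hypothetical, `None` stands in for Puppet's `undef`, and the assumption that the profile declares `backup::set` with the Hiera value as its title is inferred from the error message only.

```python
# Toy model of the guard discussed in the log; NOT the real Puppet code.
# Assumption: the profile does roughly
#     if $bacula != <sentinel> { backup::set { $bacula: } }
UNDEF = None  # stands in for Puppet's undef


def declare_backup_set(bacula):
    """Mimic declaring backup::set: the title must be a non-empty string."""
    if not isinstance(bacula, str) or not bacula:
        raise ValueError('No title provided for "backup::set"')
    return f"backup::set[{bacula}]"


def compile_profile(bacula, sentinel):
    """Declare the backup set only when the value differs from the sentinel."""
    if bacula != sentinel:
        return declare_backup_set(bacula)
    return None  # backups skipped


if __name__ == "__main__":
    # Hiera value false, guard "!= false": backups are skipped.
    print(compile_profile(False, False))
    # Hiera value set to a backup set name: the resource is declared.
    print(compile_profile("srv-gerrit-git", False))
    # Hiera value false, guard "!= undef": the guard passes and the
    # declaration fails, similar in spirit to the 10:54:34 error.
    try:
        compile_profile(False, UNDEF)
    except ValueError as err:
        print("error:", err)
```

In this model a Hiera value of `false` slips past a `!= undef` guard and reaches the resource declaration without a usable title, while a `!= false` guard skips the declaration. Whether the production manifest behaves exactly this way is not confirmed by the log.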
[11:00:36] !log upload apertium-cat_2.0.0~r77286-1+wmf1, apertium-spa_1.0.0~r77293-1+wmf1 on apt.wikmedia.org/jessie-wikimedia [11:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:18] Puppet has bad caching [11:01:19] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-spa-cat] - https://gerrit.wikimedia.org/r/346780 (https://phabricator.wikimedia.org/T161511) (owner: KartikMistry) [11:01:36] setting gerrit::server::bacula to true still passes puppet [11:01:36] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 3088.644616 Seconds [11:01:38] (CR) Volans: [C: 2] Use pgrep -c as we don't care about the output [switchdc] - https://gerrit.wikimedia.org/r/347345 (owner: Giuseppe Lavagetto) [11:01:56] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [11:02:37] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [11:03:40] Operations, Labs: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3167853 (akosiaris) I 'll do the first task from above (get the 3.8 bpo and put it in jessie-wikimedia/main) and possibly do the puppet upgrade on the jessie hosts. [11:05:36] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 3328.674771 Seconds [11:05:45] paladox: yeah, I 'll take a look later when my queue of TODOs is a bit emptier.
I think some more architectural changes make sense [11:06:02] ok [11:06:03] thanks [11:09:36] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [11:10:44] Operations, Traffic, Wikimedia-Logstash, Patch-For-Review: Investigate 502 errors from nginx when backend returns 302 - https://phabricator.wikimedia.org/T161819#3167863 (fgiunchedi) p:Triage>Normal [11:11:27] Operations, Commons, Traffic, Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517#3167864 (fgiunchedi) p:Triage>Normal [11:11:46] !log upload puppet_3.8.5-2~bpo8+1 on apt.wikimedia.org jessie-wikimedia/main [11:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:43] (PS1) KartikMistry: Rename apertium-es-ca -> apertium-spa-cat [puppet] - https://gerrit.wikimedia.org/r/347351 (https://phabricator.wikimedia.org/T161511) [11:17:32] akosiaris hi, trying to upgrade puppet on the puppetmaster results in puppet : Breaks: facter (< 2.4.0~) but 2.2.0-1 is to be installed [11:18:37] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 4108.550043 Seconds [11:20:04] Jessie backport has 2.4 https://packages.debian.org/jessie-backports/facter but not sure how to install that. [11:21:36] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [11:24:28] (CR) Faidon Liambotis: "Interesting! I wasn't aware of this old patch. Minor nit: shouldn't ruby code follow a 2-space rather than 4-space indentation?" [puppet] - https://gerrit.wikimedia.org/r/178810 (owner: Hashar) [11:24:58] hashar: ^ [11:25:15] paravoid: hello!
[11:25:30] yeah that is an old bit rotting changes I have kept in my Gerrit attic for a while [11:25:37] for indentation, I dont mind changing to 2 spaces [11:25:57] dunno, I think that's what we follow elsewhere too [11:26:03] I guess it depends whether ones apply the 4 space convention we have in puppet or the 2 spaces from ruby :D [11:26:17] let me adjust [11:27:36] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 4648.532983 Seconds [11:27:58] (PS13) Hashar: wmflib: basic spec for os_version() [puppet] - https://gerrit.wikimedia.org/r/178810 [11:28:29] (CR) Hashar: "Changed indentation to two spaces to be consistent with other ruby files :] Thanks Faidon!" [puppet] - https://gerrit.wikimedia.org/r/178810 (owner: Hashar) [11:28:36] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [11:28:47] the tests themselves are rather lame but should give a good enough coverage [11:29:04] then wmflib has some other specs failing :/ [11:29:07] (CR) Faidon Liambotis: [C: 2] wmflib: basic spec for os_version() [puppet] - https://gerrit.wikimedia.org/r/178810 (owner: Hashar) [11:29:41] there is also a parent change that replaces the modules/wmflib/Rakefile to use puppetlabs_spec_helper [11:30:00] I saw [11:30:45] I'm not much of a spec/rake expert, akosiaris is the person most familiar with that I think [11:31:06] ah right, I have added him as a reviewer [11:31:32] (PS1) Elukey: Add JVM options tunables for Yarn RM and Hadoop DN/NN [puppet/cdh] - https://gerrit.wikimedia.org/r/347353 (https://phabricator.wikimedia.org/T159219) [11:31:36] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 4888.471615 Seconds [11:34:36] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [11:37:36] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:40:36] PROBLEM - Postgres Replication
Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 5428.662417 Seconds [11:41:36] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [11:41:37] jouncebot: refresh [11:41:40] I refreshed my knowledge about deployments. [11:41:45] jouncebot: next [11:41:45] In 1 hour(s) and 18 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170410T1300) [11:42:39] (PS2) Hashar: Increase default thumb size to 250px at nowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/334787 (https://phabricator.wikimedia.org/T155892) (owner: Urbanecm) [11:43:11] (CR) Hashar: [C: 1] "That has been cleared out:" [mediawiki-config] - https://gerrit.wikimedia.org/r/334787 (https://phabricator.wikimedia.org/T155892) (owner: Urbanecm) [11:43:26] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [11:43:44] (CR) Hashar: [C: 1] Create editprotected right on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/347330 (https://phabricator.wikimedia.org/T162577) (owner: Urbanecm) [11:43:46] (PS1) Giuseppe Lavagetto: Add a free-form 'any' type [software/conftool] - https://gerrit.wikimedia.org/r/347356 (https://phabricator.wikimedia.org/T156924) [11:46:36] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 5788.605667 Seconds [11:47:36] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [11:50:36] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 6028.629504 Seconds [11:50:47] <_joe_> sigh [11:53:56] PROBLEM - puppet last run on es2016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): Package[tzdata] [11:56:36] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [11:59:24] hashar: may I ask for earlier start (if possible now) of SWAT? It [11:59:27] It [11:59:30] Ehh [11:59:36] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 6568.621692 Seconds [11:59:41] Itll help me :) [12:00:09] Urbanecm_: yup certainly [12:00:22] Thanks! [12:01:06] we can surely do [config] 347330 Create editprotected right on ptwikinews [12:01:24] the one to raise thumb size I can handle it with ops during the regular window [12:01:43] (PS2) BBlack: tlsproxy: double the response buffer size [puppet] - https://gerrit.wikimedia.org/r/345767 (https://phabricator.wikimedia.org/T161819) [12:02:19] Urbanecm_: and maybe we should move the swat window an hour earlier ( 2 PM CEST ) [12:02:43] So you wont need me? I must go to train and didn [12:02:45] t expect it [12:03:00] Urbanecm_: yeah catch your train!! it is more important :] [12:03:00] hashar: you mean permanent moving? [12:03:15] I will figure out how to verify the ptwikinews [12:03:32] and the thumb size bump looks easy as well [12:03:54] (CR) BBlack: [C: 2] tlsproxy: double the response buffer size [puppet] - https://gerrit.wikimedia.org/r/345767 (https://phabricator.wikimedia.org/T161819) (owner: BBlack) [12:04:02] I could reach out to people participating in the european swat and see whetehr an hour earlier would be better [12:04:09] Oka [12:04:11] y [12:04:14] currently it is at 3pm, right in the middle of the afternoon [12:04:30] Regarding ptwikines: I think verifying Special:usergrouprights should do [12:04:31] now I guess, rush to your train!!! :] [12:05:24] Nope. I cant be here during the regular window but can be here now. [12:06:41] Urbanecm_: I will handle them both dont worry :] [12:06:56] and will make sure to !log to the relevant tasks [12:07:03] thanks for all the patches !
[12:07:04] The departure time is in an hour (so a few minutes after the window starts). [12:07:19] You're welcome [12:07:36] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [12:07:50] (PS2) BBlack: VCL: do not allow empty url when un-proxying [puppet] - https://gerrit.wikimedia.org/r/339648 [12:08:09] (CR) BBlack: [V: 2 C: 2] VCL: do not allow empty url when un-proxying [puppet] - https://gerrit.wikimedia.org/r/339648 (owner: BBlack) [12:10:37] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 7228.538775 Seconds [12:10:55] paladox: https://phabricator.wikimedia.org/T162462 [12:11:13] ok thanks [12:14:42] (PS1) Giuseppe Lavagetto: conftool: add mwconfig object type, define the first couple variables [puppet] - https://gerrit.wikimedia.org/r/347360 [12:18:16] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 1811.999282 Seconds [12:18:46] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 1842.711178 Seconds [12:18:56] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 1856.25057 Seconds [12:20:36] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [12:21:51] Operations, Labs: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3168039 (akosiaris) The package has been uploaded on jessie-wikimedia/backports as of a while ago and some basic checks seem to be fine. For what is worth puppet-master is a new p...
[12:22:06] RECOVERY - puppet last run on es2016 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [12:22:31] (PS3) Gehel: postgresql - introduce the check-postgres package for postgres monitoring [puppet] - https://gerrit.wikimedia.org/r/346962 (https://phabricator.wikimedia.org/T162345) [12:23:36] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 8008.488649 Seconds [12:25:29] (CR) Gehel: [C: 2] postgresql - introduce the check-postgres package for postgres monitoring [puppet] - https://gerrit.wikimedia.org/r/346962 (https://phabricator.wikimedia.org/T162345) (owner: Gehel) [12:29:33] (Abandoned) Ema: cp1008: override cache::route_table [puppet] - https://gerrit.wikimedia.org/r/346733 (owner: Ema) [12:29:46] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:30:36] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:33:46] PROBLEM - puppet last run on notebook1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tshark] [12:34:04] (CR) Alexandros Kosiaris: [C: 2] apertium-spa-cat: New upstream release [debs/contenttranslation/apertium-spa-cat] - https://gerrit.wikimedia.org/r/346780 (https://phabricator.wikimedia.org/T161511) (owner: KartikMistry) [12:36:21] (PS9) BBlack: dnsrecursor: 4.x backport and edns-client-subnet [puppet] - https://gerrit.wikimedia.org/r/346937 [12:36:23] (PS6) BBlack: dnsrecursor: update to backports for transition [puppet] - https://gerrit.wikimedia.org/r/346980 [12:36:44] the 5xx spike above seems to be mostly due to phabricator ^ [12:37:16] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [12:37:57] it's the source of the 500 on misc I guess?
[12:38:04] yep [12:38:07] there's also some other odd small bumps [12:38:11] e.g. 501 on upload [12:38:36] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:38:37] RECOVERY - Postgres Replication Lag on maps1002 is OK: (OK - Rep Delay is:, 0.0, Seconds, [12:38:43] (PS1) Gehel: postgresql - replication lag needs to use --output=simple [puppet] - https://gerrit.wikimedia.org/r/347364 [12:38:46] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:39:36] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:40:16] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 3132.134867 Seconds [12:40:39] !log upload apertium-spa-cat_2.0.0~r77288-2+wmf1 on apt.wikimedia.org jessie-wikimedia/main [12:40:45] kart_: all 3 uploaded [12:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:53] I've captured one of the phab 500s: https://phabricator.wikimedia.org/P5231 [12:41:01] akosiaris: nice! [12:41:31] akosiaris: need to rename package, but that's not urgent. do let me know when it is merged. [12:41:53] akosiaris: we may need to restart apertium-apy service. [12:43:02] kart_: you mean https://gerrit.wikimedia.org/r/347351 ? [12:43:08] I can merge it now [12:43:18] Yes. [12:43:22] Thanks!
[12:43:28] (PS2) Alexandros Kosiaris: Rename apertium-es-ca -> apertium-spa-cat [puppet] - https://gerrit.wikimedia.org/r/347351 (https://phabricator.wikimedia.org/T161511) (owner: KartikMistry) [12:43:34] (CR) Alexandros Kosiaris: [V: 2 C: 2] Rename apertium-es-ca -> apertium-spa-cat [puppet] - https://gerrit.wikimedia.org/r/347351 (https://phabricator.wikimedia.org/T161511) (owner: KartikMistry) [12:45:01] Operations, Traffic, Wikimedia-Logstash, Patch-For-Review: Investigate 502 errors from nginx when backend returns 302 - https://phabricator.wikimedia.org/T161819#3168117 (BBlack) Open>Resolved a:BBlack Merge above should fix this, at least for this case and any others on our cache ter... [12:46:24] looking at the phab exception in a browser, the response to URL: https://phabricator.wikimedia.org/source/RevisionSlider/history/wmf%252F1.29.0-wmf.13/jsduck.json [12:46:30] is a 500 error which says in the html text: [12:46:36] Unhandled Exception ("DiffusionRefNotFoundException") [12:46:36] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: (CRITICAL - Rep Delay is:, 9388.562921, Seconds, [12:46:37] Ref "wmf/1.29.0-wmf.13" does not exist in this repository. [12:46:51] I've also just got this in my phab activity feed: [12:46:53] PhabricatorClusterStrandedException: Unable to establish a connection to any database host (while trying "phabricator_file"). All masters and replicas are completely unreachable.
[12:47:30] (it seems odd to me that an unfound thing throws an unhandled exception that becomes a 500, instead of some kind of caught exception that results in 4xx) [12:50:36] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:51:46] RECOVERY - Postgres Replication Lag on maps2002 is OK: (OK - Rep Delay is:, 0.0, Seconds, [12:52:36] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:53:33] jouncebot next [12:53:33] In 0 hour(s) and 6 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170410T1300) [12:54:24] (CR) Alexandros Kosiaris: [C: 2] wmflib: switch to puppetlabs_spec_helper/rake_tasks [puppet] - https://gerrit.wikimedia.org/r/332475 (owner: Hashar) [12:54:36] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:54:46] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: (CRITICAL - Rep Delay is:, 4003.747921, Seconds, [12:54:46] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:55:56] !log Run pt-table-checksum on s4 - T162593 [12:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:03] T162593: Run pt-table-checksum on s4 (commonswiki) - https://phabricator.wikimedia.org/T162593 [12:56:46] RECOVERY - Postgres Replication Lag on maps2002 is OK: (OK - Rep Delay is:, 0.0, Seconds, [12:59:46] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: (CRITICAL - Rep Delay is:, 4303.665043, Seconds, [12:59:58] Operations, Traffic, netops: knams equipment move - https://phabricator.wikimedia.org/T162601#3168183 (ayounsi) [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and
thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170410T1300). Please do the needful. [13:00:04] hashar, Urbanecm, and phuedx: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:26] Operations, DBA, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3168199 (Marostegui) [13:00:46] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:00:50] o/ [13:00:56] o/ [13:00:58] !log stopped search indexer on iridium to lighten load on m3 databases. [13:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:33] twentyafterfour: hi! Looks like you're already aware of the phab issues [13:01:39] Operations, Prometheus-metrics-monitoring, User-fgiunchedi: Effects on adjusting Prometheus retention - https://phabricator.wikimedia.org/T160677#3168202 (fgiunchedi) p:Triage>Normal [13:01:41] o/ [13:01:44] phuedx: lets do your :) [13:01:53] (PS2) Hashar: pagePreviews: Enable NavPopups gadget detection [mediawiki-config] - https://gerrit.wikimedia.org/r/347291 (https://phabricator.wikimedia.org/T160081) (owner: Phuedx) [13:01:58] ema: yeah the phab database is pretty heavily loaded, I think the search indexer was at least part of the problem [13:02:01] though I am not entirely sure what it does :) [13:02:11] I've been backfilling the elasticsearch index for a while [13:02:16] RECOVERY - Postgres Replication Lag on maps2003 is OK: (OK - Rep Delay is:, 0.0, Seconds, [13:02:36] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:02:41] twentyafterfour: OK.
If it helps debugging, we've had a few 500 spikes today: https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=2&fullscreen&orgId=1&from=1491818336930&to=1491829320586&var-site=All&var-cache_type=misc&var-status_type=5 [13:02:45] (PS2) Alexandros Kosiaris: wmflib: multiple os_version changes [puppet] - https://gerrit.wikimedia.org/r/346673 (owner: Faidon Liambotis) [13:02:47] RECOVERY - puppet last run on notebook1001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [13:02:51] (CR) Alexandros Kosiaris: [V: 2 C: 2] wmflib: multiple os_version changes [puppet] - https://gerrit.wikimedia.org/r/346673 (owner: Faidon Liambotis) [13:02:57] Operations: videoscalers (mw1168, mw1169) - high load / overheating - https://phabricator.wikimedia.org/T161918#3168223 (fgiunchedi) p:Triage>Normal [13:03:19] hashar: related task: https://phabricator.wikimedia.org/T160081 [13:03:46] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:03:47] (CR) Gehel: [C: 2] postgresql - replication lag needs to use --output=simple [puppet] - https://gerrit.wikimedia.org/r/347364 (owner: Gehel) [13:03:51] (CR) Hashar: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/347291 (https://phabricator.wikimedia.org/T160081) (owner: Phuedx) [13:03:53] (PS2) Gehel: postgresql - replication lag needs to use --output=simple [puppet] - https://gerrit.wikimedia.org/r/347364 [13:04:01] twentyafterfour: perhaps related to the recent upgrade?
[13:04:12] (CR) Gehel: [V: 2 C: 2] postgresql - replication lag needs to use --output=simple [puppet] - https://gerrit.wikimedia.org/r/347364 (owner: Gehel) [13:04:36] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:04:42] ema: nope, I see what's up - it's search engines indexing us hardcore [13:04:42] (PS3) Hashar: Increase default thumb size to 250px at nowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/334787 (https://phabricator.wikimedia.org/T155892) (owner: Urbanecm) [13:04:44] (PS2) Hashar: Create editprotected right on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/347330 (https://phabricator.wikimedia.org/T162577) (owner: Urbanecm) [13:04:52] (PS5) Alexandros Kosiaris: wmflib: switch to puppetlabs_spec_helper/rake_tasks [puppet] - https://gerrit.wikimedia.org/r/332475 (owner: Hashar) [13:04:56] (CR) Alexandros Kosiaris: [V: 2 C: 2] wmflib: switch to puppetlabs_spec_helper/rake_tasks [puppet] - https://gerrit.wikimedia.org/r/332475 (owner: Hashar) [13:05:16] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: (CRITICAL - Rep Delay is:, 4632.256525, Seconds, [13:06:24] phuedx: ahh so we have an extension for the popups written by wmf AND a gadget written by the community ?
:D [13:06:28] (PS14) Alexandros Kosiaris: wmflib: basic spec for os_version() [puppet] - https://gerrit.wikimedia.org/r/178810 (owner: Hashar) [13:06:36] (Merged) jenkins-bot: pagePreviews: Enable NavPopups gadget detection [mediawiki-config] - https://gerrit.wikimedia.org/r/347291 (https://phabricator.wikimedia.org/T160081) (owner: Phuedx) [13:06:45] (CR) jenkins-bot: pagePreviews: Enable NavPopups gadget detection [mediawiki-config] - https://gerrit.wikimedia.org/r/347291 (https://phabricator.wikimedia.org/T160081) (owner: Phuedx) [13:06:57] hashar: yup, the wmf was a rebuild of the community one, but to avoid any confusion... etc etc [13:07:30] gehel: since you're doing postgres stuff, https://gerrit.wikimedia.org/r/#/c/345837/ (cc: akosiaris) [13:07:32] phuedx: I better understand the context now :D I guess with time the community ones will be integrated in Hovercards [13:08:03] phuedx: patch is on mwdebug1001now [13:08:06] ta [13:08:08] 'Mozilla/5.0 (compatible; mbot/1.8; cust0002; +https://www.teorem.se/bot.html)' [13:08:15] paravoid: kool! [13:08:18] that's no search engine I know of [13:08:40] (CR) Hashar: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/334787 (https://phabricator.wikimedia.org/T155892) (owner: Urbanecm) [13:09:11] (CR) Faidon Liambotis: [C: -1] "Agreed, let's hold until Apr 26th to merge this."
[puppet] - 10https://gerrit.wikimedia.org/r/345838 (owner: 10Faidon Liambotis) [13:09:30] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/345837 (owner: 10Faidon Liambotis) [13:09:33] twentyafterfour there certificate is invalid [13:09:56] RECOVERY - Postgres Replication Lag on maps2004 is OK: (OK - Rep Delay is:, 1.705307, Seconds, [13:10:07] (03Merged) 10jenkins-bot: Increase default thumb size to 250px at nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334787 (https://phabricator.wikimedia.org/T155892) (owner: 10Urbanecm) [13:10:15] (03CR) 10jenkins-bot: Increase default thumb size to 250px at nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334787 (https://phabricator.wikimedia.org/T155892) (owner: 10Urbanecm) [13:10:17] RECOVERY - Postgres Replication Lag on maps2003 is OK: (OK - Rep Delay is:, 17.643769, Seconds, [13:10:20] 06Operations, 13Patch-For-Review, 15User-Elukey: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#3168247 (10fgiunchedi) [13:10:23] 06Operations, 10Monitoring: certspotter on einsteinium has issues talking to external - https://phabricator.wikimedia.org/T162327#3168246 (10fgiunchedi) [13:10:46] RECOVERY - Postgres Replication Lag on maps2002 is OK: (OK - Rep Delay is:, 50.151135, Seconds, [13:10:47] 06Operations, 10Monitoring: certspotter on einsteinium has issues talking to external - https://phabricator.wikimedia.org/T162327#3159311 (10fgiunchedi) p:05Triage>03Low [13:12:37] RECOVERY - Postgres Replication Lag on maps1002 is OK: (OK - Rep Delay is:, 0.0, Seconds, [13:13:32] paladox: yeah, it looks pretty sketchy [13:14:01] Yep. I didnt press continue just in case they downloaded a virus. 
[13:14:12] (03PS2) 10Faidon Liambotis: postgresql: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345837 [13:14:14] (03PS4) 10Faidon Liambotis: mediawiki: kill more precise references [puppet] - 10https://gerrit.wikimedia.org/r/345546 [13:14:17] (03PS4) 10Faidon Liambotis: apache: remove precise (and Apache 2.2) support [puppet] - 10https://gerrit.wikimedia.org/r/345548 [13:14:18] (03PS2) 10Faidon Liambotis: typos: add (requires_os|os_version)\([^)]*[[:upper:]] [puppet] - 10https://gerrit.wikimedia.org/r/346677 (owner: 10Dzahn) [13:14:21] (03PS5) 10Faidon Liambotis: releases: remove the precise suite [puppet] - 10https://gerrit.wikimedia.org/r/345838 [13:14:34] twentyafterfour: yeah I see the requests with User-Agent ~ teorem.se in esams [13:15:06] twentyafterfour: im on a vm right now want me to check it out? [13:15:17] hashar: thumbs up [13:15:27] \^/ [13:15:36] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: (CRITICAL - Rep Delay is:, 11128.704343, Seconds, [13:15:45] (03CR) 10Faidon Liambotis: [C: 032] postgresql: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345837 (owner: 10Faidon Liambotis) [13:16:08] ema: theres nothing there at teorem.se [13:16:25] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: pagePreviews: Enable NavPopups gadget detection - T160081 (duration: 00m 40s) [13:16:28] phuedx: it is live. Congratulations. [13:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:33] T160081: Navigational popups and page previews appearing simultaneously - https://phabricator.wikimedia.org/T160081 [13:16:50] gehel: ty, merged [13:17:14] nowiki default thumbsize is bumped to 250px (tested/works) <-- gilles [13:17:15] nice to see this! At some point, the maps puppet code will look clean... 
[13:17:22] Zppix: the User-Agent is 'Mozilla/5.0 (compatible; mbot/1.8; cust0002; +https://www.teorem.se/bot.html)', invalid cert as paladox mentioned [13:17:36] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347330 (https://phabricator.wikimedia.org/T162577) (owner: 10Urbanecm) [13:17:36] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Increase default thumb size to 250px at nowiki - T155892 (duration: 00m 45s) [13:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:43] T155892: Increase default thumb size to 250px at nowiki - https://phabricator.wikimedia.org/T155892 [13:18:01] I guess it is someone wanting to cause problems to wikimedias domain. [13:18:02] (03CR) 10jerkins-bot: [V: 04-1] releases: remove the precise suite [puppet] - 10https://gerrit.wikimedia.org/r/345838 (owner: 10Faidon Liambotis) [13:18:36] paladox: i would stay off it i went to it and lets just say im still getting notifications from my vm (luckily it was my throw away nuclear crash course vm) [13:18:44] oh my bad [13:18:57] (03Merged) 10jenkins-bot: Create editprotected right on ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347330 (https://phabricator.wikimedia.org/T162577) (owner: 10Urbanecm) [13:19:06] (03CR) 10jenkins-bot: Create editprotected right on ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347330 (https://phabricator.wikimedia.org/T162577) (owner: 10Urbanecm) [13:19:09] !log reboot analytics1040->1050 to pick up the new kernel [13:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:24] ok [13:20:06] (03PS6) 10Faidon Liambotis: releases: remove the precise suite [puppet] - 10https://gerrit.wikimedia.org/r/345838 [13:20:21] Zppix what type of notifications are you getting? 
[13:20:25] Random ads [13:20:38] paladox: the antivirus i have on it is giving me malware trojan the whole 9 [13:20:46] RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK [13:20:48] (03PS2) 10Volans: Logging: add multiple handlers to the logger [switchdc] - 10https://gerrit.wikimedia.org/r/347180 (https://phabricator.wikimedia.org/T160178) [13:20:52] (03PS1) 10Volans: Logging: use the new handler, remove log_dry_run [switchdc] - 10https://gerrit.wikimedia.org/r/347369 (https://phabricator.wikimedia.org/T160178) [13:20:52] (03PS1) 10Volans: Logging: uniformed log levels and messages [switchdc] - 10https://gerrit.wikimedia.org/r/347370 (https://phabricator.wikimedia.org/T160178) [13:20:55] (03PS1) 10Volans: MySQL: better dry-run handling [switchdc] - 10https://gerrit.wikimedia.org/r/347371 (https://phabricator.wikimedia.org/T160178) [13:20:56] (03PS1) 10Volans: Tendril: fix an error in the exception raising [switchdc] - 10https://gerrit.wikimedia.org/r/347372 (https://phabricator.wikimedia.org/T160178) [13:20:59] (03PS1) 10Volans: Disable puppet: fix title and docstring [switchdc] - 10https://gerrit.wikimedia.org/r/347373 (https://phabricator.wikimedia.org/T160178) [13:21:01] (03PS2) 10Gehel: postgresql - use the check-postgres package for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/346963 (https://phabricator.wikimedia.org/T162345) [13:21:02] (03PS1) 10Volans: Formatting improvements [switchdc] - 10https://gerrit.wikimedia.org/r/347374 (https://phabricator.wikimedia.org/T160178) [13:21:04] Zppix lol, someone wants to infect phabricator. 
[13:21:04] (03PS1) 10Volans: MediaWiki: announce explicitly the read-only period [switchdc] - 10https://gerrit.wikimedia.org/r/347375 (https://phabricator.wikimedia.org/T160178) [13:21:07] (03PS1) 10Volans: Menu: avoid double failing message [switchdc] - 10https://gerrit.wikimedia.org/r/347376 (https://phabricator.wikimedia.org/T160178) [13:21:09] (03PS1) 10Jcrespo: Repool db1055 after maintenance with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347377 (https://phabricator.wikimedia.org/T159319) [13:21:11] (03PS5) 10Faidon Liambotis: mediawiki: kill more precise references [puppet] - 10https://gerrit.wikimedia.org/r/345546 [13:21:15] ^^ thats alot of changes [13:21:34] (03CR) 10Faidon Liambotis: [C: 032] mediawiki: kill more precise references [puppet] - 10https://gerrit.wikimedia.org/r/345546 (owner: 10Faidon Liambotis) [13:21:42] paladox: i mean we just have it backupped so they can try all they want, we can just switch to cofw's backup [13:22:48] Zppix they are some how hitting the phabricator domain not the backend as it is un assessible to the outside world. Though someone may be trying to exploit phab. [13:23:10] paladox: cant we just block their traffic? 
[13:23:17] Yes [13:23:37] I say just hit em with a good ol rangeblock internally [13:24:40] (03PS3) 10Gehel: postgresql - use the check-postgres package for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/346963 (https://phabricator.wikimedia.org/T162345) [13:25:23] (03PS1) 10Muehlenhoff: Allow silencing a debconf query [puppet] - 10https://gerrit.wikimedia.org/r/347378 [13:25:58] (03PS2) 10Hashar: Set wgTranslateNumerals false on bhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346965 (https://phabricator.wikimedia.org/T160098) (owner: 10Sfic) [13:26:11] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346965 (https://phabricator.wikimedia.org/T160098) (owner: 10Sfic) [13:26:20] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Create editprotected right on ptwikinews - T162577 (duration: 00m 40s) [13:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:27] T162577: Create "editprotected" right on Portuguese Wikinews - https://phabricator.wikimedia.org/T162577 [13:27:29] (03Merged) 10jenkins-bot: Set wgTranslateNumerals false on bhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346965 (https://phabricator.wikimedia.org/T160098) (owner: 10Sfic) [13:27:38] (03CR) 10jenkins-bot: Set wgTranslateNumerals false on bhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346965 (https://phabricator.wikimedia.org/T160098) (owner: 10Sfic) [13:30:07] PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [13:30:10] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3168304 (10Aklapper) I fail to load https://upload.wikimedia.org/wikipe... 
[13:30:23] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Set wgTranslateNumerals false on bhwiki - T160098 (duration: 00m 40s) [13:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:30] T160098: Change default numerals on Bhojpuri Wikipedia from Devanagari to Arabic numerals - https://phabricator.wikimedia.org/T160098 [13:31:18] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Small error in Outputhandler, but lgtm otherwise." (031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/347180 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:32:02] (03CR) 10Alexandros Kosiaris: [C: 032] apache: remove precise (and Apache 2.2) support [puppet] - 10https://gerrit.wikimedia.org/r/345548 (owner: 10Faidon Liambotis) [13:32:06] (03PS5) 10Alexandros Kosiaris: apache: remove precise (and Apache 2.2) support [puppet] - 10https://gerrit.wikimedia.org/r/345548 (owner: 10Faidon Liambotis) [13:32:09] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] apache: remove precise (and Apache 2.2) support [puppet] - 10https://gerrit.wikimedia.org/r/345548 (owner: 10Faidon Liambotis) [13:32:50] 06Operations, 06Commons, 10media-storage: Commons File:Assemblea_Costituente_1946_(2).svg missing after file move - https://phabricator.wikimedia.org/T161476#3132721 (10fgiunchedi) I believe this is related to T111838 and similar bugs, I'm tentatively merging it there [13:32:55] (03PS3) 10Faidon Liambotis: typos: add (requires_os|os_version)\([^)]*[[:upper:]] [puppet] - 10https://gerrit.wikimedia.org/r/346677 (owner: 10Dzahn) [13:33:01] (03CR) 10Faidon Liambotis: [C: 032] typos: add (requires_os|os_version)\([^)]*[[:upper:]] [puppet] - 10https://gerrit.wikimedia.org/r/346677 (owner: 10Dzahn) [13:33:13] 06Operations, 06Commons, 10media-storage, 05MW-1.27-release (WMF-deploy-2015-11-03_(1.27.0-wmf.5)), 05MW-1.27-release-notes: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#1617529 
(10fgiunchedi) [13:33:16] 06Operations, 06Commons, 10media-storage: Commons File:Assemblea_Costituente_1946_(2).svg missing after file move - https://phabricator.wikimedia.org/T161476#3168333 (10fgiunchedi) [13:33:34] (03PS3) 10Volans: Logging: add multiple handlers to the logger [switchdc] - 10https://gerrit.wikimedia.org/r/347180 (https://phabricator.wikimedia.org/T160178) [13:33:43] (03CR) 10Volans: "Fixed" (031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/347180 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:35:14] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3168334 (10ema) >>! In T162035#3168304, @Aklapper wrote: > I fail to lo... [13:39:56] RECOVERY - Postgres Replication Lag on maps1002 is OK: (OK - Rep Delay is:, 0.0, Seconds, [13:40:06] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: (CRITICAL - Rep Delay is:, 1811.714751, Seconds, [13:40:16] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: (CRITICAL - Rep Delay is:, 1817.467036, Seconds, [13:40:59] (03PS2) 10Volans: Logging: use the new handler, remove log_dry_run [switchdc] - 10https://gerrit.wikimedia.org/r/347369 (https://phabricator.wikimedia.org/T160178) [13:41:20] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "A couple of minor things to change, but LGTM otherwise." 
(032 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/347369 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:41:47] !log upgrading wtp1002-wtp1005 to Linux 4.9 [13:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:47] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: (CRITICAL - Rep Delay is:, 1972.674695, Seconds, [13:43:01] (03CR) 10Volans: "Fixed" (032 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/347369 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:44:06] RECOVERY - Postgres Replication Lag on maps2004 is OK: (OK - Rep Delay is:, 0.0, Seconds, [13:44:32] (03CR) 10Jcrespo: [C: 032] Repool db1055 after maintenance with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347377 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo) [13:44:52] I am about to deploy gerrit:347377 [13:45:35] go for it! [13:45:53] Needs Verified Label [13:46:47] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3168342 (10elukey) Thanks a lot! The errors logged at the moment are clouded by https:... 
[13:46:48] (03Merged) 10jenkins-bot: Repool db1055 after maintenance with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347377 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo) [13:46:59] (03CR) 10jenkins-bot: Repool db1055 after maintenance with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347377 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo) [13:47:06] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: (CRITICAL - Rep Delay is:, 2231.640589, Seconds, [13:49:16] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1055 after maintenance with low weight (duration: 00m 38s) [13:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:46] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [13:50:07] RECOVERY - Postgres Replication Lag on maps2004 is OK: (OK - Rep Delay is:, 0.0, Seconds, [13:53:06] PROBLEM - Postgres Replication Lag on maps-test2004 is CRITICAL: (CRITICAL - Rep Delay is:, 1872.688149, Seconds, [13:53:06] PROBLEM - Postgres Replication Lag on maps-test2003 is CRITICAL: (CRITICAL - Rep Delay is:, 1875.298713, Seconds, [13:53:07] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: (CRITICAL - Rep Delay is:, 2591.594838, Seconds, [13:53:16] RECOVERY - Postgres Replication Lag on maps2003 is OK: (OK - Rep Delay is:, 0.0, Seconds, [13:53:47] RECOVERY - Postgres Replication Lag on maps2002 is OK: (OK - Rep Delay is:, 0.0, Seconds, [13:53:58] (03CR) 10Gehel: "Data from Grafana (https://grafana-admin.wikimedia.org/dashboard/db/maps-performances?orgId=1&from=now-3h&to=now) indicates that a 1Mb war" [puppet] - 10https://gerrit.wikimedia.org/r/346963 (https://phabricator.wikimedia.org/T162345) (owner: 10Gehel) [13:54:42] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "A few nitpicks, and a couple of details to fix." 
(034 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/347370 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:55:06] RECOVERY - Postgres Replication Lag on maps2004 is OK: (OK - Rep Delay is:, 0.0, Seconds, [13:55:17] 06Operations, 10Traffic: cache_upload: uncompressed images with Content-Encoding: gzip cause content decoding issues - https://phabricator.wikimedia.org/T148830#3168355 (10ema) 05Resolved>03Open Reopening, another instance of this bug has been reported in T162035#3168304. [13:56:06] RECOVERY - Postgres Replication Lag on maps-test2004 is OK: (OK - Rep Delay is:, 0.0, Seconds, [13:56:11] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM" [switchdc] - 10https://gerrit.wikimedia.org/r/347371 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:56:16] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: (CRITICAL - Rep Delay is:, 2777.50309, Seconds, [13:56:46] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: (CRITICAL - Rep Delay is:, 2812.911793, Seconds, [13:57:02] (03CR) 10Giuseppe Lavagetto: [C: 031] Tendril: fix an error in the exception raising [switchdc] - 10https://gerrit.wikimedia.org/r/347372 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:57:19] gehel: I've seen a lot of those maps pg repl alerts flapping lately, what's up with it? (bad check?) [13:57:26] (03CR) 10Giuseppe Lavagetto: [C: 031] Disable puppet: fix title and docstring [switchdc] - 10https://gerrit.wikimedia.org/r/347373 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:57:52] bblack: yep, I have a change of check coming up, testing it right now... [13:58:04] nice :) [13:58:06] RECOVERY - Postgres Replication Lag on maps-test2003 is OK: (OK - Rep Delay is:, 0.0, Seconds, [13:58:07] gotta love bots gehel [13:58:52] bblack: since the maps servers are updated once a day, they see almost no write traffic. And when a vacuum takes place, suddenly there is a large change in lag for a short time. 
[13:59:06] (03CR) 10Giuseppe Lavagetto: [C: 031] Formatting improvements [switchdc] - 10https://gerrit.wikimedia.org/r/347374 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:59:06] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: (CRITICAL - Rep Delay is:, 2232.550606, Seconds, [13:59:06] PROBLEM - Postgres Replication Lag on maps-test2004 is CRITICAL: (CRITICAL - Rep Delay is:, 2232.609279, Seconds, [13:59:06] PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [13:59:06] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: (CRITICAL - Rep Delay is:, 2951.668787, Seconds, [13:59:11] solution is to stop looking at lag in time, but start looking at lag in bytes [13:59:38] gehel: or not for lag during the times of the write? [13:59:55] (03CR) 10Volans: "Replies inline" (034 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/347370 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:00:03] (03CR) 10Giuseppe Lavagetto: [C: 031] "great!" [switchdc] - 10https://gerrit.wikimedia.org/r/347375 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:00:06] Zppix: not sure I understood... [14:00:06] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: (OK - Rep Delay is:, 0.0, Seconds, [14:00:06] RECOVERY - Postgres Replication Lag on maps-test2004 is OK: (OK - Rep Delay is:, 0.0, Seconds, [14:00:18] gehel: you could also not check for lag while the write occours [14:00:51] Zppix: you mean disable the check during write operations? 
[14:00:57] yes gehel [14:00:59] (03CR) 10Giuseppe Lavagetto: [C: 031] Menu: avoid double failing message [switchdc] - 10https://gerrit.wikimedia.org/r/347376 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:01:24] that seems more complex, especially since some writes (like the vacuum) are started more or less automagically by postgres itself [14:02:08] gehel: they arent on a timer? [14:02:37] (03CR) 10Giuseppe Lavagetto: [C: 032] Logging: add multiple handlers to the logger [switchdc] - 10https://gerrit.wikimedia.org/r/347180 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:02:39] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1028" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347382 [14:02:43] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1028" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347382 [14:03:16] RECOVERY - Postgres Replication Lag on maps2003 is OK: (OK - Rep Delay is:, 0.0, Seconds, [14:03:23] (03CR) 10Giuseppe Lavagetto: [C: 032] Logging: use the new handler, remove log_dry_run [switchdc] - 10https://gerrit.wikimedia.org/r/347369 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:03:28] (03PS3) 10Giuseppe Lavagetto: Logging: use the new handler, remove log_dry_run [switchdc] - 10https://gerrit.wikimedia.org/r/347369 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:03:34] Zppix: nope, we use auto vacuum (https://www.postgresql.org/docs/9.4/static/runtime-config-autovacuum.html) [14:03:35] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Logging: use the new handler, remove log_dry_run [switchdc] - 10https://gerrit.wikimedia.org/r/347369 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:04:20] gehel: oh... 
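[editor's note] The switch from lag in time to lag in bytes proposed above can be illustrated with a minimal sketch. It assumes the check compares two WAL positions (LSNs) in PostgreSQL's textual `X/Y` hexadecimal form, where the 64-bit position is the high half shifted left by 32 bits plus the low half. The function names and the 16 MiB threshold are hypothetical and chosen for illustration only; this is not the check-postgres implementation, nor the thresholds deployed on the maps hosts.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '0/3000148' to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica still has to replay (never negative)."""
    return max(0, parse_lsn(primary_lsn) - parse_lsn(replica_lsn))


def lag_status(primary_lsn: str, replica_lsn: str,
               critical_bytes: int = 16 * 1024 * 1024) -> str:
    """Return an Icinga-style status word; the threshold is an assumed value."""
    return "CRITICAL" if lag_bytes(primary_lsn, replica_lsn) >= critical_bytes else "OK"


if __name__ == "__main__":
    # Hypothetical positions: the replica is 232 bytes behind the primary.
    print(lag_bytes("0/3000148", "0/3000060"))
    print(lag_status("0/3000148", "0/3000060"))
```

A byte difference like this does not depend on how long ago the last write happened, so a mostly idle primary should not make it grow the way a time-since-last-replay figure can.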
[14:04:55] in any case, I prefer to check a stable metric than disable checks in some conditions :) [14:05:09] !log reimage anaytics1001 to Debian Jessie [14:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:37] akosiaris: if you have a minute to check https://gerrit.wikimedia.org/r/#/c/346963/ (trying to fix those postgresql alerts...) [14:06:56] RECOVERY - Postgres Replication Lag on maps2002 is OK: (OK - Rep Delay is:, 0.0, Seconds, [14:08:56] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Logging: uniformed log levels and messages (033 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/347370 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:09:06] RECOVERY - Postgres Replication Lag on maps2004 is OK: (OK - Rep Delay is:, 0.0, Seconds, [14:09:13] (03CR) 10Giuseppe Lavagetto: [C: 032] Logging: uniformed log levels and messages [switchdc] - 10https://gerrit.wikimedia.org/r/347370 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:09:18] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1028" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347382 (owner: 10Marostegui) [14:09:20] (03PS2) 10Giuseppe Lavagetto: Logging: uniformed log levels and messages [switchdc] - 10https://gerrit.wikimedia.org/r/347370 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:10:16] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: (CRITICAL - Rep Delay is:, 3617.546789, Seconds, [14:10:57] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: (CRITICAL - Rep Delay is:, 3662.675495, Seconds, [14:11:24] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1028" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347382 (owner: 10Marostegui) [14:11:33] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1028" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347382 (owner: 10Marostegui) [14:12:15] gehel: is maps pg using autovacuum or some manual thing? 
it's been a long time since I used pg in prod myself, but I was thinking if the data updates are once a day, you could ensure vacuums are manually executed daily (e.g. from cron) so that they're on opposite timing from the data updates, too. [14:12:23] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1028 - T160390 (duration: 00m 38s) [14:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:30] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [14:12:45] (03PS2) 10Giuseppe Lavagetto: MySQL: better dry-run handling [switchdc] - 10https://gerrit.wikimedia.org/r/347371 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:12:49] bblack: autov [14:13:34] (03CR) 10Giuseppe Lavagetto: [C: 032] MySQL: better dry-run handling [switchdc] - 10https://gerrit.wikimedia.org/r/347371 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:13:46] (03PS2) 10Giuseppe Lavagetto: Tendril: fix an error in the exception raising [switchdc] - 10https://gerrit.wikimedia.org/r/347372 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:14:56] (03CR) 10Giuseppe Lavagetto: [C: 032] Tendril: fix an error in the exception raising [switchdc] - 10https://gerrit.wikimedia.org/r/347372 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:14:58] bblack: it is using autovacuum. [14:15:16] RECOVERY - Postgres Replication Lag on maps2003 is OK: (OK - Rep Delay is:, 0.0, Seconds, [14:15:47] bblack: yep, daily vacuum might make sense! But we are thinking of increasing the OSM replication frequency, so that's probably the way to go... 
[14:16:26] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Tendril: fix an error in the exception raising [switchdc] - 10https://gerrit.wikimedia.org/r/347372 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:16:29] yeah if the replication frequency goes up and each replication is a smaller chunk of writes, eventually down that path you reach a point where you're better off with autovac I assume [14:17:06] bblack: totally unrelated, but wdqs codfw is now up to date, ready for active active... [14:17:39] https://gerrit.wikimedia.org/r/#/c/346543/ ? [14:17:50] (03CR) 10Giuseppe Lavagetto: [C: 032] Disable puppet: fix title and docstring [switchdc] - 10https://gerrit.wikimedia.org/r/347373 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:17:54] (03PS2) 10Giuseppe Lavagetto: Disable puppet: fix title and docstring [switchdc] - 10https://gerrit.wikimedia.org/r/347373 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:17:55] whenever you're really ready basically. I've already tested the a/a basics using noc and config-master services [14:17:58] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Disable puppet: fix title and docstring [switchdc] - 10https://gerrit.wikimedia.org/r/347373 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:17:58] bblack: I'm planning to add that to the deployment window this evening (CEST). Anything particular I should check before merging that one? [14:18:05] (03PS2) 10Giuseppe Lavagetto: Formatting improvements [switchdc] - 10https://gerrit.wikimedia.org/r/347374 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:18:07] no, nothing in particular [14:18:34] ok, then we'll see this evening... 
I'll ping you if things don't look good :P [14:18:40] don't even have to force that change or use any special timing [14:18:49] just merge it up and let it roll out on its own [14:19:16] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: (CRITICAL - Rep Delay is:, 4157.537083, Seconds, [14:20:43] 06Operations, 10Traffic, 10netops: knams equipment move - https://phabricator.wikimedia.org/T162601#3168403 (10ayounsi) After discussion with @BBlack As knams going down will not impact connectivity between esams and eqiad, and esams has enough transit capacity to take over knams transits, the following pla... [14:21:06] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: (CRITICAL - Rep Delay is:, 4271.782465, Seconds, [14:21:34] (03CR) 10Giuseppe Lavagetto: [C: 032] Formatting improvements [switchdc] - 10https://gerrit.wikimedia.org/r/347374 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:21:45] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Formatting improvements [switchdc] - 10https://gerrit.wikimedia.org/r/347374 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:22:07] (03PS2) 10Giuseppe Lavagetto: MediaWiki: announce explicitly the read-only period [switchdc] - 10https://gerrit.wikimedia.org/r/347375 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:23:03] (03PS2) 10Urbanecm: Give sysops ability to promote users to eliminator at fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347166 (https://phabricator.wikimedia.org/T162396) [14:23:18] (03PS1) 1020after4: Phab: add 90.227.4.251 to ban list [puppet] - 10https://gerrit.wikimedia.org/r/347384 [14:23:23] (03PS4) 10Gehel: wdqs: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/346543 (https://phabricator.wikimedia.org/T162111) [14:23:27] (03PS3) 10Urbanecm: Give sysops ability to promote users to eliminator at fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347166 
(https://phabricator.wikimedia.org/T162396) [14:23:40] (03CR) 10Giuseppe Lavagetto: [C: 032] MediaWiki: announce explicitly the read-only period [switchdc] - 10https://gerrit.wikimedia.org/r/347375 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:24:10] (03PS1) 10Jcrespo: Repool db1055 with full weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347385 (https://phabricator.wikimedia.org/T159319) [14:24:27] (03CR) 10Alexandros Kosiaris: [C: 031] "LGTM, but keep in mind that there is a window during which there will be alerts as the old script is being absent but the icinga configura" [puppet] - 10https://gerrit.wikimedia.org/r/346963 (https://phabricator.wikimedia.org/T162345) (owner: 10Gehel) [14:24:42] (03PS2) 10Giuseppe Lavagetto: Menu: avoid double failing message [switchdc] - 10https://gerrit.wikimedia.org/r/347376 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:25:14] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Phab: add 90.227.4.251 to ban list [puppet] - 10https://gerrit.wikimedia.org/r/347384 (owner: 1020after4) [14:25:21] (03CR) 10Jcrespo: [C: 04-1] "Waiting for buffer pool to stabilize: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347385 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo) [14:25:46] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: (CRITICAL - Rep Delay is:, 2135.104163, Seconds, [14:25:48] (03PS1) 10Marostegui: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347386 (https://phabricator.wikimedia.org/T142725) [14:25:56] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: (CRITICAL - Rep Delay is:, 2144.805083, Seconds, [14:25:59] jynus: can I deploy ^ (after rebase?) 
[14:26:15] (03CR) 10Giuseppe Lavagetto: [C: 032] Menu: avoid double failing message [switchdc] - 10https://gerrit.wikimedia.org/r/347376 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:26:37] (03PS4) 10Gehel: postgresql - use the check-postgres package for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/346963 (https://phabricator.wikimedia.org/T162345) [14:26:41] Oh, I thought you committed yours, anyways, is it fine for you to deploy? I am asking because you are also playing with s1, so just in case [14:26:46] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: (CRITICAL - Rep Delay is:, 2195.24401, Seconds, [14:27:07] RECOVERY - Postgres Replication Lag on maps2004 is OK: (OK - Rep Delay is:, 0.0, Seconds, [14:27:33] (03PS2) 10Marostegui: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347386 (https://phabricator.wikimedia.org/T132416) [14:27:46] RECOVERY - Postgres Replication Lag on maps1003 is OK: (OK - Rep Delay is:, 0.0, Seconds, [14:29:06] RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK [14:29:26] (03CR) 10Gehel: [C: 032] postgresql - use the check-postgres package for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/346963 (https://phabricator.wikimedia.org/T162345) (owner: 10Gehel) [14:29:46] RECOVERY - Postgres Replication Lag on maps1004 is OK: (OK - Rep Delay is:, 0.0, Seconds, [14:31:12] !log deploying new psotgresql replication check, might generate a few icinga alerts -T162345 [14:31:18] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3016 [14:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:19] T162345: Reduce number of false positive alerts on postgresql lag for maps - https://phabricator.wikimedia.org/T162345 [14:31:56] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1528 
[14:32:06] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 207216 [14:39:00] (PS1) Alexandros Kosiaris: Make backup::set effectively a virtual resource [puppet] - https://gerrit.wikimedia.org/r/347388 [14:39:07] !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t00_disable_puppet(eqiad, codfw) Disabling puppet on MediaWiki jobrunners and videoscalers [14:39:12] !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t00_disable_puppet(eqiad, codfw) Successfully completed [14:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:27] (CR) Marostegui: [C: 2] "for heads up" [mediawiki-config] - https://gerrit.wikimedia.org/r/347386 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui) [14:41:06] PROBLEM - puppet last run on db1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:42:10] spike of high 500, phabricator?
[14:42:52] Operations, media-storage: Swift version and distro upgrade - https://phabricator.wikimedia.org/T162609#3168478 (fgiunchedi) [14:42:56] jynus known [14:43:03] twentyafterfour is aware [14:43:22] Operations, media-storage: swift backend machines load spike: cause and remediation - https://phabricator.wikimedia.org/T84385#3168497 (fgiunchedi) [14:43:25] it just had another spike in queries so looks related [14:43:27] Operations, media-storage, Patch-For-Review: swift upgrade plans: jessie and swift 2.x - https://phabricator.wikimedia.org/T117972#3168493 (fgiunchedi) Open>Invalid Superseded by T162609 [14:43:46] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [14:44:46] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [14:45:02] !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t09_start_maintenance(codfw, eqiad) Start MediaWiki maintenance in the new master DC [14:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:09] !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t09_start_maintenance(codfw, eqiad) Failed to execute [14:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:33] !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t09_restore_ttl(codfw, eqiad) Restore the TTL of all the MediaWiki discovery records [14:45:35] !log upgrade cache_maps to linux 4.9 T162029 [14:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:46] T162029: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029 [14:45:46] !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t09_restore_ttl(codfw, eqiad) Successfully completed [14:45:52] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:54] (Merged) jenkins-bot: db-eqiad.php: Depool db1073 [mediawiki-config] - https://gerrit.wikimedia.org/r/347386 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui) [14:46:57] !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t09_start_maintenance(codfw, eqiad) Start MediaWiki maintenance in the new master DC [14:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:03] (CR) jenkins-bot: db-eqiad.php: Depool db1073 [mediawiki-config] - https://gerrit.wikimedia.org/r/347386 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui) [14:47:05] !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t09_start_maintenance(codfw, eqiad) Failed to execute [14:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:04] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1073 - T132416 (duration: 00m 39s) [14:48:08] !log Deploy alter table enwiki.revision db1073 - T132416 [14:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:11] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416 [14:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:46] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:51:01] Operations, Office-IT, LDAP: Make disabled accounts visible in the corp mirror LDAP replica - https://phabricator.wikimedia.org/T160158#3168517 (MoritzMuehlenhoff) It seems my test set was a case of really bad luck. The two disabled users accounts I was testing against, in fact still show up under cn...
[14:52:22] !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t00_disable_puppet(eqiad, codfw) Disabling puppet on MediaWiki jobrunners and videoscalers [14:52:27] !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t00_disable_puppet(eqiad, codfw) Successfully completed [14:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:46] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:54:47] (PS1) Volans: Logging: fix and simplify the stderr logging [switchdc] - https://gerrit.wikimedia.org/r/347390 (https://phabricator.wikimedia.org/T160178) [14:54:49] (CR) Alexandros Kosiaris: Add 3d2png deploy repo to image scalers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: MarkTraceur) [14:55:58] (CR) Giuseppe Lavagetto: [C: 2] Logging: fix and simplify the stderr logging [switchdc] - https://gerrit.wikimedia.org/r/347390 (https://phabricator.wikimedia.org/T160178) (owner: Volans) [14:57:45] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:58:16] !log upgrading wtp1006-wtp1009 to Linux 4.9 [14:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:05] PROBLEM - puppet last run on nihal is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:01:25] PROBLEM - YARN NodeManager Node-State on analytics1037 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:26] PROBLEM - YARN NodeManager Node-State on analytics1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:27] PROBLEM - YARN NodeManager Node-State on analytics1056 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:28] PROBLEM - YARN NodeManager Node-State on analytics1048 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:31] PROBLEM - YARN NodeManager Node-State on analytics1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:31] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:32] PROBLEM - YARN NodeManager Node-State on analytics1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:33] PROBLEM - YARN NodeManager Node-State on analytics1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:34] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:36] PROBLEM - YARN NodeManager Node-State on analytics1041 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:37] PROBLEM - YARN NodeManager Node-State on analytics1043 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:38] PROBLEM - YARN NodeManager Node-State on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:40] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:40] PROBLEM - YARN NodeManager Node-State on analytics1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:42] PROBLEM - YARN NodeManager Node-State on analytics1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:42] !log disabling puppet on labcontrol1001 to raise log levels [15:01:43] PROBLEM - YARN NodeManager Node-State on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:44] PROBLEM - YARN NodeManager Node-State on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:45] PROBLEM - YARN NodeManager Node-State on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:01:47] PROBLEM - YARN NodeManager Node-State on analytics1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:48] PROBLEM - YARN NodeManager Node-State on analytics1036 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:49] PROBLEM - YARN NodeManager Node-State on analytics1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:50] PROBLEM - YARN NodeManager Node-State on analytics1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:51] PROBLEM - YARN NodeManager Node-State on analytics1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:02:02] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [15:02:12] PROBLEM - Disk space on restbase-dev1001 is CRITICAL: DISK CRITICAL - free space: / 450 MB (1% inode=95%) [15:02:12] PROBLEM - YARN NodeManager Node-State on analytics1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:03:12] PROBLEM - cassandra-a CQL 10.64.0.36:9042 on restbase-dev1001 is CRITICAL: connect to address 10.64.0.36 and port 9042: Connection refused [15:03:12] PROBLEM - cassandra-b CQL 10.64.0.37:9042 on restbase-dev1001 is CRITICAL: connect to address 10.64.0.37 and port 9042: Connection refused [15:03:13] PROBLEM - Check systemd state on restbase-dev1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:03:13] PROBLEM - cassandra-b SSL 10.64.0.37:7001 on restbase-dev1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:03:14] Operations, media-storage: Swift version and distro upgrade - https://phabricator.wikimedia.org/T162609#3168565 (faidon) [15:03:22] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [15:03:32] PROBLEM - cassandra-b service on restbase-dev1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [15:03:32] PROBLEM - cassandra-a SSL 10.64.0.36:7001 on restbase-dev1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:03:33] RECOVERY - YARN NodeManager Node-State on analytics1045 is OK: OK: YARN NodeManager analytics1045.eqiad.wmnet:8041 Node-State: RUNNING [15:03:34] RECOVERY - YARN NodeManager Node-State on analytics1048 is OK: OK: YARN NodeManager analytics1048.eqiad.wmnet:8041 Node-State: RUNNING [15:03:35] RECOVERY - YARN NodeManager Node-State on analytics1046 is OK: OK: YARN NodeManager analytics1046.eqiad.wmnet:8041 Node-State: RUNNING [15:03:36] RECOVERY - YARN NodeManager Node-State on analytics1056 is OK: OK: YARN NodeManager analytics1056.eqiad.wmnet:8041 Node-State: RUNNING [15:03:37] RECOVERY - YARN NodeManager Node-State on analytics1037 is OK: OK: YARN NodeManager analytics1037.eqiad.wmnet:8041 Node-State: RUNNING [15:03:39] RECOVERY - YARN NodeManager Node-State on analytics1040 is OK: OK: YARN NodeManager analytics1040.eqiad.wmnet:8041 Node-State: RUNNING [15:03:40] RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING [15:03:41] RECOVERY - YARN NodeManager Node-State on analytics1042 is OK: OK: YARN NodeManager analytics1042.eqiad.wmnet:8041 Node-State: RUNNING [15:03:42] RECOVERY - YARN NodeManager Node-State on analytics1043 is OK: OK: YARN NodeManager analytics1043.eqiad.wmnet:8041 Node-State: RUNNING [15:03:43] RECOVERY - YARN NodeManager Node-State on analytics1044 is OK: OK: YARN NodeManager analytics1044.eqiad.wmnet:8041 Node-State: RUNNING [15:03:45] RECOVERY - YARN NodeManager Node-State on analytics1047 is OK: OK: YARN NodeManager analytics1047.eqiad.wmnet:8041 Node-State: RUNNING [15:03:46] RECOVERY - YARN NodeManager Node-State on 
analytics1054 is OK: OK: YARN NodeManager analytics1054.eqiad.wmnet:8041 Node-State: RUNNING [15:03:47] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [15:03:48] RECOVERY - YARN NodeManager Node-State on analytics1038 is OK: OK: YARN NodeManager analytics1038.eqiad.wmnet:8041 Node-State: RUNNING [15:03:49] RECOVERY - YARN NodeManager Node-State on analytics1041 is OK: OK: YARN NodeManager analytics1041.eqiad.wmnet:8041 Node-State: RUNNING [15:03:51] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING [15:03:53] we are checking [15:03:57] not sure what happened [15:04:01] PROBLEM - cassandra-a service on restbase-dev1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [15:04:01] RECOVERY - YARN NodeManager Node-State on analytics1049 is OK: OK: YARN NodeManager analytics1049.eqiad.wmnet:8041 Node-State: RUNNING [15:04:02] RECOVERY - YARN NodeManager Node-State on analytics1055 is OK: OK: YARN NodeManager analytics1055.eqiad.wmnet:8041 Node-State: RUNNING [15:04:04] RECOVERY - YARN NodeManager Node-State on analytics1051 is OK: OK: YARN NodeManager analytics1051.eqiad.wmnet:8041 Node-State: RUNNING [15:04:05] RECOVERY - YARN NodeManager Node-State on analytics1057 is OK: OK: YARN NodeManager analytics1057.eqiad.wmnet:8041 Node-State: RUNNING [15:04:06] RECOVERY - YARN NodeManager Node-State on analytics1053 is OK: OK: YARN NodeManager analytics1053.eqiad.wmnet:8041 Node-State: RUNNING [15:04:07] RECOVERY - YARN NodeManager Node-State on analytics1052 is OK: OK: YARN NodeManager analytics1052.eqiad.wmnet:8041 Node-State: RUNNING [15:04:08] RECOVERY - YARN NodeManager Node-State on analytics1036 is OK: OK: YARN NodeManager analytics1036.eqiad.wmnet:8041 Node-State: RUNNING [15:04:10] analytics1002 was the master [15:04:21] RECOVERY - Check systemd state on restbase-dev1001 is 
OK: OK - running: The system is fully operational [15:04:22] RECOVERY - YARN NodeManager Node-State on analytics1050 is OK: OK: YARN NodeManager analytics1050.eqiad.wmnet:8041 Node-State: RUNNING [15:04:28] (CR) Alexandros Kosiaris: [C: 2] RESTBase: Migrate to Scap3 deployment [puppet] - https://gerrit.wikimedia.org/r/346248 (https://phabricator.wikimedia.org/T116335) (owner: Mobrovac) [15:04:31] RECOVERY - cassandra-b service on restbase-dev1001 is OK: OK - cassandra-b is active [15:04:32] Operations, media-storage: Swift version and distro upgrade - https://phabricator.wikimedia.org/T162609#3168478 (faidon) The plan above totally makes sense to me and sounds like the path of the least amount of work with the maximum amount of consistency. I'd add a final step of upgrading the jessie syst... [15:04:34] (PS5) Alexandros Kosiaris: RESTBase: Migrate to Scap3 deployment [puppet] - https://gerrit.wikimedia.org/r/346248 (https://phabricator.wikimedia.org/T116335) (owner: Mobrovac) [15:04:42] (CR) Alexandros Kosiaris: [V: 2 C: 2] RESTBase: Migrate to Scap3 deployment [puppet] - https://gerrit.wikimedia.org/r/346248 (https://phabricator.wikimedia.org/T116335) (owner: Mobrovac) [15:04:56] looks like just the icinga checks failed temporarily? [15:05:01] RECOVERY - cassandra-a service on restbase-dev1001 is OK: OK - cassandra-a is active [15:05:06] !log restbase disabling puppet for upgrade to scap3 deploys [15:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:21] RECOVERY - Disk space on restbase-dev1001 is OK: DISK OK [15:06:21] RECOVERY - cassandra-b CQL 10.64.0.37:9042 on restbase-dev1001 is OK: TCP OK - 0.001 second response time on 10.64.0.37 port 9042 [15:06:22] RECOVERY - cassandra-b SSL 10.64.0.37:7001 on restbase-dev1001 is OK: SSL OK - Certificate restbase-dev1001-b valid until 2018-01-05 22:53:03 +0000 (expires in 270 days) [15:06:36] any networking issue causing this?
[15:06:41] RECOVERY - cassandra-a SSL 10.64.0.36:7001 on restbase-dev1001 is OK: SSL OK - Certificate restbase-dev1001-a valid until 2018-01-05 22:53:02 +0000 (expires in 270 days) [15:06:47] (Abandoned) Faidon Liambotis: Disable RESTBase config.yaml deploys in puppet [puppet] - https://gerrit.wikimedia.org/r/229306 (https://phabricator.wikimedia.org/T107532) (owner: GWicke) [15:07:08] elukey: causing what exactly? [15:07:11] (PS1) Volans: Fix logging/dry-run setup orders [switchdc] - https://gerrit.wikimedia.org/r/347393 (https://phabricator.wikimedia.org/T160178) [15:07:21] RECOVERY - cassandra-a CQL 10.64.0.36:9042 on restbase-dev1001 is OK: TCP OK - 0.001 second response time on 10.64.0.36 port 9042 [15:07:29] paravoid: the above alarms of yarn (hadoop workers) [15:07:41] (PS2) Jcrespo: Repool db1055 with full weight [mediawiki-config] - https://gerrit.wikimedia.org/r/347385 (https://phabricator.wikimedia.org/T159319) [15:07:51] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:07:56] yeah, but what was the underlying cause for those alerts? [15:08:04] well will double check, spot checking on hosts seems like the daemon stayed up [15:08:10] communication lost and between which hosts for example [15:08:22] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 seconds ago with 1 failures.
Failed resources (up to 3 shown): Scap_source[restbase/deploy] [15:08:41] RECOVERY - puppet last run on db1048 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [15:09:00] paravoid: einsteinium and all the hadoop workers (so analytics1049.eqiad.wmnet for example) [15:09:25] (CR) Giuseppe Lavagetto: [C: 1] Fix logging/dry-run setup orders [switchdc] - https://gerrit.wikimedia.org/r/347393 (https://phabricator.wikimedia.org/T160178) (owner: Volans) [15:10:00] (CR) Jcrespo: [C: 2] Repool db1055 with full weight [mediawiki-config] - https://gerrit.wikimedia.org/r/347385 (https://phabricator.wikimedia.org/T159319) (owner: Jcrespo) [15:10:10] (PS1) Ladsgroup: Use redis lock manager for dispatching in all production repo instances [mediawiki-config] - https://gerrit.wikimedia.org/r/347395 (https://phabricator.wikimedia.org/T159826) [15:10:16] or analytics1002 <-> analytics1049 [15:10:17] (CR) Volans: [C: 2] Fix logging/dry-run setup orders [switchdc] - https://gerrit.wikimedia.org/r/347393 (https://phabricator.wikimedia.org/T160178) (owner: Volans) [15:10:24] (current master - worker) [15:10:29] checking logs in the meantime [15:11:22] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:11:45] analytics1067 & analytics1068 are flapping [15:12:00] but that has been the case for hours [15:12:04] (Merged) jenkins-bot: Repool db1055 with full weight [mediawiki-config] - https://gerrit.wikimedia.org/r/347385 (https://phabricator.wikimedia.org/T159319) (owner: Jcrespo) [15:12:14] (CR) jenkins-bot: Repool db1055 with full weight [mediawiki-config] - https://gerrit.wikimedia.org/r/347385 (https://phabricator.wikimedia.org/T159319) (owner: Jcrespo) [15:12:22] PROBLEM - Disk space on restbase-dev1001 is CRITICAL: DISK CRITICAL - free space: / 127 MB (0% inode=95%) [15:12:24] yeah those are new ones...
I think that somehow the current master (analytics1002) had a blip for a moment [15:12:26] Operations, Ops-Access-Requests: Icinga contact/permissions for cwdent (cdentinger) - https://phabricator.wikimedia.org/T159564#3168602 (cwdent) Open>Resolved @dzahn thank you for the help, everything seems to be working now. [15:12:39] other than that, I don't see anything relevant [15:12:41] PROBLEM - cassandra-a SSL 10.64.0.36:7001 on restbase-dev1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:12:59] paravoid: thanks, will keep checking [15:13:01] PROBLEM - cassandra-a service on restbase-dev1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [15:13:21] PROBLEM - cassandra-a CQL 10.64.0.36:9042 on restbase-dev1001 is CRITICAL: connect to address 10.64.0.36 and port 9042: Connection refused [15:13:21] PROBLEM - cassandra-b CQL 10.64.0.37:9042 on restbase-dev1001 is CRITICAL: connect to address 10.64.0.37 and port 9042: Connection refused [15:13:22] PROBLEM - Check systemd state on restbase-dev1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:13:22] PROBLEM - cassandra-b SSL 10.64.0.37:7001 on restbase-dev1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:13:31] PROBLEM - cassandra-b service on restbase-dev1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [15:13:35] I am going to deploy https://gerrit.wikimedia.org/r/347385 [15:14:16] +1 [15:15:28] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1055 after maintenance with full weight (duration: 00m 39s) [15:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:20] Operations: codfw hosts occasionally spend > 3 minutes starting networking.service with linux 4.9 - https://phabricator.wikimedia.org/T162612#3168608 (ema) [15:17:01] RECOVERY - cassandra-a service on restbase-dev1001 is OK: OK - cassandra-a is active [15:17:31] RECOVERY - cassandra-b service on restbase-dev1001 is OK: OK - cassandra-b is active [15:18:06] ^^ on that (also not production) [15:18:08] !log mobrovac@tin Started deploy [restbase/deploy@a8d4d02]: (no justification provided) [15:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:21] RECOVERY - cassandra-b CQL 10.64.0.37:9042 on restbase-dev1001 is OK: TCP OK - 0.000 second response time on 10.64.0.37 port 9042 [15:18:21] RECOVERY - Disk space on restbase-dev1001 is OK: DISK OK [15:18:22] RECOVERY - cassandra-b SSL 10.64.0.37:7001 on restbase-dev1001 is OK: SSL OK - Certificate restbase-dev1001-b valid until 2018-01-05 22:53:03 +0000 (expires in 270 days) [15:18:33] <_joe_> urandom: might be related to the deploy of restbase via scap to the test cluster? [15:18:38] <_joe_> oh, no [15:18:40] _joe_: nope [15:19:31] !log mobrovac@tin Finished deploy [restbase/deploy@a8d4d02]: (no justification provided) (duration: 01m 22s) [15:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:39] elukey: ping?
[15:20:01] PROBLEM - cassandra-a service on restbase-dev1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [15:20:11] (Abandoned) Gehel: postgresql - clean up of python / bash code after volans review [puppet] - https://gerrit.wikimedia.org/r/346952 (owner: Gehel) [15:20:19] urandom: o/ [15:20:31] elukey: hi! :) [15:20:43] elukey: remember when you reimaged restbase-dev1001 for us? [15:20:44] (CR) Volans: [C: 1] "LGTM, but let's wait to understand the MW side of it?" [software/conftool] - https://gerrit.wikimedia.org/r/347356 (https://phabricator.wikimedia.org/T156924) (owner: Giuseppe Lavagetto) [15:21:41] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[restbase/deploy] [15:22:00] elukey: i've no idea what that process looks like on your end, but somehow that resulted in /srv/ not being the raid0 volume [15:22:30] urandom: :( [15:22:41] (PS2) Gehel: postgresql - cleanup dead code after migration to check-postgres package [puppet] - https://gerrit.wikimedia.org/r/346964 (https://phabricator.wikimedia.org/T162345) [15:22:51] (CR) Gehel: "rebased" [puppet] - https://gerrit.wikimedia.org/r/346964 (https://phabricator.wikimedia.org/T162345) (owner: Gehel) [15:22:56] urandom: I assumed that the partman config was taking care of everything, but probably I missed something [15:23:01] RECOVERY - cassandra-a service on restbase-dev1001 is OK: OK - cassandra-a is active [15:23:08] urandom: can you open a phab task?
I'll try to fix it [15:23:15] elukey: yeah [15:23:21] RECOVERY - Check systemd state on restbase-dev1001 is OK: OK - running: The system is fully operational [15:23:39] Operations: codfw hosts occasionally spend > 3 minutes starting networking.service with linux 4.9 - https://phabricator.wikimedia.org/T162612#3168635 (MoritzMuehlenhoff) These are tracked under the perf section (Performance monitoring) in Kconfig: "Include support for Intel uncore performance events. These a... [15:24:21] RECOVERY - cassandra-a CQL 10.64.0.36:9042 on restbase-dev1001 is OK: TCP OK - 0.000 second response time on 10.64.0.36 port 9042 [15:24:37] Operations: codfw hosts occasionally spend > 3 minutes starting networking.service with linux 4.9 - https://phabricator.wikimedia.org/T162612#3168636 (MoritzMuehlenhoff) JFTR, this code was only made modular in between Linux 4.4 and 4.9; in Linux 4.4 the code is built-in. [15:24:41] RECOVERY - cassandra-a SSL 10.64.0.36:7001 on restbase-dev1001 is OK: SSL OK - Certificate restbase-dev1001-a valid until 2018-01-05 22:53:02 +0000 (expires in 270 days) [15:26:16] (CR) Gehel: [C: 2] postgresql - cleanup dead code after migration to check-postgres package [puppet] - https://gerrit.wikimedia.org/r/346964 (https://phabricator.wikimedia.org/T162345) (owner: Gehel) [15:27:36] (Abandoned) Gehel: WIP - postgresql::user should be idempotent [puppet] - https://gerrit.wikimedia.org/r/298960 (https://phabricator.wikimedia.org/T138092) (owner: Gehel) [15:28:06] !log mobrovac@tin Started deploy [restbase/deploy@a8d4d02]: Initial deployment with Scap3 on staging [15:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:36] RECOVERY - puppet last run on nihal is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:29:58] (CR) Ottomata: [C: 2] Add python3 packages to hadoop workers for ORES in hadoop [puppet] - https://gerrit.wikimedia.org/r/346812 (owner: Ottomata)
[15:30:04] (PS2) Ottomata: Add python3 packages to hadoop workers for ORES in hadoop [puppet] - https://gerrit.wikimedia.org/r/346812 [15:30:10] (CR) Ottomata: [V: 2 C: 2] Add python3 packages to hadoop workers for ORES in hadoop [puppet] - https://gerrit.wikimedia.org/r/346812 (owner: Ottomata) [15:31:37] !log mobrovac@tin Finished deploy [restbase/deploy@a8d4d02]: Initial deployment with Scap3 on staging (duration: 03m 31s) [15:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:46] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [15:32:50] Operations, ops-codfw, Performance-Team: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3168657 (Papaul) a:Papaul>fgiunchedi @fgiunchedi the flash drive is not in graphite2001 [15:33:07] (PS1) Cmjohnson: Adding mac address that was missing for analytics1068 T162216 [puppet] - https://gerrit.wikimedia.org/r/347398 [15:33:29] !log restbase enabling back puppet in prod [15:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:36] PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:33:56] PROBLEM - puppet last run on analytics1042 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [15:34:00] Operations, Cassandra, Services: RAID-0 volume not mounted on restbase-dev1001.eqiad.wmnet - https://phabricator.wikimedia.org/T162614#3168663 (Eevans) [15:34:06] elukey: ^^ [15:34:07] (PS1) Volans: Logging: use NodeSet for more compact outputs [switchdc] - https://gerrit.wikimedia.org/r/347399 (https://phabricator.wikimedia.org/T160178) [15:34:21] thanks urandom, will check tomorrow and try to fix [15:34:25] (Abandoned) Cmjohnson: Adding mac address that was missing for analytics1068 T162216 [puppet] - https://gerrit.wikimedia.org/r/347398 (owner: Cmjohnson) [15:35:06] !log mobrovac@tin Started deploy [restbase/deploy@a8d4d02]: Initial deployment with Scap3 [15:35:12] (PS4) Rush: admin: add a group for cloud services roots [puppet] - https://gerrit.wikimedia.org/r/346838 (https://phabricator.wikimedia.org/T162404) [15:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:16] !log mobrovac@tin Finished deploy [restbase/deploy@a8d4d02]: Initial deployment with Scap3 (duration: 00m 10s) [15:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:36] PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [15:35:48] (PS5) Rush: admin: add a group for cloud services roots [puppet] - https://gerrit.wikimedia.org/r/346838 (https://phabricator.wikimedia.org/T162404) [15:36:09] Operations, Labs: create a 'root' group with bdavis strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3168682 (chasemp) [15:36:33] Operations, Ops-Access-Requests, Labs: create a 'root' group with bdavis strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3161730 (chasemp) [15:38:46] (PS1) Rush: admin: add bd808 to cs-roots [puppet] - https://gerrit.wikimedia.org/r/347400 (https://phabricator.wikimedia.org/T162404) [15:39:06] (PS2) Rush: admin: add bd808 to cs-roots [puppet] - https://gerrit.wikimedia.org/r/347400 (https://phabricator.wikimedia.org/T162404) [15:39:45] Operations, Ops-Access-Requests, Labs: create a 'root' group with bdavis strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3168687 (chasemp) Adding the group with initial roles: https://gerrit.wikimedia.org/r/#/c/346838/ Adding bd808 if that all makes sense... [15:42:43] (PS1) Giuseppe Lavagetto: Make reason for disabling puppet fixed [switchdc] - https://gerrit.wikimedia.org/r/347401 [15:42:45] (PS1) Giuseppe Lavagetto: Filter out verbose output from cumin in dry-run mode [switchdc] - https://gerrit.wikimedia.org/r/347402 [15:43:26] PROBLEM - puppet last run on analytics1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:36] PROBLEM - puppet last run on analytics1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:36] PROBLEM - puppet last run on analytics1057 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [15:44:42] (PS1) Cmjohnson: Adding mac address for analytis1068 [puppet] - https://gerrit.wikimedia.org/r/347403 [15:46:09] (CR) Muehlenhoff: [C: 1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/346838 (https://phabricator.wikimedia.org/T162404) (owner: Rush) [15:46:36] PROBLEM - puppet last run on analytics1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:47:25] !log troubleshooting link cr2-eqiad:xe-3/0/1 {#2014 to asw-b-eqiad:xe-1/1/2 per T162199 [15:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:32] T162199: Faulty optics on asw-b-eqiad:xe-1/1/2 - https://phabricator.wikimedia.org/T162199 [15:47:39] i gather /etc/fstab isn't normally managed by puppet in our fleet, is that correct? [15:48:16] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:48:23] is it generated by the installer, and then managed by-hand afterward? [15:49:46] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:50:26] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:50:35] (CR) Cmjohnson: [C: 2] Adding mac address for analytis1068 [puppet] - https://gerrit.wikimedia.org/r/347403 (owner: Cmjohnson) [15:50:46] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:51:26] PROBLEM - puppet last run on analytics1052 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [15:51:46] urandom: yeah fstab isn't generally touched by puppet, with some exceptions [15:52:15] kk [15:52:16] PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:52:36] PROBLEM - puppet last run on analytics1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:53:24] elukey ottomata I take it the puppet failures there are legit? [15:54:04] godog: I am still not sure why all the yarn daemons complained at the same time when the new master came up (still without any daemons after the reimage, really weird) [15:54:11] I am going to check puppet now [15:54:32] papaul: re: T161538 yeah the drive isn't currently in graphite2001 but in wmf6406, it needs to be plugged in into graphite2001 now though [15:54:33] T161538: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538 [15:54:46] PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:55:04] !log mobrovac@tin Started deploy [restbase/deploy@2c70843]: Initial deployment with Scap3 [15:55:08] Operations, Cassandra, Services: RAID-0 volume not mounted on restbase-dev1001.eqiad.wmnet - https://phabricator.wikimedia.org/T162614#3168712 (Eevans) p:Triage>Normal a:Eevans [15:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:16] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:55:26] PROBLEM - puppet last run on analytics1040 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [15:55:28] godog: Error 400 on SERVER: undefined method `function_create_resources' for nil:NilClass at /etc/puppet/modules/role/manifests/analytics_cluster/hadoop/worker.pp:120 on node analytics1032.eqiad.wmnet [15:55:36] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:55:52] lolz [15:57:40] Operations, Cassandra, Services (blocked): RAID-0 volume not mounted on restbase-dev1001.eqiad.wmnet - https://phabricator.wikimedia.org/T162614#3168715 (mobrovac) [15:58:14] Operations, Cassandra, Services (doing): RAID-0 volume not mounted on restbase-dev1001.eqiad.wmnet - https://phabricator.wikimedia.org/T162614#3168717 (Eevans) [15:58:27] !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t00_disable_puppet(eqiad, codfw) Disabling puppet on MediaWiki jobrunners and videoscalers [15:58:31] !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t00_disable_puppet(eqiad, codfw) Successfully completed [15:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:10] Operations, Discovery, Maps, Interactive-Sprint, and 2 others: Reduce number of false positive alerts on postgresql lag for maps - https://phabricator.wikimedia.org/T162345#3168718 (Gehel) No new false positive since changes have been deployed. Things are looking good! [15:59:26] PROBLEM - puppet last run on analytics1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:01:16] PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:01:36] PROBLEM - puppet last run on analytics1037 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [16:02:36] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:02:56] !log mobrovac@tin Finished deploy [restbase/deploy@2c70843]: Initial deployment with Scap3 (duration: 07m 52s) [16:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:49] godog: it is in graphite2001 [16:06:57] papaul: thanks! I'll let you know when it is safe to remove from there [16:07:46] Operations, ops-codfw, Performance-Team: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3168732 (Papaul) @fgiunchedi the drive is now in graphite2001 [16:07:57] godog: I think it is related to https://gerrit.wikimedia.org/r/346812 [16:08:25] (CR) Daniel Kinzler: [C: 1] "Yes, we want this. This has been extensively tested now by Ladsgroop, Hoo, and myself." [mediawiki-config] - https://gerrit.wikimedia.org/r/347395 (https://phabricator.wikimedia.org/T159826) (owner: Ladsgroup) [16:08:44] (PS2) Giuseppe Lavagetto: Filter out verbose output from cumin from stdout [switchdc] - https://gerrit.wikimedia.org/r/347402 [16:09:30] (PS2) Daniel Kinzler: Use redis lock manager for dispatching in all production repo instances [mediawiki-config] - https://gerrit.wikimedia.org/r/347395 (https://phabricator.wikimedia.org/T159826) (owner: Ladsgroup) [16:12:51] Operations, Wikimedia-Mailing-lists: Reset admin password for wikimania-program mailing list - https://phabricator.wikimedia.org/T162080#3168740 (Dzahn) @eyoung @aklapper This can be closed i assume, right?
[16:13:16] (CR) Volans: [C: 2] Make reason for disabling puppet fixed [switchdc] - https://gerrit.wikimedia.org/r/347401 (owner: Giuseppe Lavagetto) [16:19:42] Operations, Ops-Access-Requests: Icinga contact/permissions for cwdent (cdentinger) - https://phabricator.wikimedia.org/T159564#3168765 (Dzahn) @cwdent great! thanks for confirming [16:20:33] (CR) Giuseppe Lavagetto: "Yes, you should coordinate with ops before trying this." [mediawiki-config] - https://gerrit.wikimedia.org/r/347395 (https://phabricator.wikimedia.org/T159826) (owner: Ladsgroup) [16:20:43] (PS1) Bartosz Dziewoński: Remove defunct $wgForeignUploadTestEnabled for cross-wiki upload A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/347412 [16:20:48] anyone wants to deploy a no-op? https://gerrit.wikimedia.org/r/347412 [16:21:33] MatmaRex: it's not a labs only, so it's better to include it to SWAT [16:21:49] ok [16:22:08] jouncebot: next [16:22:08] In 0 hour(s) and 37 minute(s): Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170410T1700) [16:22:26] jouncebot: in 1 hours 37 minutes the next one I think [16:27:40] Operations, Labs: Initial OpenStack Neutron PoC deployment in Labtest - https://phabricator.wikimedia.org/T153099#3168775 (chasemp) We have neutron deployed initially but not functioning as a control plane for anything yet. The first part of this has been working through the keystone/ldap/neutron integr...
[16:31:08] (PS1) Elukey: Revert "Add python3 packages to hadoop workers for ORES in hadoop" [puppet] - https://gerrit.wikimedia.org/r/347414 [16:31:22] (PS2) Elukey: Revert "Add python3 packages to hadoop workers for ORES in hadoop" [puppet] - https://gerrit.wikimedia.org/r/347414 [16:34:24] (CR) Elukey: [C: 2] Revert "Add python3 packages to hadoop workers for ORES in hadoop" [puppet] - https://gerrit.wikimedia.org/r/347414 (owner: Elukey) [16:35:25] (CR) Volans: [C: 2] Filter out verbose output from cumin from stdout [switchdc] - https://gerrit.wikimedia.org/r/347402 (owner: Giuseppe Lavagetto) [16:36:17] godog: puppet runs should be fixed [16:36:26] RECOVERY - puppet last run on analytics1040 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [16:36:31] just did --^ [16:36:56] elukey: neat, thanks! I think it might be the version specification in one of the packages? [16:37:15] godog: could be yes, not sure, I'll ping Andrew! [16:37:34] there are multiple packages with two versions available on apt-cache [16:37:39] so not sure if this could be another issue [16:40:44] (PS2) Andrew Bogott: fullstack: Switch to the normal jessie image [puppet] - https://gerrit.wikimedia.org/r/346984 [16:40:49] (PS1) Andrew Bogott: nova-scheduler: Temporarily depool labvirt1002 [puppet] - https://gerrit.wikimedia.org/r/347415 (https://phabricator.wikimedia.org/T162529) [16:41:16] (CR) Rush: [C: 1] nova-scheduler: Temporarily depool labvirt1002 [puppet] - https://gerrit.wikimedia.org/r/347415 (https://phabricator.wikimedia.org/T162529) (owner: Andrew Bogott) [16:42:16] Operations, Ops-Access-Requests, Labs: create a 'root' group with bdavis strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3161730 (RobH) >>! In T162404#3168687, @chasemp wrote: > Adding the group with initial roles: > > https://gerrit.wikimedia.org/r/#/c/34...
[16:42:35] RECOVERY - puppet last run on analytics1043 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:42:35] RECOVERY - puppet last run on analytics1033 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [16:42:39] (CR) Andrew Bogott: [C: 2] nova-scheduler: Temporarily depool labvirt1002 [puppet] - https://gerrit.wikimedia.org/r/347415 (https://phabricator.wikimedia.org/T162529) (owner: Andrew Bogott) [16:43:35] RECOVERY - puppet last run on analytics1057 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [16:44:35] RECOVERY - puppet last run on analytics1035 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:45:48] !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t09_restart_parsoid(codfw, eqiad) Rolling restart parsoid in eqiad and codfw [16:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:15] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [16:48:20] <_joe_> !log not really restarting parsoid, still testing swtichdc [16:48:25] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [16:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:45] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:48:45] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [16:50:25] RECOVERY - puppet last run on analytics1052 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [16:50:35] RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [16:51:25] RECOVERY - puppet last run on
analytics1036 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [16:53:15] RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:53:45] RECOVERY - puppet last run on analytics1041 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:54:35] RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [16:56:31] (Abandoned) Volans: Logging: use NodeSet for more compact outputs [switchdc] - https://gerrit.wikimedia.org/r/347399 (https://phabricator.wikimedia.org/T160178) (owner: Volans) [16:56:38] (CR) Dereckson: "I'm not sure about * Bureaucrats can demote sysops and bureaucrats." [mediawiki-config] - https://gerrit.wikimedia.org/r/347214 (https://phabricator.wikimedia.org/T162510) (owner: Urbanecm) [16:57:25] RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [17:00:04] gehel: Respected human, time to deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170410T1700). Please do the needful. [17:00:04] gehel: A patch you scheduled for Weekly Wikidata query service deployment window is about to be deployed. Please be available during the process.
[17:00:15] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:00:35] RECOVERY - puppet last run on analytics1037 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [17:00:35] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:01:05] RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [17:01:05] Operations, Ops-Access-Requests, Labs: create a 'root' group with bdavis strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3161730 (faidon) Looks good to me and there were no objections in the ops meeting either. Bikeshedding a little bit: the "cs-roots" nam... [17:01:26] Operations, Ops-Access-Requests, Labs: create a 'root' group with bdavis strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3161730 (Dzahn) I would vote for **cloud-roots**. "cloud" seems the most obvious term for the team and of 17 root groups, 16 end in -ro... [17:01:30] Operations, RESTBase, Patch-For-Review, Scap (Scap3-Adoption-Phase1), and 2 others: Deploy RESTBase with scap3 - https://phabricator.wikimedia.org/T116335#3168878 (mobrovac) [17:01:45] RECOVERY - puppet last run on analytics1042 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [17:04:25] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [17:04:26] Operations, Ops-Access-Requests, Labs: create a 'root' group with bdavis strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3168881 (chasemp) I'm open to what seems best :) `wmcs-roots` seems most popular, but `cloud-roots` is really clear. I'm punting to @b...
[17:04:49] !log gehel@tin Started deploy [wdqs/wdqs@1cfbd8d]: (no justification provided) [17:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:05] Operations, Ops-Access-Requests, Labs: create a 'root' group with bdavis strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3161730 (fgiunchedi) from the "naming is hard" bandwagon: I'd suggest an easily grep-able and unambiguous name like `wmcs` to be used ev... [17:05:55] RECOVERY - Host lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 36.59 ms [17:06:11] !log gehel@tin Finished deploy [wdqs/wdqs@1cfbd8d]: (no justification provided) (duration: 01m 22s) [17:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:27] SMalyshev: deployment done and looking good [17:06:56] (PS5) Gehel: wdqs: active/active public interface [puppet] - https://gerrit.wikimedia.org/r/346543 (https://phabricator.wikimedia.org/T162111) [17:08:05] PROBLEM - pybal on lvs2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [17:08:05] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:08:05] PROBLEM - PyBal backends health check on lvs2002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [17:08:12] Operations, Ops-Access-Requests, Labs: create a 'root' group with bdavis strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3168886 (chasemp) Good point @fgiunchedi -- i n purely searchable terms using `cs` is probably too diminutive and `wmcs` will be more se...
[17:09:07] Operations, ops-codfw, DBA, Patch-For-Review: codfw rack/setup 22 DB servers - https://phabricator.wikimedia.org/T162159#3168887 (Papaul) [17:09:10] Operations, Ops-Access-Requests, Labs: create a 'root' group with bdavis strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3168888 (bd808) Let's do `wmcs-roots`. WMCS is our official short form written name for Wikimedia Cloud Services. Cookie licking `cloud`... [17:09:14] (CR) Gehel: [C: 2] wdqs: active/active public interface [puppet] - https://gerrit.wikimedia.org/r/346543 (https://phabricator.wikimedia.org/T162111) (owner: Gehel) [17:10:10] SMalyshev, bblack: I merged https://gerrit.wikimedia.org/r/#/c/346543/ (WDQS going active / active). I'll keep an eye on logs / graphs, but ping me if you see anything strange... [17:10:54] (CR) Faidon Liambotis: [C: -1] "If we're going for a separate file, /etc/ferm/conntrack-sysctl.conf as I this patch has done, the applying this via a ferm post hook ('@ho" [puppet] - https://gerrit.wikimedia.org/r/320197 (https://phabricator.wikimedia.org/T136094) (owner: Muehlenhoff) [17:11:27] Operations, Wikimedia-Mailing-lists: Reset admin password for wikimania-program mailing list - https://phabricator.wikimedia.org/T162080#3168894 (eyoung) yes thank you!!! I'm swamped with scholarship wikimania stuff or I would have responded sooner. thanks for your help! [17:12:08] Operations, Wikimedia-Mailing-lists: Reset admin password for wikimania-program mailing list - https://phabricator.wikimedia.org/T162080#3168896 (Dzahn) Open>Resolved a:Dzahn @eyoung ok, great :) closing ticket [17:12:49] gehel: ok [17:14:55] PROBLEM - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% [17:15:25] wow [17:15:41] elukey: working on it [17:15:45] ahh okok :) [17:16:04] papaul: can you log it in the sal?
[17:16:20] sure [17:16:45] thanks, just to have it logged somewhere for other people :) [17:16:45] !log testing lvs2002 after mainboard replacement [17:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:05] RECOVERY - pybal on lvs2002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [17:17:15] RECOVERY - PyBal backends health check on lvs2002 is OK: PYBAL OK - All pools are healthy [17:17:15] RECOVERY - Host lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [17:17:16] (CR) Muehlenhoff: "The ferm post hook didn't work, see my earlier test results at https://gerrit.wikimedia.org/r/#/c/319071/" [puppet] - https://gerrit.wikimedia.org/r/320197 (https://phabricator.wikimedia.org/T136094) (owner: Muehlenhoff) [17:18:02] Operations, ops-codfw, DC-Ops, Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3168914 (Gehel) @Papaul has put in place new SSD in that server. I've been running the same kind of load test as before for most of the day... [17:18:04] (CR) Muehlenhoff: "The sysctl config file is populated via https://gerrit.wikimedia.org/r/#/c/319071/" [puppet] - https://gerrit.wikimedia.org/r/320197 (https://phabricator.wikimedia.org/T136094) (owner: Muehlenhoff) [17:18:05] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [17:19:25] Operations, ops-codfw, Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3168916 (Papaul) main board replacement complete on lvs2002, System is back up. @elukey please check everything is okay while I am on site. Thanks. [17:19:31] elukey: https://phabricator.wikimedia.org/T162099 [17:23:09] Operations, Multimedia, media-storage: 404 error while accessing some images files (e.g.
djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3144779 (greg) >>! In T161836#3146624, @fgiunchedi wrote: > Since some files linked here seem to 200 now (instead of 40... [17:29:14] Operations, ops-codfw, Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3168958 (Papaul) a:Papaul>elukey [17:31:55] Operations, Patch-For-Review, User-Elukey, Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3168964 (elukey) Summary of the various issues as I understand them: 1) the descrip... [17:31:56] (PS1) Catrope: Enable Flow beta feature on frwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/347419 (https://phabricator.wikimedia.org/T162022) [17:32:45] Operations, ops-codfw, Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3168972 (BBlack) a:elukey>BBlack Switching this to me [17:38:29] (PS1) Giuseppe Lavagetto: Fix output typo in the redis task [switchdc] - https://gerrit.wikimedia.org/r/347420 [17:38:36] <_joe_> volans: ^^ [17:40:05] (CR) Volans: [C: 1] "LGTM" [switchdc] - https://gerrit.wikimedia.org/r/347420 (owner: Giuseppe Lavagetto) [17:41:52] (CR) Jforrester: [C: 1] "Cleanup noop, good to go whenever." [mediawiki-config] - https://gerrit.wikimedia.org/r/347412 (owner: Bartosz Dziewoński) [17:44:10] (PS1) Ottomata: Revert "Revert "Add python3 packages to hadoop workers for ORES in hadoop"" [puppet] - https://gerrit.wikimedia.org/r/347421 [17:44:25] Operations, ops-codfw, DC-Ops, Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3168987 (Papaul) @Gehel Thank you.
[17:44:42] (PS1) Jforrester: Wikitech: Don't try to do cross-wiki uploads to Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/347422 (https://phabricator.wikimedia.org/T162374) [17:44:53] (PS2) Ottomata: Revert "Revert "Add python3 packages to hadoop workers for ORES in hadoop"" [puppet] - https://gerrit.wikimedia.org/r/347421 [17:47:59] (CR) Ottomata: [C: 2] Revert "Revert "Add python3 packages to hadoop workers for ORES in hadoop"" [puppet] - https://gerrit.wikimedia.org/r/347421 (owner: Ottomata) [17:50:07] (PS1) Giuseppe Lavagetto: Correct offset of the main task to be 0 [switchdc] - https://gerrit.wikimedia.org/r/347423 [17:50:47] i think i'm about to get soem puppet errors ... [17:50:55] (PS1) Ottomata: Separating out require_package calls in attempt to figure out weird error message [puppet] - https://gerrit.wikimedia.org/r/347424 [17:51:17] (CR) Ottomata: [V: 2 C: 2] Separating out require_package calls in attempt to figure out weird error message [puppet] - https://gerrit.wikimedia.org/r/347424 (owner: Ottomata) [17:51:25] PROBLEM - puppet last run on analytics1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:26] PROBLEM - puppet last run on analytics1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:29] (PS4) Catrope: Enable RCFilters beta feature on fawiki, ruwiki, trwiki, and frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343438 [17:51:30] yayayay [17:51:31] i know [17:51:34] ahhahah [17:51:44] (PS5) Catrope: Enable RCFilters beta feature on fawiki, ruwiki, trwiki, and frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343438 (https://phabricator.wikimedia.org/T144458) [17:51:44] !log restore Hadoop masters to analytics1001 [17:51:45] PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [17:51:45] PROBLEM - puppet last run on analytics1042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:50] i'm shooting in the dark at it [17:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:59] wut the crap crackers is this [17:52:19] (PS4) Catrope: Enable RCFilters beta feature on all wikis except wikidatawiki, nlwiki, cswiki and etwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343439 [17:52:25] PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:32] (PS5) Catrope: Enable RCFilters beta feature on all wikis except wikidatawiki, nlwiki, cswiki and etwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343439 (https://phabricator.wikimedia.org/T144458) [17:52:35] PROBLEM - puppet last run on analytics1032 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [17:52:55] (PS1) Ottomata: Using array argument for require_package [puppet] - https://gerrit.wikimedia.org/r/347425 [17:53:09] (CR) Catrope: "For deployment 2017-04-11 13:00 UTC" [mediawiki-config] - https://gerrit.wikimedia.org/r/343438 (https://phabricator.wikimedia.org/T144458) (owner: Catrope) [17:53:13] (CR) Ottomata: [V: 2 C: 2] Using array argument for require_package [puppet] - https://gerrit.wikimedia.org/r/347425 (owner: Ottomata) [17:54:26] (PS2) Catrope: Enable RCFilters beta feature on all remaining wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/347045 [17:54:37] (PS1) Ottomata: Remove python3-numpy=version from list of installed packages to test fix require_package error [puppet] - https://gerrit.wikimedia.org/r/347426 [17:54:38] (PS3) Catrope: Enable RCFilters beta feature on all remaining wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/347045 (https://phabricator.wikimedia.org/T144458) [17:54:58] (CR) Ottomata: [V: 2 C: 2] Remove python3-numpy=version from list of installed packages to test fix require_package error [puppet] - https://gerrit.wikimedia.org/r/347426 (owner: Ottomata) [17:55:15] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:55:35] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [17:57:54] (PS1) Ottomata: Ensuring python and python3 versions of numpy are the same on hadoop workers [puppet] - https://gerrit.wikimedia.org/r/347427 [17:58:17] (CR) Ottomata: [V: 2 C: 2] Ensuring python and python3 versions of numpy are the same on hadoop workers [puppet] - https://gerrit.wikimedia.org/r/347427 (owner: Ottomata) [17:58:45] RECOVERY - puppet last run on analytics1041 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [17:59:46] Operations, ops-eqiad, netops: Faulty optics on asw-b-eqiad:xe-1/1/2 - https://phabricator.wikimedia.org/T162199#3169051 (ayounsi) Open>Resolved Interface has been stable. Everything looks good. Thanks! [17:59:51] (PS1) Ottomata: Fix missing comma in package list [puppet] - https://gerrit.wikimedia.org/r/347428 [18:00:04] (CR) Ottomata: [V: 2 C: 2] Fix missing comma in package list [puppet] - https://gerrit.wikimedia.org/r/347428 (owner: Ottomata) [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170410T1800). Please do the needful. [18:00:04] Amir1: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:17] o/ [18:00:25] PROBLEM - puppet last run on analytics1038 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python3-sklearn] [18:01:15] PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:01:35] PROBLEM - puppet last run on analytics1043 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [18:01:45] PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:02:04] Operations, netops: Slight packet loss observed on the network starting Nov 2016 - https://phabricator.wikimedia.org/T154507#3169060 (ayounsi) Open>Resolved a:ayounsi > XioNoX> I'm secretly hoping that T154507 was caused by T162199, it's on the path, and the LACP hashing algorithm would expla... [18:02:20] (PS1) Ottomata: Use package resource instead of require_package for ensuring specific python-numpy versions [puppet] - https://gerrit.wikimedia.org/r/347429 [18:02:35] PROBLEM - puppet last run on analytics1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:02:35] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:03:05] I can SWAT today [18:03:15] (PS2) Thcipriani: Enable ORES review tool in hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/345619 (https://phabricator.wikimedia.org/T161621) (owner: Ladsgroup) [18:03:30] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/345619 (https://phabricator.wikimedia.org/T161621) (owner: Ladsgroup) [18:03:46] (CR) Ottomata: [C: 2] Use package resource instead of require_package for ensuring specific python-numpy versions [puppet] - https://gerrit.wikimedia.org/r/347429 (owner: Ottomata) [18:04:05] PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [18:04:39] (Merged) jenkins-bot: Enable ORES review tool in hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/345619 (https://phabricator.wikimedia.org/T161621) (owner: Ladsgroup) [18:05:25] PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:06:57] !log create ores tables on hewiki [18:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:48] (CR) jenkins-bot: Enable ORES review tool in hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/345619 (https://phabricator.wikimedia.org/T161621) (owner: Ladsgroup) [18:07:50] (PS1) Ottomata: Install python3-sklearn-lib and make sure python3-numpy is installed first [puppet] - https://gerrit.wikimedia.org/r/347430 [18:08:46] Amir1: change is live on mwdebug1002, check please [18:08:53] Thanks [18:09:52] (CR) Ottomata: [C: 2] Install python3-sklearn-lib and make sure python3-numpy is installed first [puppet] - https://gerrit.wikimedia.org/r/347430 (owner: Ottomata) [18:10:13] thcipriani: basic parts looks fine, can you run the maintenance script to check out the recent changes? [18:10:32] (you need to pull the change to terbium too, last time it took some of my time :D) [18:11:53] Operations, ops-codfw, Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3169146 (BBlack) a:BBlack>ayounsi @papaul Everything looks good with lvs2002 (checked icinga, interfaces on correct vlans, etc). @ayounsi Let's let it burn in with no traffic until tomorrow s...
[18:12:09] Amir1: yup, running now [18:12:35] RECOVERY - puppet last run on analytics1043 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:12:48] !log mwscript extensions/ORES/maintenance/CheckModelVersions.php hewiki && mwscript extensions/ORES/maintenance/PopulateDatabase.php hewiki [18:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:48] (PS1) Ottomata: Reorder more python sklearn deps to work around require_package issue [puppet] - https://gerrit.wikimedia.org/r/347431 [18:15:58] (CR) jerkins-bot: [V: -1] Reorder more python sklearn deps to work around require_package issue [puppet] - https://gerrit.wikimedia.org/r/347431 (owner: Ottomata) [18:17:00] (PS2) Ottomata: Reorder more python sklearn deps to work around require_package issue [puppet] - https://gerrit.wikimedia.org/r/347431 [18:19:11] thcipriani: It looks okay [18:19:39] (CR) Ottomata: [C: 2] Reorder more python sklearn deps to work around require_package issue [puppet] - https://gerrit.wikimedia.org/r/347431 (owner: Ottomata) [18:19:50] Amir1: cool, is it fine to sync out the change while populateDatabase is finishing up?
[18:20:09] With the progress it made until now, I think so [18:20:20] ok, doing [18:20:25] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [18:20:27] RECOVERY - puppet last run on analytics1052 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [18:20:35] RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [18:22:26] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:345619|Enable ORES review tool in hewiki]] T161621 (duration: 00m 39s) [18:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:33] T161621: Deploy ORES Review Tool for hewiki - https://phabricator.wikimedia.org/T161621 [18:22:34] ^ Amir1 live everywhere now [18:22:49] I'll ping you when populateDatabase finishes [18:22:54] thanks [18:23:45] RECOVERY - puppet last run on analytics1041 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [18:25:15] RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [18:25:22] Amir1: populateDatabase just finished [18:25:26] RECOVERY - puppet last run on analytics1040 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [18:25:35] RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [18:25:50] Thanks! 
[18:25:54] (CR) Volans: "It looks more a workaround than a fix, probably at this point it would be better to change the menu.append() to properly save the items in" [switchdc] - https://gerrit.wikimedia.org/r/347423 (owner: Giuseppe Lavagetto) [18:26:17] I think we might need to change the default precision but it's outside of scope of this patch [18:28:15] (PS1) Ottomata: Add new hadoop worker nodes to site.pp [puppet] - https://gerrit.wikimedia.org/r/347432 (https://phabricator.wikimedia.org/T155065) [18:28:25] RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [18:30:15] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [18:30:23] (CR) Ottomata: [C: 2] Add new hadoop worker nodes to site.pp [puppet] - https://gerrit.wikimedia.org/r/347432 (https://phabricator.wikimedia.org/T155065) (owner: Ottomata) [18:32:35] RECOVERY - puppet last run on analytics1037 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [18:32:35] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [18:32:45] RECOVERY - puppet last run on analytics1042 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [18:32:49] hey cmjohnson1 i'm starting to puppetize the new hadoop workers [18:32:54] is 1068 still a little wonky? [18:33:10] no, it's good [18:33:19] it looks like its in busy box console or something?
[18:34:05] RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [18:35:12] cmjohnson1: ^^ [18:35:25] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [18:35:40] ottomata: looking [18:43:26] Operations, Cassandra, Services (done): RAID-0 volume not mounted on restbase-dev1001.eqiad.wmnet - https://phabricator.wikimedia.org/T162614#3169273 (Eevans) Open>Resolved This should be done; I did the following: - Brought down Cassandra, and masked the systemd units - Reformatted `/dev/re... [18:44:01] (PS6) Rush: admin: add a group for cloud services roots (wmcs-roots) [puppet] - https://gerrit.wikimedia.org/r/346838 (https://phabricator.wikimedia.org/T162404) [18:44:04] (PS3) Rush: admin: add bd808 to wmcs-roots [puppet] - https://gerrit.wikimedia.org/r/347400 (https://phabricator.wikimedia.org/T162404) [18:44:53] (PS7) Rush: admin: add a group for cloud services roots (wmcs-roots) [puppet] - https://gerrit.wikimedia.org/r/346838 (https://phabricator.wikimedia.org/T162404) [18:45:08] (PS4) Rush: admin: add bd808 to wmcs-roots [puppet] - https://gerrit.wikimedia.org/r/347400 (https://phabricator.wikimedia.org/T162404) [18:47:21] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/6073/" [puppet] - https://gerrit.wikimedia.org/r/347024 (owner: Dzahn) [18:49:36] (CR) Rush: [C: 2] admin: add a group for cloud services roots (wmcs-roots) [puppet] - https://gerrit.wikimedia.org/r/346838 (https://phabricator.wikimedia.org/T162404) (owner: Rush) [18:50:18] (CR) Rush: [C: 2] admin: add bd808 to wmcs-roots [puppet] - https://gerrit.wikimedia.org/r/347400 (https://phabricator.wikimedia.org/T162404) (owner: Rush) [18:56:25] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 368.00 seconds [18:56:46] ottomata: it's failing in the
partitioner [18:56:51] don't know why [18:56:55] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:57:00] Operations, Ops-Access-Requests, Labs: create a 'root' group with bdavis strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3169334 (chasemp) Open>Resolved a:chasemp >>! In T162404#3168888, @bd808 wrote: > Let's do `wmcs-roots`. WMCS is our officia... [18:59:31] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 21 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [19:00:34] (PS2) Dzahn: standardize "include ::base:*" [puppet] - https://gerrit.wikimedia.org/r/347024 [19:02:56] (CR) Dzahn: [C: 2] standardize "include ::base:*" [puppet] - https://gerrit.wikimedia.org/r/347024 (owner: Dzahn) [19:03:04] (PS3) Dzahn: standardize "include ::base:*" [puppet] - https://gerrit.wikimedia.org/r/347024 [19:04:31] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [19:05:21] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 46.00 seconds [19:08:08] Ops, Who should I talk to if I'm going to deploy this? https://gerrit.wikimedia.org/r/#/c/347395/ [19:08:23] checking redis instances [19:11:43] Amir1: elukey has been looking at redis stuff a lot recently. That might be a good person to start with [19:11:56] Thanks [19:13:21] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [19:17:05] Operations, ops-ulsfo, fundraising-tech-ops: rack/setup frbackup2001 - https://phabricator.wikimedia.org/T162469#3169388 (Papaul) [19:20:09] (PS1) Hashar: jenkins: tweak log permissions [puppet] - https://gerrit.wikimedia.org/r/347435 [19:20:43] hmm, ok weird i'll poke at it then cmjohnson1 [19:20:53] (PS2) Hashar: jenkins: tweak log permissions [puppet] - https://gerrit.wikimedia.org/r/347435 [19:20:57] cmjohnson1: also, something seems weird with analytics1064 too [19:20:59] it installed fine [19:21:11] but it gets Error: Could not request certificate: getaddrinfo: Name or service not known when attempting to talk to puppetmsater [19:21:29] (PS1) Urbanecm: Increase default image thumbnail size on Finnish Wikipedia to 250px [mediawiki-config] - https://gerrit.wikimedia.org/r/347436 (https://phabricator.wikimedia.org/T162376) [19:21:49] (CR) Hashar: "That one should be better. I have dropped the dupe File[/var/log/jenkins/access.log]. It is already created by systemd::syslog."
[puppet] - https://gerrit.wikimedia.org/r/347435 (owner: Hashar) [19:22:59] cmjohnson1: can you leave mgmt console com2 session on 1068 [19:23:05] i will powercycle and see what i can find [19:23:23] out [19:23:51] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [19:24:32] (CR) Ottomata: [C: 1] Add JVM options tunables for Yarn RM and Hadoop DN/NN [puppet/cdh] - https://gerrit.wikimedia.org/r/347353 (https://phabricator.wikimedia.org/T159219) (owner: Elukey) [19:27:49] (PS14) EBernhardson: Upgrade logstash to 5.x [puppet] - https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) [19:27:51] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [19:28:09] (PS2) Dzahn: typos: make "include profile::backup" a typo [puppet] - https://gerrit.wikimedia.org/r/347064 [19:28:42] cmjohnson1: was it just stuck on partitioning this once? [19:28:47] or did you try installing multiple times? [19:28:51] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 646 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3080666 keys, up 18 days 3 hours - replication_delay is 646 [19:29:31] ottomata, no, everytime and I went back and cleared it and did it again [19:29:35] the raid cfg [19:29:39] please feel free to look at it [19:29:39] Operations, ops-codfw, Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3169490 (Papaul) @BBlack Thanks. [19:29:51] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3057719 keys, up 18 days 3 hours - replication_delay is 0 [19:30:12] ok [19:30:34] ok i'll look at 1068, if you have a sec could you see if you can figure out why 1064 isn't talking to puppetmaster?
[19:30:51] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 3.33% of data above the critical threshold [1000.0] [19:31:09] (CR) Dzahn: [C: 2] typos: make "include profile::backup" a typo [puppet] - https://gerrit.wikimedia.org/r/347064 (owner: Dzahn) [19:33:44] (CR) Dzahn: [C: -1] "no new "if $realm" checks please" [puppet] - https://gerrit.wikimedia.org/r/347189 (owner: Paladox) [19:34:39] (CR) Dzahn: "no new "if $realm"-checks please. we can convert the phab class to profile soon" [puppet] - https://gerrit.wikimedia.org/r/347188 (owner: Paladox) [19:34:45] (CR) Dzahn: [C: -1] Phabricator: Make backup optional [puppet] - https://gerrit.wikimedia.org/r/347188 (owner: Paladox) [19:35:24] (PS3) Dzahn: jenkins: tweak log permissions [puppet] - https://gerrit.wikimedia.org/r/347435 (owner: Hashar) [19:35:46] mutante: I am not sure what happened with that jenkins log patch :/ [19:36:01] the spec test should have caught it, but then I am too lazy to do the full investigation :D [19:36:43] hashar: yea, don't worry about it :) not worth it, it was just some dependency issue. i also just reverted because i didn't want to investigate it on Friday [19:36:55] yeah make sense :] [19:37:17] (CR) Dzahn: [C: 2] jenkins: tweak log permissions [puppet] - https://gerrit.wikimedia.org/r/347435 (owner: Hashar) [19:38:10] Notice: /Stage[main]/Jenkins/Systemd::Syslog[jenkins]/File[/var/log/jenkins/jenkins.log]/group: group changed 'jenkins' to 'wikidev' [19:38:18] Notice: Finished catalog run in 14.10 seconds [19:38:21] hashar: all done [19:38:26] !log starting logstash upgrade - some log messages will be lost!
- T161908 [19:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:35] T161908: ELK 5.x deployment plan - https://phabricator.wikimedia.org/T161908 [19:38:48] !log disabling puppet on logstash1* - T161908 [19:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:57] mutante: hurrah! [19:39:14] (CR) Dzahn: "yep, all done. changed permissions and no problem this time." [puppet] - https://gerrit.wikimedia.org/r/347435 (owner: Hashar) [19:42:17] (CR) Dzahn: [C: 1] "@Giuseppe what do you think? .erb templates here for Apache config? per "so we can add beta suffixes later"? seems good to me" [puppet] - https://gerrit.wikimedia.org/r/322602 (https://phabricator.wikimedia.org/T1256) (owner: Alex Monk) [19:44:15] mutante: I found out the issue :]  and it is a bit more complicated bah [19:44:36] hashar: couldn't resist?:) of course it is, hehe [19:46:22] cmjohnson1: ah! [19:46:35] i think the partman is failing because the flex bay is showing up as sdb instead of sda [19:46:36] on 1068 [19:46:45] i just manually partitioned, and that's what happened when it came up [19:46:47] not good really [19:47:14] that makes sense but very odd that only that one would have that issue [19:47:22] ya [19:47:42] wait [19:47:42] hang on [19:48:09] i take that back [19:48:10] (PS1) Hashar: jenkins: also fix permissions for access.log [puppet] - https://gerrit.wikimedia.org/r/347437 [19:48:16] i was looking at the wrong terminal. i'm looking at 1064 [19:48:32] it partitioned, but has sdb as the flex bay?! [19:49:35] mutante: yeah could not resist.
I am running the new one via the puppet compiler [19:50:08] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [19:51:18] !log upgrading qemu and oslo packages on labvirt1002 [19:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:08] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [19:54:11] i'm going to reboot analytics1064 [19:55:44] (PS8) Dzahn: lists: convert to role/profile structure [puppet] - https://gerrit.wikimedia.org/r/346923 [19:56:57] hashar: base = /var/log filename = access.log but it's /var/log/jenkins/access.log is the jenkins part really implied there? [19:58:04] Operations, ops-eqiad, Patch-For-Review: rack and set up analyics1058-1069 - https://phabricator.wikimedia.org/T162216#3169539 (Ottomata) Ok! All but 2 of the nodes are up and running as Hadoop worker nodes. analytics1064 doesn't seem to be able to contact puppetmaster1001: ``` puppet agent -t Er... [19:58:25] cmjohnson1: phew, ok, after reboot sda is now the flex bays [19:58:29] on 1064 [19:58:32] it still can't talk to puppet though [19:59:32] (CR) Dzahn: [C: 1] lists: convert to role/profile structure (4 comments) [puppet] - https://gerrit.wikimedia.org/r/346923 (owner: Dzahn) [20:00:05] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170410T2000). Please do the needful.
[20:01:55] no parsoid deploy today [20:02:11] (PS9) Dzahn: lists: convert to role/profile structure [puppet] - https://gerrit.wikimedia.org/r/346923 [20:02:23] (CR) Dzahn: [C: 1] lists: convert to role/profile structure (1 comment) [puppet] - https://gerrit.wikimedia.org/r/346923 (owner: Dzahn) [20:03:44] none for ores [20:04:06] :) [20:04:12] Operations, ops-eqiad, Patch-For-Review: rack and set up analyics1058-1069 - https://phabricator.wikimedia.org/T162216#3169543 (Ottomata) [20:04:33] (PS1) Legoktm: Deploy Linter to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/347439 (https://phabricator.wikimedia.org/T148609) [20:04:41] cmjohnson1: can I assume you installed these in the racks in the order you listed? [20:04:53] 1058 in A1, 1059 and a1060 in A2, etc. [20:04:53] ? [20:05:12] yes [20:05:29] they're in racktables [20:07:34] question: we have production proxy e.g. webproxy.eqiad.wmnet. Do we log requests to these proxies? [20:07:41] (PS15) Gehel: Upgrade logstash to 5.x [puppet] - https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) (owner: EBernhardson) [20:07:48] i.e. requests going *through* these proxies [20:08:02] (PS10) Dzahn: lists: convert to role/profile structure [puppet] - https://gerrit.wikimedia.org/r/346923 [20:08:10] (CR) Jforrester: [C: 1] "\o/" [mediawiki-config] - https://gerrit.wikimedia.org/r/347439 (https://phabricator.wikimedia.org/T148609) (owner: Legoktm) [20:08:12] SMalyshev: yes, i think we do [20:08:30] mutante: do you know where these logs are and in which form? [20:08:52] SMalyshev: install1002:/var/log/squid3/access.log [20:08:58] for eqiad [20:09:03] (CR) Gehel: [C: 2] Upgrade logstash to 5.x [puppet] - https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) (owner: EBernhardson) [20:09:12] SMalyshev: what are you looking for? [20:09:14] mutante: aha, thanks. any analysis done on them currently?
[20:09:28] SMalyshev: i dont think there is any analysis, no [20:09:42] mutante: I'd like to see (probably not now, but in the future) how sparql federation is being used [20:10:46] SMalyshev: yea, so it's on the install servers, 1002 and 2002 respectively, and access logs are there and rotated. looks like we have 10 days [20:10:49] !log bsitzmann@tin Started deploy [mobileapps/deploy@9bc8c07]: Update mobileapps to 1695900 [20:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:59] should be no problem i guess [20:11:42] mutante: 10 days is kinda short but I wonder if it'll be possible to run some kind of script maybe... I'll think about it. [20:12:10] for now I wanted to know whether we have any logs at all, now that I know they're there I'll think how to use them [20:12:30] SMalyshev: alright, *nod* [20:12:41] mutante: thanks for the info :) [20:12:51] yw [20:15:52] (CR) Dzahn: [C: -1] "this will create /var/log/jenkins_access_log/access.log next to existing /var/log/jenkins.access.log. That is not intended is it?
Apparen" [puppet] - https://gerrit.wikimedia.org/r/347437 (owner: Hashar) [20:16:17] !log bsitzmann@tin Finished deploy [mobileapps/deploy@9bc8c07]: Update mobileapps to 1695900 (duration: 05m 27s) [20:16:19] (PS2) Hashar: jenkins: also fix permissions for access.log [puppet] - https://gerrit.wikimedia.org/r/347437 [20:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:39] (PS1) Ottomata: Add new hadoop workers to hadoop/net-topology.py.erb Bug: T152713 [puppet] - https://gerrit.wikimedia.org/r/347447 (https://phabricator.wikimedia.org/T152713) [20:17:06] mutante: yeah it is plain wrong :D [20:17:42] (CR) EBernhardson: [V: 2 C: 2] "deploying logstash 5.x" [software/logstash/plugins] - https://gerrit.wikimedia.org/r/344704 (owner: EBernhardson) [20:19:10] ok, and in my comment itself it also is :) [20:19:22] /var/log/jenkins/access.log is it :p [20:19:45] just use "jenkins" as resource name i guess [20:20:06] https://puppet-compiler.wmflabs.org/6080/contint1001.wikimedia.org/ !
[20:20:13] (CR) Ottomata: [C: 2] Add new hadoop workers to hadoop/net-topology.py.erb Bug: T152713 [puppet] - https://gerrit.wikimedia.org/r/347447 (https://phabricator.wikimedia.org/T152713) (owner: Ottomata) [20:20:17] (PS2) Ottomata: Add new hadoop workers to hadoop/net-topology.py.erb Bug: T152713 [puppet] - https://gerrit.wikimedia.org/r/347447 (https://phabricator.wikimedia.org/T152713) [20:20:22] (CR) Ottomata: [V: 2 C: 2] Add new hadoop workers to hadoop/net-topology.py.erb Bug: T152713 [puppet] - https://gerrit.wikimedia.org/r/347447 (https://phabricator.wikimedia.org/T152713) (owner: Ottomata) [20:20:28] (CR) Hashar: "Ran it via the compiler https://puppet-compiler.wmflabs.org/6080/contint1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/347437 (owner: Hashar) [20:20:31] (PS1) Mobrovac: [WIP] RESTBase: Switch to Scap3 config deploys [puppet] - https://gerrit.wikimedia.org/r/347452 [20:20:53] im correct in saying no mediawiki train week of april 17th right? [20:21:05] mutante: i think the last iteration would work :] [20:21:22] Zppix: https://wikitech.wikimedia.org/wiki/Deployments is the reference [20:21:38] Zppix: and yeah that week is frozen [20:21:54] hashar: i know about deployments page i was just confirming, and thanks [20:24:15] Operations, Discovery, Wikidata, Wikidata-Query-Service, and 2 others: Make WDQS active / active - https://phabricator.wikimedia.org/T162111#3169608 (Gehel) Using the following curl to test, I don't see an entry in the nginx access log: ``` curl 'https://query.wikidata.org/bigdata/namespace/wdq/s...
[20:25:15] (CR) Dzahn: [C: 2] "yep, looks good now" [puppet] - https://gerrit.wikimedia.org/r/347437 (owner: Hashar) [20:25:32] (PS3) Dzahn: jenkins: also fix permissions for access.log [puppet] - https://gerrit.wikimedia.org/r/347437 (owner: Hashar) [20:25:46] mutante: the reason is that I had a before systemd::syslog, which happens to generate a File resource for the log directory [20:26:07] which lead to a cycle because I defined two files in that same dir but with conflicting dependencies [20:26:08] bah [20:26:10] hashar: yea, i noticed it uses the resource name as part of the path [20:26:22] aha, yea, duplicate def [20:26:36] gotcha [20:27:45] !log deployed new logstash plugins to logstash100[123] [20:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:33] Operations, Discovery, Traffic, Wikidata, and 2 others: Make WDQS active / active - https://phabricator.wikimedia.org/T162111#3169636 (Gehel) I'm not sure the change is effective. While I do see a few requests (outside of pyball / icinga) in the nginx logs on the wdqs codwf servers, I don't see a... [20:28:53] hm, cmjohnson1 what about analytics1069 [20:28:58] is it around somewhere? [20:29:36] sort of [20:29:44] oh? hah [20:29:53] long story [20:29:59] !log upgrading logstash on logstash1001 - T161908 [20:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:06] T161908: ELK 5.x deployment plan - https://phabricator.wikimedia.org/T161908 [20:31:33] mutante: submit ?
( https://gerrit.wikimedia.org/r/347437 ) :D [20:31:58] i did once i could [20:32:04] Notice: /Stage[main]/Jenkins/File[/var/log/jenkins/access.log]/group: group changed 'jenkins' to 'wikidev' [20:32:15] (CR) Dzahn: "Notice: /Stage[main]/Jenkins/File[/var/log/jenkins/access.log]/group: group changed 'jenkins' to 'wikidev'" [puppet] - https://gerrit.wikimedia.org/r/347437 (owner: Hashar) [20:32:32] hurrah [20:32:48] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [20:33:30] (CR) Dzahn: [C: 1] "http://puppet-compiler.wmflabs.org/6079/fermium.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/346923 (owner: Dzahn) [20:33:35] hashar: :) [20:36:37] Operations, Discovery, Traffic, Wikidata, and 2 others: Make WDQS active / active - https://phabricator.wikimedia.org/T162111#3169658 (Smalyshev) @Gehel: you can check x-served-by headers in the responses - half of those should have codfw hosts there now. [20:42:08] PROBLEM - logstash process on logstash1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash [20:45:27] (PS1) Kaldari: Adding pageassessments.dblist for maintanence script [mediawiki-config] - https://gerrit.wikimedia.org/r/347468 [20:45:56] (PS2) Kaldari: Adding pageassessments.dblist for maintanence script [mediawiki-config] - https://gerrit.wikimedia.org/r/347468 [20:46:55] (CR) jerkins-bot: [V: -1] Adding pageassessments.dblist for maintanence script [mediawiki-config] - https://gerrit.wikimedia.org/r/347468 (owner: Kaldari) [20:47:10] godog: Ping? [20:47:25] Serious wierdness, and you are on call.
[20:47:32] https://commons.wikimedia.org/w/index.php?title=Commons:Village_pump&diff=240380222&oldid=240367865 [20:48:09] RECOVERY - logstash process on logstash1001 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash [20:48:12] Logged as “Undo revision 239878186 by Revent (talk (A/OS)) WTF?”, but that revision was not by me. [20:49:06] Oh wait, nevermind… [20:49:53] (PS2) Mobrovac: [WIP] RESTBase: Switch to Scap3 config deploys [puppet] - https://gerrit.wikimedia.org/r/347452 [20:51:41] (CR) jerkins-bot: [V: -1] [WIP] RESTBase: Switch to Scap3 config deploys [puppet] - https://gerrit.wikimedia.org/r/347452 (owner: Mobrovac) [20:54:27] (PS3) Mobrovac: [WIP] RESTBase: Switch to Scap3 config deploys [puppet] - https://gerrit.wikimedia.org/r/347452 [20:55:55] (CR) jerkins-bot: [V: -1] [WIP] RESTBase: Switch to Scap3 config deploys [puppet] - https://gerrit.wikimedia.org/r/347452 (owner: Mobrovac) [20:56:07] (PS3) Kaldari: Adding pageassessments.dblist for maintanence script [mediawiki-config] - https://gerrit.wikimedia.org/r/347468 (https://phabricator.wikimedia.org/T159438) [20:56:55] (PS4) Kaldari: Adding pageassessments.dblist for maintanence script [mediawiki-config] - https://gerrit.wikimedia.org/r/347468 (https://phabricator.wikimedia.org/T159438) [20:58:35] (PS1) EBernhardson: Update de_dot plugin for logstash 5.x [software/logstash/plugins] - https://gerrit.wikimedia.org/r/347488 [20:59:01] (PS2) EBernhardson: Update de_dot plugin for logstash 5.x [software/logstash/plugins] - https://gerrit.wikimedia.org/r/347488 [21:00:05] dapatrick, bawolff, and Reedy: Dear anthropoid, the time has come. Please deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170410T2100).
[21:00:09] (PS4) Mobrovac: [WIP] RESTBase: Switch to Scap3 config deploys [puppet] - https://gerrit.wikimedia.org/r/347452 [21:01:04] Operations, Discovery, Traffic, Wikidata, and 2 others: Make WDQS active / active - https://phabricator.wikimedia.org/T162111#3169714 (Gehel) grafana dashboard was wrongly filtering on eqiad only (that's why I did not see any traffic there). More tests and checking x-cache and x-served-by headers... [21:01:40] (CR) jerkins-bot: [V: -1] [WIP] RESTBase: Switch to Scap3 config deploys [puppet] - https://gerrit.wikimedia.org/r/347452 (owner: Mobrovac) [21:01:48] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [21:06:02] (PS3) EBernhardson: Update de_dot plugin for logstash 5.x [software/logstash/plugins] - https://gerrit.wikimedia.org/r/347488 [21:07:36] (CR) BryanDavis: [C: 2] Update de_dot plugin for logstash 5.x [software/logstash/plugins] - https://gerrit.wikimedia.org/r/347488 (owner: EBernhardson) [21:08:54] (CR) BryanDavis: [V: 2 C: 2] Update de_dot plugin for logstash 5.x [software/logstash/plugins] - https://gerrit.wikimedia.org/r/347488 (owner: EBernhardson) [21:13:51] !log running puppet on logstash1001 to deploy new logstash plugins - T161908 [21:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:59] T161908: ELK 5.x deployment plan - https://phabricator.wikimedia.org/T161908 [21:15:49] (CR) Krinkle: [C: 1] Wikitech: Don't try to do cross-wiki uploads to Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/347422 (https://phabricator.wikimedia.org/T162374) (owner: Jforrester) [21:16:20] Operations, ops-codfw, DC-Ops, Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3169755 (RobH) It seems that these were in the initial Dasher
orders where Intel disks were Dasher supported, not HP. @Papaul: Can you provi... [21:17:40] !log logstash upgrade on logstash1001 completed - T161908 [21:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:15] !log upgrading logstash on logstash1002 - T161908 [21:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:22] T161908: ELK 5.x deployment plan - https://phabricator.wikimedia.org/T161908 [21:22:37] Operations, ops-codfw, DC-Ops, Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3169778 (Papaul) Drive Model ATA INTEL SSDSC2BB80 [21:26:08] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[install-logstash-plugins] [21:27:08] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [21:29:11] (CR) Dzahn: [C: 1] mw_rc_irc: convert to profile/role structure (1 comment) [puppet] - https://gerrit.wikimedia.org/r/346926 (owner: Dzahn) [21:29:14] Operations, Wikimedia-Site-requests, CSS, Russian-Sites: Add unique class for 'migration warningbox' on Watchlist page in Russian Wikipedia - https://phabricator.wikimedia.org/T162639#3169785 (Iniquity) [21:29:21] (PS2) Dzahn: mw_rc_irc: convert to profile/role structure [puppet] - https://gerrit.wikimedia.org/r/346926 [21:31:26] !log upgrading logstash on logstash1003 - T161908 [21:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:34] T161908: ELK 5.x deployment plan - https://phabricator.wikimedia.org/T161908 [21:31:50] (CR) Dzahn: [C: 1] "http://puppet-compiler.wmflabs.org/6086/" [puppet] - https://gerrit.wikimedia.org/r/346926 (owner: Dzahn) [21:32:28] PROBLEM - logstash process on logstash1003 is CRITICAL:
PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash [21:33:28] RECOVERY - logstash process on logstash1003 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash [21:33:44] !log logstash upgrade on all logstash1* nodes completed- T161908 [21:33:48] PROBLEM - puppet last run on logstash1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Service[logstash] [21:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:48] RECOVERY - puppet last run on logstash1003 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [21:36:02] (PS5) Mobrovac: [WIP] RESTBase: Switch to Scap3 config deploys [puppet] - https://gerrit.wikimedia.org/r/347452 [21:36:05] Operations, ops-codfw, DC-Ops, Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3169816 (Papaul) @RobH I was able to pull the information from the HW diagnostic i did last week please see below for information Disk 1 SC... [21:39:00] (CR) jerkins-bot: [V: -1] [WIP] RESTBase: Switch to Scap3 config deploys [puppet] - https://gerrit.wikimedia.org/r/347452 (owner: Mobrovac) [21:42:32] (PS1) Volans: Disable puppet: add videoscalers [switchdc] - https://gerrit.wikimedia.org/r/347511 (https://phabricator.wikimedia.org/T160178) [21:45:29] (CR) Volans: "I know they are already included by the previous selection, but I think it's more explicit and in case we'll split them in the future."
[switchdc] - https://gerrit.wikimedia.org/r/347511 (https://phabricator.wikimedia.org/T160178) (owner: Volans) [21:47:07] (PS3) Andrew Bogott: fullstack: Switch to the normal jessie image [puppet] - https://gerrit.wikimedia.org/r/346984 [21:47:09] (PS1) Andrew Bogott: Remove the striker-users admin group [puppet] - https://gerrit.wikimedia.org/r/347513 [21:49:12] Operations, ops-codfw, DC-Ops, Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3169847 (RobH) This lists two SSDs, which one is the failed one? [21:49:28] (CR) Andrew Bogott: [C: 2] Remove the striker-users admin group [puppet] - https://gerrit.wikimedia.org/r/347513 (owner: Andrew Bogott) [21:50:18] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [21:51:50] (PS13) Andrew Bogott: wmfkeystonehooks: Create project page on wikitech on project creation [puppet] - https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) [21:51:52] (PS4) Andrew Bogott: Keystonehooks: Add two more ldap ous for sudo handling. [puppet] - https://gerrit.wikimedia.org/r/346880 [21:52:19] (CR) Andrew Bogott: [C: 2] fullstack: Switch to the normal jessie image [puppet] - https://gerrit.wikimedia.org/r/346984 (owner: Andrew Bogott) [21:52:47] (CR) jerkins-bot: [V: -1] Keystonehooks: Add two more ldap ous for sudo handling. [puppet] - https://gerrit.wikimedia.org/r/346880 (owner: Andrew Bogott) [21:53:48] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK [21:56:23] (PS6) Mobrovac: [WIP] RESTBase: Switch to Scap3 config deploys [puppet] - https://gerrit.wikimedia.org/r/347452 [21:56:39] (PS5) Andrew Bogott: Keystonehooks: Add two more ldap ous for sudo handling.
[puppet] - 10https://gerrit.wikimedia.org/r/346880 [21:58:32] (03CR) 10Andrew Bogott: [C: 032] Keystonehooks: Add two more ldap ous for sudo handling. [puppet] - 10https://gerrit.wikimedia.org/r/346880 (owner: 10Andrew Bogott) [22:09:25] (03PS14) 10Andrew Bogott: wmfkeystonehooks: Create project page on wikitech on project creation [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) [22:10:10] (03CR) 10jerkins-bot: [V: 04-1] wmfkeystonehooks: Create project page on wikitech on project creation [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) (owner: 10Andrew Bogott) [22:11:35] (03PS7) 10Mobrovac: RESTBase: Switch to Scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/347452 (https://phabricator.wikimedia.org/T116335) [22:12:54] (03PS2) 10Thcipriani: Scap: update version to 3.5.5-1 [puppet] - 10https://gerrit.wikimedia.org/r/346579 (https://phabricator.wikimedia.org/T127762) [22:15:05] (03PS15) 10Andrew Bogott: wmfkeystonehooks: Create project page on wikitech on project creation [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) [22:16:25] (03CR) 10Dzahn: deployment::server: convert to profile/role (pt. 1) (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/344728 (owner: 10Dzahn) [22:18:31] 06Operations, 10Wikimedia-Site-requests, 07CSS, 07Russian-Sites: Add unique class for 'migration warningbox' on Watchlist page in Russian Wikipedia - https://phabricator.wikimedia.org/T162639#3169914 (10Iniquity) 05Open>03Invalid Sorry. Figured out. 
[22:20:13] (PS1) Jdlrobson: Remove deprecated config option [mediawiki-config] - https://gerrit.wikimedia.org/r/347521
[22:20:44] (PS1) Andrew Bogott: keystone.conf: Whitespace cleanups [puppet] - https://gerrit.wikimedia.org/r/347522
[22:22:26] (CR) Andrew Bogott: [C: 2] keystone.conf: Whitespace cleanups [puppet] - https://gerrit.wikimedia.org/r/347522 (owner: Andrew Bogott)
[22:24:46] (PS16) Andrew Bogott: wmfkeystonehooks: Create project page on wikitech on project creation [puppet] - https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091)
[22:28:57] (PS10) BBlack: dnsrecursor: 4.x backport and edns-client-subnet [puppet] - https://gerrit.wikimedia.org/r/346937
[22:28:58] (PS7) BBlack: dnsrecursor: update to backports for transition [puppet] - https://gerrit.wikimedia.org/r/346980
[22:31:10] (PS4) Dzahn: deployment::server: convert to profile/role [puppet] - https://gerrit.wikimedia.org/r/344728
[22:31:42] (CR) BBlack: [C: 1] "For PS10, I re-worked the config diff to net zero lines for trusty dnsrecursor installs, avoiding unnecessary service restarts on those ho" [puppet] - https://gerrit.wikimedia.org/r/346937 (owner: BBlack)
[22:33:00] (CR) jerkins-bot: [V: -1] deployment::server: convert to profile/role [puppet] - https://gerrit.wikimedia.org/r/344728 (owner: Dzahn)
[22:41:14] (PS5) Dzahn: deployment::server: convert to profile/role [puppet] - https://gerrit.wikimedia.org/r/344728
[22:42:56] (Draft1) Paladox: irc echo: Convert to base::service class to maintain the script [puppet] - https://gerrit.wikimedia.org/r/347518
[22:42:59] (PS2) Paladox: irc echo: Convert to base::service class to maintain the script [puppet] - https://gerrit.wikimedia.org/r/347518
[22:43:10] mutante ^^ :)
[22:43:12] (PS3) Paladox: irc echo: Convert to base::service class to maintain the script [puppet] - https://gerrit.wikimedia.org/r/347518
[22:45:09] !log Deployed patch for T162621 to wmf18 and wmf19
[22:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:45:32] (PS1) Catrope: Add ORES thresholds for fawiki, ruwiki, trwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/347526
[22:47:03] (PS6) Dzahn: deployment::server: convert to profile/role [puppet] - https://gerrit.wikimedia.org/r/344728
[22:58:00] paladox: compiling
[22:58:17] ok
[22:58:18] thanks
[23:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170410T2300). Please do the needful.
[23:00:05] RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:32] I think I'm the only one, so I'll deploy my patch myself
[23:00:45] (CR) Catrope: [C: 2] Add ORES thresholds for fawiki, ruwiki, trwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/347526 (owner: Catrope)
[23:03:32] (CR) BBlack: [C: 2] dnsrecursor: 4.x backport and edns-client-subnet [puppet] - https://gerrit.wikimedia.org/r/346937 (owner: BBlack)
[23:03:38] (CR) BBlack: [C: 2] dnsrecursor: update to backports for transition [puppet] - https://gerrit.wikimedia.org/r/346980 (owner: BBlack)
[23:04:03] (Merged) jenkins-bot: Add ORES thresholds for fawiki, ruwiki, trwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/347526 (owner: Catrope)
[23:04:11] !log puppet disabled on jessie recdns (maerlant, nescio, hydrogen, chromium) for complex upgrade process ( https://gerrit.wikimedia.org/r/#/c/346937/ )
[23:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:04:17] (CR) jenkins-bot: Add ORES thresholds for fawiki, ruwiki, trwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/347526 (owner: Catrope)
[23:09:54] (PS1) BBlack: ulsfo lvs: prefer codfw dns [puppet] - https://gerrit.wikimedia.org/r/347531
[23:09:56] (PS1) BBlack: eqiad lvs: do not directly use hydrogen temporarily [puppet] - https://gerrit.wikimedia.org/r/347532
[23:10:17] (CR) BBlack: [V: 2 C: 2] ulsfo lvs: prefer codfw dns [puppet] - https://gerrit.wikimedia.org/r/347531 (owner: BBlack)
[23:14:31] (CR) BBlack: [C: 2] eqiad lvs: do not directly use hydrogen temporarily [puppet] - https://gerrit.wikimedia.org/r/347532 (owner: BBlack)
[23:17:23] (PS1) Volans: Logging: filter out all cumin's messages from stderr [switchdc] - https://gerrit.wikimedia.org/r/347534 (https://phabricator.wikimedia.org/T160178)
[23:18:53] !log bblack@neodymium conftool action : set/pooled=no; selector: name=hydrogen.wikimedia.org,service=pdns_recursor
[23:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:21:18] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:18] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:28] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:28] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:28] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:28] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:38] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:38] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:38] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:38] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:38] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:52] PROBLEM - LVS HTTP IPv4 on eventbus.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:21:58] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - eventbus_8085 - Could not depool server kafka1002.eqiad.wmnet because of too many down!
[23:21:58] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:58] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:21:58] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - eventbus_8085 - Could not depool server kafka1002.eqiad.wmnet because of too many down!
[23:22:29] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:22:51] RECOVERY - LVS HTTP IPv4 on eventbus.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2247 bytes in 4.661 second response time
[23:22:52] I don't think that's me
[23:22:56] hrmm
[23:23:09] the timing is suspicious though, so I've paused what I'm working on
[23:23:18] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy
[23:23:20] over aggressive monitoring perhaps?
[23:23:28] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy
[23:23:29] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy
[23:23:38] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[23:23:38] RECOVERY - eventlogging-service-eventbus endpoints health on kafka1001 is OK: All endpoints are healthy
[23:23:38] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy
[23:23:38] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[23:23:38] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
[23:23:48] RECOVERY - eventlogging-service-eventbus endpoints health on kafka1003 is OK: All endpoints are healthy
[23:23:48] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[23:23:48] RECOVERY - eventlogging-service-eventbus endpoints health on kafka1002 is OK: All endpoints are healthy
[23:23:53] I haven't stopped anything yet
[23:24:08] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[23:24:12] (also the things I'm doing definitely couldn't impact codfw along with eqiad all at the same time)
[23:24:18] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[23:24:18] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[23:24:24] Operations, ops-codfw, DC-Ops, Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3170063 (RobH) Update from IRC: Papaul wasn't sure which SSD failed, he just pulled both. He'll place one of the two back in and run the di...
[23:24:28] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[23:24:28] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy
[23:24:28] PROBLEM - puppet last run on db1070 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:25:11] What happened? Eventbus paged flapping
[23:25:16] got it too
[23:25:17] dunno
[23:25:28] looks like maybe a more general scb problem based on other pages too?
[23:25:28] ah, just saw backscroll
[23:25:30] eventbus looks ok
[23:25:34] I think eventbus just flapped on its own, but it's also in the midst of me working on recdns
[23:25:40] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Set ORES thresholds for fawiki, ruwiki, trwiki (duration: 00m 39s)
[23:25:45] I haven't stopped any recdns service though, just doing the prerequisite prep work
[23:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:59] don't have backscroll, just logged on
[23:26:09] the only way they could be related, I think, is if when I depooled hydrogen, chromium can't handle normal eqiad recdns load
[23:26:16] but it doesn't look overloaded, either
[23:26:58] PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:27:16] did all scb services flap?
[23:28:24] hm, dunno, everything looks ok. i'll check back later
[23:28:33] I see eventlogging, mobileapps, restbase,
[23:28:38] I don't know if that's all or not
[23:28:48] it's odd that it's scb in both datacenters at the same time, though
[23:29:04] that is odd
[23:29:15] eventbus processes are still running fine
[23:29:25] at least on one box
[23:32:29] in any case, they all recovered, and I'm continuing in my steps
[23:32:46] (I still don't think there's any relation between my recdns and whatever happened there with eventbus, afaics)
[23:33:24] !log upgrading hydrogen to pdns-recursor 4.x
[23:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:34:31] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=hydrogen.wikimedia.org,service=pdns_recursor
[23:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:52] paladox: i cant compile changes on einsteinium: http://puppet-compiler.wmflabs.org/6093/einsteinium.wikimedia.org/change.einsteinium.wikimedia.org.err
[23:36:19] Operations, HHVM, Patch-For-Review, Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3170099 (Krinkle)
[23:36:19] oh
[23:36:21] notices "Warning: Scope(Class[Base::Standard_packages]): os_version(): obsolete distribution check in ubuntu >= precise" too, unrelatedly
[23:36:21] Operations, MediaWiki-Internationalization, HHVM, MW-1.29-release (WMF-deploy-2017-03-21_(1.29.0-wmf.17)), and 3 others: Uninitialized string offset warnings with HHVM 3.18 in LanguageAz.php and LanguageKk.php - https://phabricator.wikimedia.org/T161095#3170096 (Krinkle) Open>Resolved a:...
[23:36:33] (PS1) BBlack: Revert "eqiad lvs: do not directly use hydrogen temporarily" [puppet] - https://gerrit.wikimedia.org/r/347536
[23:36:34] mutante those failures look un related to my patch.
[23:36:45] (CR) BBlack: [V: 2 C: 2] Revert "eqiad lvs: do not directly use hydrogen temporarily" [puppet] - https://gerrit.wikimedia.org/r/347536 (owner: BBlack)
[23:36:57] paladox: i know, they are unrelated. but it means i cant test your change
[23:37:10] ok
[23:37:19] i am just telling you that compiling it failed (for other reasons)
[23:37:44] ok
[23:43:22] hrhr, of course i get "Error: Could not find class ::role::backup::host"
[23:43:59] !log jessie-recdns: upgrade to pdns-recursor 4.x paused - hydrogen updated and in-service; chromium/nescio/maerlant still puppet-disabled. Going to leave things in this state for a while. If something seems amiss, hydrogen can be re-depooled via confctl: confctl select name=hydrogen.wikimedia.org,service=pdns_recursor set/pooled=no
[23:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:46:11] (PS7) Dzahn: deployment::server: convert to profile/role [puppet] - https://gerrit.wikimedia.org/r/344728
[23:46:56] (PS4) Dzahn: ircecho: Convert to base::service class to maintain the script [puppet] - https://gerrit.wikimedia.org/r/347518 (owner: Paladox)
[23:48:21] (CR) Dzahn: [C: -1] "http://puppet-compiler.wmflabs.org/6095/" [puppet] - https://gerrit.wikimedia.org/r/342777 (owner: Dzahn)
[23:52:29] RECOVERY - puppet last run on db1070 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[23:55:32] (PS4) Dzahn: mediawiki::maintenance: convert to profile/role (WIP) [puppet] - https://gerrit.wikimedia.org/r/342777
[23:55:46] (CR) jerkins-bot: [V: -1] mediawiki::maintenance: convert to profile/role (WIP) [puppet] - https://gerrit.wikimedia.org/r/342777 (owner: Dzahn)
[23:57:23] (PS5) Dzahn: mediawiki::maintenance: convert to profile/role (WIP) [puppet] - https://gerrit.wikimedia.org/r/342777
[23:59:47] (PS1) Tim Starling: Use EtcdConfig [mediawiki-config] - https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924)