[01:14:54] PROBLEM - puppet last run on relforge1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:15:14] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 1 failures [01:20:04] RECOVERY - check_puppetrun on bellatrix is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [01:26:14] PROBLEM - puppet last run on ms-be1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:29:14] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 83787.342553 Seconds [01:29:54] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 83824.095022 Seconds [01:29:54] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 83824.107284 Seconds [01:30:24] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 81441.250886 Seconds [01:30:24] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 81441.250503 Seconds [01:30:34] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 81448.575023 Seconds [01:43:14] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [01:43:54] RECOVERY - puppet last run on relforge1002 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [01:46:14] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 84807.457752 Seconds [01:49:24] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [01:49:24] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [01:49:54] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [01:49:54] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [01:51:14] RECOVERY - Postgres 
Replication Lag on maps1002 is OK: OK - Rep Delay is: 23.427531 Seconds [01:52:24] PROBLEM - puppet last run on db1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:52:24] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 82762.043985 Seconds [01:52:25] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 82762.044495 Seconds [01:53:24] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 43.174927 Seconds [01:53:24] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 43.176464 Seconds [01:53:34] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 49.457395 Seconds [01:55:14] RECOVERY - puppet last run on ms-be1022 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [02:20:24] RECOVERY - puppet last run on db1055 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [02:21:28] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.17) (duration: 08m 22s) [02:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:26:53] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Mar 27 02:26:53 UTC 2017 (duration 5m 25s) [02:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:41:34] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [02:43:45] huh, https://commons.wikimedia.org/wiki/File:Assemblea_Costituente_1946_(2).svg seems to have the underlying file be missing somehow [02:44:54] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 1835.167609 Seconds [02:44:54] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 1835.168252 Seconds [02:47:14] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 1978.281989 Seconds [02:49:54] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [02:51:14] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [02:52:54] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [02:54:14] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 2398.264469 Seconds [02:55:31] Revent: hey… anyone? [02:55:32] [9:53pm] Revent: https://commons.wikimedia.org/w/index.php?title=File:Svgfiles-2017-03-26-18-56-23-164701-11810081852259458081.svg&action=history [02:55:33] [9:54pm] Revent: The move is fucked… the file is not visible at the new location. [02:55:34] [9:55pm] Revent: Oh shit, wrong channel. 
[02:56:54] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 2554.987493 Seconds [02:56:54] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 2554.992726 Seconds [03:00:54] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [03:02:24] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 1971.075378 Seconds [03:02:24] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 1971.075799 Seconds [03:03:54] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 2975.219963 Seconds [03:04:34] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 2097.289523 Seconds [03:06:14] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [03:06:34] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds [03:07:24] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [03:07:24] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [03:09:34] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 2397.525389 Seconds [03:09:54] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [03:09:54] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [03:10:24] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 2451.228734 Seconds [03:10:25] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 2451.232024 Seconds [03:12:34] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [03:16:14] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] 
[03:16:24] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [03:17:34] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds [03:18:03] Operations, Commons, media-storage: Commons File:Assemblea_Costituente_1946_(2).svg missing after file move - https://phabricator.wikimedia.org/T161476#3132721 (Bawolff) [03:20:34] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 3057.431478 Seconds [03:21:34] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds [03:23:24] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [03:23:25] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [03:27:24] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:29:14] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:37:54] PROBLEM - puppet last run on mc1020 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:39:54] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 1838.754955 Seconds [03:41:54] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [03:44:14] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 2101.819626 Seconds [03:44:34] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 4497.501202 Seconds [03:45:14] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [03:45:54] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 2198.693571 Seconds [03:45:54] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 2198.698373 Seconds [03:46:34] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds [03:46:54] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [03:46:54] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [03:51:24] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 4911.150571 Seconds [03:51:54] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 2558.78351 Seconds [03:51:54] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 2558.79434 Seconds [03:53:24] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [03:53:54] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [03:53:54] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [03:54:14] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is inactive [03:54:34] PROBLEM - Postgres Replication Lag on maps2002 is 
CRITICAL: CRITICAL - Rep Delay is: 5097.30056 Seconds [03:55:34] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds [03:58:34] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 5337.454393 Seconds [03:59:34] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds [04:05:34] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 5757.43627 Seconds [04:06:54] RECOVERY - puppet last run on mc1020 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [04:07:24] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 5871.086865 Seconds [04:07:24] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 5871.087077 Seconds [04:07:34] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds [04:09:24] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [04:09:24] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [04:13:14] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active [04:13:54] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2100.50 Read Requests/Sec=384.20 Write Requests/Sec=142.10 KBytes Read/Sec=36518.00 KBytes_Written/Sec=5897.20 [04:16:24] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 6411.131524 Seconds [04:16:24] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 6411.132367 Seconds [04:17:24] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [04:17:24] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [04:22:54] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.70 
Read Requests/Sec=0.50 Write Requests/Sec=12.30 KBytes Read/Sec=4.80 KBytes_Written/Sec=75.20 [04:26:14] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 1971.022847 Seconds [04:27:14] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [04:29:54] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 2188.013367 Seconds [04:29:54] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 2188.044187 Seconds [04:30:54] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [04:30:54] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [04:36:54] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 2607.839761 Seconds [04:36:54] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 2607.83982 Seconds [04:38:54] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [04:38:54] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [04:44:24] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [04:45:44] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [04:49:14] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [04:54:44] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [05:01:44] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [05:03:44] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [05:11:14] RECOVERY 
- Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:11:44] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [05:12:24] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:12:44] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:46:34] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:06:00] !log Resume pt-table-checksum on dewiki - T161294 [06:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:09] T161294: run pt-tablechecksum on s5 - https://phabricator.wikimedia.org/T161294 [06:14:34] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:18:32] <_joe_> !log disabling puppet on authdns while merging a dns change [06:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:44] (PS1) Marostegui: db-eqiad.php: Depool db1053 [mediawiki-config] - https://gerrit.wikimedia.org/r/344895 (https://phabricator.wikimedia.org/T73563) [06:20:24] (CR) Giuseppe Lavagetto: [C: 2] Add missing discovery entries [puppet] - https://gerrit.wikimedia.org/r/344088 (owner: Giuseppe Lavagetto) [06:23:32] (PS1) Giuseppe Lavagetto: discovery: add active_active stanza for search [puppet] - https://gerrit.wikimedia.org/r/344897 [06:23:47] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1053 [mediawiki-config] - https://gerrit.wikimedia.org/r/344895 (https://phabricator.wikimedia.org/T73563) (owner: Marostegui) [06:23:53] (CR) Giuseppe Lavagetto: [V: 2 C: 2] discovery: add active_active stanza for search [puppet] - https://gerrit.wikimedia.org/r/344897 (owner: Giuseppe Lavagetto) [06:25:18] 
(Merged) jenkins-bot: db-eqiad.php: Depool db1053 [mediawiki-config] - https://gerrit.wikimedia.org/r/344895 (https://phabricator.wikimedia.org/T73563) (owner: Marostegui) [06:26:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1053 T160415 - T73563 (duration: 00m 56s) [06:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:36] T160415: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415 [06:26:36] T73563: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563 [06:26:37] !log Deploy alter table s4 (commonswiki) db1053 - https://phabricator.wikimedia.org/T73563 https://phabricator.wikimedia.org/T160415 [06:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:55] (CR) Giuseppe Lavagetto: [C: 2] Add new discovery entries [dns] - https://gerrit.wikimedia.org/r/344093 (owner: Giuseppe Lavagetto) [06:28:04] (CR) jenkins-bot: db-eqiad.php: Depool db1053 [mediawiki-config] - https://gerrit.wikimedia.org/r/344895 (https://phabricator.wikimedia.org/T73563) (owner: Marostegui) [06:28:24] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:29:55] (PS1) Giuseppe Lavagetto: discovery: convert switft-ro to active-active [puppet] - https://gerrit.wikimedia.org/r/344900 [06:30:14] (CR) Giuseppe Lavagetto: [V: 2 C: 2] discovery: convert switft-ro to active-active [puppet] - https://gerrit.wikimedia.org/r/344900 (owner: Giuseppe Lavagetto) [06:33:14] (PS1) Giuseppe Lavagetto: discovery: wdqs is active-active too [puppet] - https://gerrit.wikimedia.org/r/344904 [06:33:39] (CR) Giuseppe Lavagetto: [V: 2 C: 2] discovery: wdqs is active-active too [puppet] - https://gerrit.wikimedia.org/r/344904 (owner: Giuseppe Lavagetto) [06:35:54] PROBLEM - puppet last run on mw1306 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:41:44] (PS2) Muehlenhoff: Adapt debdeploy grain to rename of nova::api role [puppet] - https://gerrit.wikimedia.org/r/344611 [06:44:56] Operations, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters: Prepare and improve the datacenter switchover procedure - https://phabricator.wikimedia.org/T154658#3132876 (Joe) [06:55:50] (PS1) Elukey: Move Hadoop cron MAILTO from singe person to analytics-alerts [puppet] - https://gerrit.wikimedia.org/r/344908 (https://phabricator.wikimedia.org/T160888) [06:58:57] (CR) Elukey: [C: 2] Move Hadoop cron MAILTO from singe person to analytics-alerts [puppet] - https://gerrit.wikimedia.org/r/344908 (https://phabricator.wikimedia.org/T160888) (owner: Elukey) [06:59:02] Operations, MediaWiki-extensions-UniversalLanguageSelector, I18n, Verified: ULS causes pages to be cached with random user language - https://phabricator.wikimedia.org/T43451#3132899 (Nemo_bis) Resolved>declined [07:01:41] (CR) Muehlenhoff: [C: 2] Adapt debdeploy grain to rename of nova::api role [puppet] - https://gerrit.wikimedia.org/r/344611 (owner: Muehlenhoff) [07:01:47] (PS3) 
Muehlenhoff: Adapt debdeploy grain to rename of nova::api role [puppet] - https://gerrit.wikimedia.org/r/344611 [07:01:54] Operations, Analytics-Kanban, WMDE-Analytics-Engineering, Patch-For-Review, User-Addshore: /a/mw-log/archive/api on stat1002 no longer being populated - https://phabricator.wikimedia.org/T160888#3132915 (elukey) Removed `/srv/mw-log/archive/api_log_backup_elukey/*` from mwlog1001 and verified... [07:03:26] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [07:03:56] RECOVERY - puppet last run on mw1306 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [07:09:34] Operations, Mail: E-mail for people in different OIT LDAP object unit - https://phabricator.wikimedia.org/T159750#3077335 (MoritzMuehlenhoff) I'd say let's make the name a little bit more generic, how about ou=mail_only or similar? [07:09:57] Operations, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters: Prepare and improve the datacenter switchover procedure - https://phabricator.wikimedia.org/T154658#3132930 (Joe) What still needs to be done: - Integrate discovery system in puppet/MediaWiki config extensive... 
[07:16:15] Operations, Ops-Access-Requests: Granting wmde group access to - https://phabricator.wikimedia.org/T161484#3132935 (MoritzMuehlenhoff) [07:16:31] (PS1) Marostegui: db-eqiad.php: Depool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/344911 (https://phabricator.wikimedia.org/T17441) [07:17:05] !log elukey@puppetmaster1001 conftool action : set/pooled=active; selector: name=mw2256.codfw.wmnet [07:17:07] (PS2) Marostegui: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/344911 (https://phabricator.wikimedia.org/T17441) [07:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:21] (executed scap pull on mw2256 just in case but it was already "pooled=no", so in tin's mw dsh) [07:25:16] (CR) Jcrespo: [C: 1] db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/344911 (https://phabricator.wikimedia.org/T17441) (owner: Marostegui) [07:30:50] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/344911 (https://phabricator.wikimedia.org/T17441) (owner: Marostegui) [07:32:20] (Merged) jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/344911 (https://phabricator.wikimedia.org/T17441) (owner: Marostegui) [07:32:32] (CR) jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/344911 (https://phabricator.wikimedia.org/T17441) (owner: Marostegui) [07:33:48] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1089 - T17441 (duration: 00m 45s) [07:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:54] T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441 [07:38:03] (PS5) DCausse: [es5 upgrade] step 5: restore normal operations [mediawiki-config] - 
https://gerrit.wikimedia.org/r/342034 (https://phabricator.wikimedia.org/T157479) [07:47:14] (CR) Giuseppe Lavagetto: [C: -1] "Good job, but there are a few things that need to be improved (and I also added a few other optional comments too)." (10 comments) [puppet] - https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) (owner: Gehel) [07:47:25] (PS1) Jcrespo: mariadb: repool es2014 after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/344913 (https://phabricator.wikimedia.org/T129350) [07:47:26] PROBLEM - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [07:47:36] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:47:36] PROBLEM - cassandra-b service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [07:48:06] PROBLEM - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused [07:48:57] (CR) Jcrespo: [C: 2] mariadb: repool es2014 after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/344913 (https://phabricator.wikimedia.org/T129350) (owner: Jcrespo) [07:49:24] (CR) Hashar: "There are a three Precise instances left in labs (T143349). 
For at least two I am pretty sure they dont use mediawiki::packages::legacy:" [puppet] - https://gerrit.wikimedia.org/r/343309 (https://phabricator.wikimedia.org/T158652) (owner: Hashar) [07:50:11] (Merged) jenkins-bot: mariadb: repool es2014 after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/344913 (https://phabricator.wikimedia.org/T129350) (owner: Jcrespo) [07:50:19] (CR) jenkins-bot: mariadb: repool es2014 after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/344913 (https://phabricator.wikimedia.org/T129350) (owner: Jcrespo) [07:54:56] PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [07:58:25] (CR) Giuseppe Lavagetto: [C: 2] redis: cleanup unused modules/files [puppet] - https://gerrit.wikimedia.org/r/344623 (owner: Giuseppe Lavagetto) [07:58:33] (PS2) Giuseppe Lavagetto: redis: cleanup unused modules/files [puppet] - https://gerrit.wikimedia.org/r/344623 [07:58:38] (CR) Giuseppe Lavagetto: [V: 2 C: 2] redis: cleanup unused modules/files [puppet] - https://gerrit.wikimedia.org/r/344623 (owner: Giuseppe Lavagetto) [08:01:57] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool es2014 after maintenance (duration: 00m 43s) [08:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:36] RECOVERY - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-b valid until 2017-11-17 00:54:33 +0000 (expires in 234 days) [08:13:36] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational [08:13:36] RECOVERY - cassandra-b service on restbase2012 is OK: OK - cassandra-b is active [08:14:06] RECOVERY - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is OK: TCP OK - 0.036 second response time on 10.192.48.69 port 9042 [08:16:15] (PS3) Giuseppe Lavagetto: realm: remove 
parsoid_site, switch to discovery. [puppet] - https://gerrit.wikimedia.org/r/340993 [08:16:39] !log Deploy alter tables on db1089 (depooled) for a bunch of tables to convert UNIQUE keys into PK for testing - T17441 [08:16:41] (PS1) Elukey: Move hue.w.o's backend to thorium [puppet] - https://gerrit.wikimedia.org/r/344916 (https://phabricator.wikimedia.org/T159527) [08:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:46] T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441 [08:22:27] moritzm: hello. I am merging and deploying the lcfirst() fix for hhvm [08:22:32] ( https://gerrit.wikimedia.org/r/#/c/344618/ ) [08:23:05] (PS3) Ema: varnish: remove varnish::monitoring::ganglia [puppet] - https://gerrit.wikimedia.org/r/337002 [08:23:11] (CR) Ema: [V: 2 C: 2] varnish: remove varnish::monitoring::ganglia [puppet] - https://gerrit.wikimedia.org/r/337002 (owner: Ema) [08:23:58] marostegui: jynus: hello! I will do a trivial code deploy in a few minutes. [08:24:10] hashar: sounds good, go ahead [08:24:56] RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK [08:36:26] doing it now [08:38:16] !log hashar@tin Synchronized php-1.29.0-wmf.17/languages/classes/LanguageKk.php: Check for string initialization in lcfirst() for HHVM 3.18 - T161095 (duration: 00m 52s) [08:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:22] T161095: Uninitialized string offset warnings with HHVM 3.18 in LanguageAz.php and LanguageKk.php - https://phabricator.wikimedia.org/T161095 [08:39:22] hashar: when you get a min can you take a look at my ping on phab [08:39:38] Zppix: which ping? I receive a hundred notification per day! 
[08:39:53] The ping from me a few mins ago [08:40:29] Operations, MediaWiki-Internationalization, HHVM, MW-1.29-release (WMF-deploy-2017-03-28_(1.29.0-wmf.18)), and 3 others: Uninitialized string offset warnings with HHVM 3.18 in LanguageAz.php and LanguageKk.php - https://phabricator.wikimedia.org/T161095#3133044 (hashar) LanguageKk:lcfirst() shoul... [08:41:08] T152484 hashar [08:41:09] T152484: Simplify Menu creation - https://phabricator.wikimedia.org/T152484 [08:42:16] !log deploying semisync replication to all hosts (eqiad and codfw) on s6 T161007 [08:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:22] T161007: Decouple Mariadb semi-sync replication from $::mw_primary - https://phabricator.wikimedia.org/T161007 [08:43:53] Zppix: there is not much to do rally [08:45:41] Operations, DBA, DC-Switchover-Prep-Q3-2016-17, Patch-For-Review, Wikimedia-Multiple-active-datacenters: Decouple Mariadb semi-sync replication from $::mw_primary - https://phabricator.wikimedia.org/T161007#3133052 (jcrespo) I think this will work with no issue and no impact on production, wh... [08:45:43] Operations, ops-codfw, DC-Ops, User-Elukey: Reclaim/Decommission mw2090->mw2096 (OOW) - https://phabricator.wikimedia.org/T161488#3133053 (elukey) [08:45:47] It was an ci question and i know you manage ci [08:45:57] Operations, Analytics-Kanban, WMDE-Analytics-Engineering, Patch-For-Review, User-Addshore: /a/mw-log/archive/api on stat1002 no longer being populated - https://phabricator.wikimedia.org/T160888#3133065 (Addshore) >>! In T160888#3132915, @elukey wrote: > @Addshore is everything ok from your s... [08:47:07] Operations, ops-codfw, Patch-For-Review, User-Elukey: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#3133067 (elukey) Open>Resolved Created T161488 to decommission 7 old API codfw appservers (replaced by these ones). 
mw2256 works fine and it is now active in... [08:48:29] Operations, ops-codfw, DC-Ops, User-Elukey: Reclaim/Decommission mw2090->mw2096 (OOW) - https://phabricator.wikimedia.org/T161488#3133053 (elukey) [08:54:00] (CR) Jcrespo: "Compiler does exactly as expected: https://puppet-compiler.wmflabs.org/5914/" [puppet] - https://gerrit.wikimedia.org/r/344442 (https://phabricator.wikimedia.org/T161007) (owner: Jcrespo) [08:54:32] (CR) Marostegui: [C: 1] mariadb-core: Decouple Mariadb semi-sync replication from $::mw_primary [puppet] - https://gerrit.wikimedia.org/r/344442 (https://phabricator.wikimedia.org/T161007) (owner: Jcrespo) [09:01:33] Operations, DBA, DC-Switchover-Prep-Q3-2016-17, Patch-For-Review, Wikimedia-Multiple-active-datacenters: Decouple Mariadb semi-sync replication from $::mw_primary - https://phabricator.wikimedia.org/T161007#3133089 (Volans) I'm not saying it will not work, I just suggested to monitor it, beca... [09:04:28] Operations, DBA, DC-Switchover-Prep-Q3-2016-17, Patch-For-Review, Wikimedia-Multiple-active-datacenters: Decouple Mariadb semi-sync replication from $::mw_primary - https://phabricator.wikimedia.org/T161007#3133092 (jcrespo) > If so I would consider not having it for the cross-DC or gather so... [09:15:19] Hey Ops, I want to have an unscheduled deploy for this UBN! bug: https://phabricator.wikimedia.org/T161263 [09:15:19] (PS4) Giuseppe Lavagetto: realm: remove parsoid_site, switch to discovery. [puppet] - https://gerrit.wikimedia.org/r/340993 [09:15:28] The backport is https://gerrit.wikimedia.org/r/#/c/344879/ [09:15:43] Can I do it now, or should I wait until SWAT [09:15:49] <_joe_> jynus, marostegui \o/ [09:17:14] <_joe_> Amir1: if it's UBN! I think it can be done out of SWAT windows, but I'd ask releng [09:17:20] <_joe_> hashar, zeljkof ^^ [09:17:36] Thanks :) [09:18:22] (CR) Giuseppe Lavagetto: [C: 2] realm: remove parsoid_site, switch to discovery. 
[puppet] - https://gerrit.wikimedia.org/r/340993 (owner: Giuseppe Lavagetto) [09:21:05] Amir1: hashar has way more experience with deployment than I do, I would wait for him to say it is ok [09:21:09] Amir1: I cant tell how valid that code is really [09:21:28] oh that is wikidata.. [09:21:42] hashar: In Friday, WMDE people tested and saw it fixed the bug [09:23:00] who knows what is going to explode when adding unicode flag to a regex :D [09:23:32] :P [09:24:00] I +2ed it [09:24:04] _joe_, what do you refer to? [09:24:09] do you have a way to reproduce it reliably via mwdebug1001 ? [09:24:23] <_joe_> jynus: the mwprimary removal :) [09:24:26] hashar: yup [09:24:43] _joe_, it is blocked by riccardo, though [09:24:54] ? [09:25:28] jynus: ? [09:26:22] you disagree with the patch at https://phabricator.wikimedia.org/T161007#3126018 [09:27:11] hashar: do you want to deploy it or should I do? [09:28:04] jynus: I don't disagree, I just said let's check the numbers and btw you're right I forgot all the details of the implementation of semi-sync ;) [09:28:22] so is my explanation ok for you? [09:28:39] sure [09:29:16] basically, I do not like to deploy it unless I understood 100% the comment [09:29:24] Operations, DNS, Discovery, Labs, and 3 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256#3133149 (grin) >>! In T161256#3128643, @MaxSem wrote: > Now, in the time of HTTP/2.0 over TLS, there are modern pipelinin... [09:29:42] because normally it is me that misunderstands the comments [09:31:31] Amir1: you can do it [09:31:32] :) [09:31:34] Operations, DBA, DC-Switchover-Prep-Q3-2016-17, Patch-For-Review, Wikimedia-Multiple-active-datacenters: Decouple Mariadb semi-sync replication from $::mw_primary - https://phabricator.wikimedia.org/T161007#3133166 (Volans) Right, I was forgetting the details of the implementation, I agree th... 
[09:31:38] Thanks [09:31:39] Amir1: make sure to test on mwdebug1001 ! [09:31:43] yup [09:31:47] the change is still being tested though [09:32:00] (03PS4) 10Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) [09:32:13] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) (owner: 10Gehel) [09:33:20] !log mforns@tin Started deploy [analytics/aqs/deploy@80a9de4]: (no justification provided) [09:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:09] !log mforns@tin Finished deploy [analytics/aqs/deploy@80a9de4]: (no justification provided) (duration: 01m 49s) [09:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:48] (03PS3) 10Jcrespo: mariadb-core: Decouple Mariadb semi-sync replication from $::mw_primary [puppet] - 10https://gerrit.wikimedia.org/r/344442 (https://phabricator.wikimedia.org/T161007) [09:37:18] (03PS5) 10Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) [09:37:45] (03CR) 10Gehel: "rebased" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) (owner: 10Gehel) [09:40:23] Amir1: it merged [09:40:40] yup, Already logged in in tin [09:40:42] (03PS6) 10Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) [09:41:16] (03PS2) 10Peachey88: Add wmde ldap group to grafana [puppet] - 10https://gerrit.wikimedia.org/r/333024 (https://phabricator.wikimedia.org/T161484) (owner: 10Addshore) [09:45:21] (03CR) 10Alexandros Kosiaris: [C: 04-1] url_downloader: convert to 
profile/role (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/344729 (owner: 10Dzahn) [09:45:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Overall looks correct, minor comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/344729 (owner: 10Dzahn) [09:45:55] Confirming it fixes the issue [09:45:55] https://www.wikidata.org/w/index.php?title=Q20031362&diff=470683322&oldid=470629430 [09:46:11] 06Operations, 06Operations-Software-Development: E901 SyntaxError: invalid syntax is wrongly raised on using python's abc by jenkins python CI linter - https://phabricator.wikimedia.org/T152950#3133206 (10hashar) a:05hashar>03None Seems the easiest is to move the code out of puppet.git to a standalone repo... [09:46:26] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:47:30] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Granting wmde group access to grafana-admin.wikimedia.org - https://phabricator.wikimedia.org/T161484#3133210 (10MoritzMuehlenhoff) [09:48:09] 06Operations, 10Icinga: Icinga check for sysctl settings - https://phabricator.wikimedia.org/T160060#3133212 (10MoritzMuehlenhoff) p:05Triage>03High [09:48:39] (03PS1) 10DCausse: [cirrus] enable more accurate regex timeout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344922 (https://phabricator.wikimedia.org/T161095) [09:48:49] (03PS7) 10Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) [09:49:19] icinga, fatalmonitor and logstash seem happy [09:49:22] going everywhere [09:50:30] (03CR) 10DCausse: [C: 031] Update mwgrep for elasticsearch 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344044 (https://phabricator.wikimedia.org/T161055) (owner: 10EBernhardson) [09:50:53] <_joe_> gehel: hehe, I have more comments incoming now that things start to take shape :P [09:51:20] _joe_: I'm sure 
you do! I have a few of my own as well... [09:51:34] _joe_: feel free to send them, and I'll include them... [09:51:36] 06Operations, 10Gerrit, 07Beta-Cluster-reproducible, 13Patch-For-Review, 07Upstream: gerrit jgit gc caused mediawiki/core repo problems - https://phabricator.wikimedia.org/T151676#3133217 (10Paladox) This may have been caused by a jgit cache issue rather than gc, see https://groups.google.com/forum/m/#!t... [09:51:43] <_joe_> gehel: for instance, it could be a good idea to make servers that are master-eligible to have their own role [09:52:05] <_joe_> absolutely identical to the main one, but just with that hiera setting [09:52:06] !log mforns@tin Started deploy [analytics/aqs/deploy@a5e1775]: (no justification provided) [09:52:07] !log start of ladsgroup@tin:/srv/mediawiki-staging/php-1.29.0-wmf.17$ scap sync-dir php-1.29.0-wmf.17/extensions/Wikidata "Update Wikidata - fix term validation (T161263)" [09:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:18] T161263: Wikidata does not accept characters ending in \x85 (Cyrillic х, Armenian Յ, Arabic م etc.) in labels/aliases/descriptions - https://phabricator.wikimedia.org/T161263 [09:52:48] _joe_: not sure about that one... 
master or not master is really a configuration detail for elasticsearch, it does not "feel" like a role [09:53:23] <_joe_> gehel: well the server will have a different role :) [09:53:47] !log mforns@tin Finished deploy [analytics/aqs/deploy@a5e1775]: (no justification provided) (duration: 01m 41s) [09:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:01] <_joe_> anyways, I'll take a look later :) [09:54:03] !log ladsgroup@tin Synchronized php-1.29.0-wmf.17/extensions/Wikidata: Update Wikidata - fix term validation (T161263) (duration: 02m 22s) [09:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:27] _joe_: Oh, I now see what you mean with hiera (/me is still not used to this role hierarchy) [09:55:38] _joe_: yeah, that would make more sense... [09:56:14] 06Operations, 10Analytics, 06Reading-Web-Backlog, 10Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2732531 (10Nemo_bis) >>! In T148780#2891117, @mforns wrote: > Until today, there are certain browser versions that are not populating the referre... [09:56:32] <_joe_> !log rolling restart of restbase in codfw to pick up the new parsoid config [09:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:54] hashar: It seems strange but it didn't resolve the issue [09:57:54] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 13Patch-For-Review, 07Wikimedia-Multiple-active-datacenters: Check the size of every cluster in codfw to see if it matches eqiad's capacity - https://phabricator.wikimedia.org/T156023#3133233 (10elukey) @Joe definitely. I already added 3 new api-apps... [09:57:56] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:58:20] even though it did in mwdebug1002. 
Maybe I missed a step [09:58:51] got to be at the daily, will investigate more after that [09:59:07] (03PS1) 10DCausse: [mwgrep] enable more accurate regex timeout [puppet] - 10https://gerrit.wikimedia.org/r/344925 [10:00:08] Amir1: cache issue ? [10:00:18] Amir1: have you double checked it on beta and mwdebug? [10:00:23] it can be [10:00:43] not yet [10:02:55] Amir1: the change is apparently not on tin [10:03:10] whut [10:03:31] HEAD still point to parent commit [10:03:42] have to: [10:03:44] cd /srv/mediawiki-staging/php-1.29.0-wmf.17 [10:03:49] git submodule update extensions/Wikidata [10:03:56] then on mwdebug1001 : scap pull [10:04:40] I did [10:04:42] I swear [10:06:55] going to double check and try again [10:07:42] (03PS10) 10Filippo Giunchedi: add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) [10:08:06] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:08:59] (03CR) 10jerkins-bot: [V: 04-1] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [10:09:29] (03CR) 10Jcrespo: [C: 032] mariadb-core: Decouple Mariadb semi-sync replication from $::mw_primary [puppet] - 10https://gerrit.wikimedia.org/r/344442 (https://phabricator.wikimedia.org/T161007) (owner: 10Jcrespo) [10:13:33] 06Operations, 10ops-codfw: Degraded RAID on ms-be2005 - https://phabricator.wikimedia.org/T161358#3133279 (10fgiunchedi) a:03Papaul @papaul please replace if you have spares onsite, same deal with other ms-be slated for decomission: we can keep the newer disks upon decom and wipe them [10:13:47] ACKNOWLEDGEMENT - puppet last run on ms-be2005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[mkfs-/dev/sdk1] Filippo Giunchedi T161358 [10:13:50] pulled and works in mwdebug [10:13:51] https://www.wikidata.org/w/index.php?title=Q16226929&diff=470685690&oldid=470627106 [10:14:26] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [10:15:00] !log start of ladsgroup@tin:/srv/mediawiki-staging/php-1.29.0-wmf.17$ scap sync-dir php-1.29.0-wmf.17/extensions/Wikidata "Second try for Update Wikidata - fix term validation (T161263)" [10:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:06] T161263: Wikidata does not accept characters ending in \x85 (Cyrillic х, Armenian Յ, Arabic م etc.) in labels/aliases/descriptions - https://phabricator.wikimedia.org/T161263 [10:15:10] Amir1: magic :) [10:15:36] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:15:57] hashar: unfortunately it did work in the first try too: https://www.wikidata.org/w/index.php?title=Q20031362&diff=470683322&oldid=470629430 [10:16:12] I did "git submodule update" in first try [10:16:26] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:16:30] !log ladsgroup@tin Synchronized php-1.29.0-wmf.17/extensions/Wikidata: Second try for Update Wikidata - fix term validation (T161263) (duration: 02m 05s) [10:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:15] Amir1: all good ? 
[10:19:32] 06Operations, 10ops-codfw, 15User-fgiunchedi: Degraded RAID on ms-be2005 - https://phabricator.wikimedia.org/T161358#3133297 (10fgiunchedi) [10:19:41] (03PS3) 10Giuseppe Lavagetto: realm: get rid of more entries [puppet] - 10https://gerrit.wikimedia.org/r/340996 [10:19:42] hashar: yup, thanks for helping out [10:19:46] I'm learning still [10:20:09] !log Restarting Jenkins to drop the Throttle Concurrent Builds plugin - T158596 [10:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:16] T158596: Remove throttling plugin from Jenkins - https://phabricator.wikimedia.org/T158596 [10:23:37] Amir1: no worries :] [10:26:56] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [10:33:20] (03CR) 10Volans: "a couple of comments inline" (032 comments) [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/344659 (owner: 10Ema) [10:36:06] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:38:01] 06Operations, 13Patch-For-Review: replace fluorine with mwlog servers (was: Upgrade fluorine to trusty/jessie) - https://phabricator.wikimedia.org/T123728#3133379 (10fgiunchedi) @krinkle I thought I'd copied coal data over to graphite2001 (and restored it back on graphite1001) but obviously that's not the case,... 
[10:38:41] (03PS3) 10Muehlenhoff: contint: install PhantomJS from backport [puppet] - 10https://gerrit.wikimedia.org/r/344613 (https://phabricator.wikimedia.org/T137112) (owner: 10Hashar) [10:39:59] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: replace fluorine with mwlog servers (was: Upgrade fluorine to trusty/jessie) - https://phabricator.wikimedia.org/T123728#3133381 (10fgiunchedi) [10:40:46] 06Operations, 06Performance-Team: Move coal from graphite machine(s) - https://phabricator.wikimedia.org/T159354#3133384 (10fgiunchedi) [10:44:18] (03CR) 10Giuseppe Lavagetto: [C: 032] realm: get rid of more entries [puppet] - 10https://gerrit.wikimedia.org/r/340996 (owner: 10Giuseppe Lavagetto) [10:46:16] PROBLEM - DPKG on helium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:47:03] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=(kartotherian|search) [10:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:36] PROBLEM - bacula director process on helium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (bacula), command name bacula-dir [10:48:16] PROBLEM - Check systemd state on helium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:49:53] (03PS3) 10Giuseppe Lavagetto: realm: remove rb_site [puppet] - 10https://gerrit.wikimedia.org/r/340997 [10:54:06] !log upgrade bacula director and storage daemon to 7.4.3 [10:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:31] (03CR) 10Muehlenhoff: [C: 032] contint: install PhantomJS from backport [puppet] - 10https://gerrit.wikimedia.org/r/344613 (https://phabricator.wikimedia.org/T137112) (owner: 10Hashar) [10:57:35] (03PS4) 10Muehlenhoff: contint: install PhantomJS from backport [puppet] - 10https://gerrit.wikimedia.org/r/344613 (https://phabricator.wikimedia.org/T137112) (owner: 10Hashar) [10:58:26] (03CR) 10Muehlenhoff: [V: 032 C: 032] contint: install PhantomJS from backport [puppet] - 10https://gerrit.wikimedia.org/r/344613 (https://phabricator.wikimedia.org/T137112) (owner: 10Hashar) [10:59:04] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=.*-ro [10:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:19] _joe_: can I merge your realm patch along? [10:59:34] <_joe_> moritzm: 1 sec, sorry [10:59:51] ok [10:59:57] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Exec[ensure_absent_mod_mpm_event],Exec[ensure_absent_mod_mpm_worker],Exec[chown /srv/deployment/dumps for datasets] [11:00:00] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=ores [11:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:33] <_joe_> moritzm: go on [11:00:38] <_joe_> sorry for the delay [11:01:36] RECOVERY - bacula director process on helium is OK: PROCS OK: 1 process with UID = 110 (bacula), command name bacula-dir [11:01:38] ok [11:02:07] merged [11:02:16] RECOVERY - Check systemd state on helium is OK: OK - running: The system is fully operational [11:02:17] RECOVERY - DPKG on helium is OK: All packages OK [11:03:56] !log performed bacula schema change on db1016 for database bacula [11:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:06] jynus: marostegui ^ [11:04:14] I hope you don't mind me doing that [11:04:28] akosiaris: that is fine! what was the change, just for the record? [11:04:32] a simple table creation and a couple of alter tables [11:04:38] lemme find the exact code [11:07:25] use m2-master [11:11:22] marostegui: https://anonscm.debian.org/cgit/pkg-bacula/bacula.git/tree/updatedb [11:11:34] jynus: bacula is on m1 btw [11:11:47] m2 is otrs IIRC [11:11:54] ok [11:11:55] sorry [11:12:02] but same stuff :-) [11:12:08] but I did use the m1-master (kind of) [11:12:09] use m1-master.eqiad.wmnet [11:12:24] use the proxy - it is safer in case it fails over [11:12:26] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:12:45] ok, will do next time (in about 2-3 years :P) [11:12:51] well bacula related at least [11:14:29] !log upgraded bacula-sd to 7.4.3+dfsg-1+sid1~bpo8+1 on heze as well [11:14:32] upgrade done [11:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:08] (03CR) 10Alexandros Kosiaris: [C: 031] redis::monitoring: convert ores to nrpe, cleanup [puppet] - 10https://gerrit.wikimedia.org/r/344624 (owner: 10Giuseppe Lavagetto) [11:19:56] (03CR) 10Giuseppe Lavagetto: [C: 032] realm: remove rb_site [puppet] - 10https://gerrit.wikimedia.org/r/340997 (owner: 10Giuseppe Lavagetto) [11:20:02] (03PS4) 10Giuseppe Lavagetto: realm: remove rb_site [puppet] - 10https://gerrit.wikimedia.org/r/340997 [11:22:36] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.009 second response time [11:25:36] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.094 second response time [11:26:56] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [11:32:28] 06Operations, 10DBA, 05DC-Switchover-Prep-Q3-2016-17, 13Patch-For-Review, 07Wikimedia-Multiple-active-datacenters: Decouple Mariadb semi-sync replication from $::mw_primary - https://phabricator.wikimedia.org/T161007#3133548 (10jcrespo) This is now deployed, only applying to shards other than s6 is left.... 
[11:35:36] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route returned the unexpected status 403 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test ret [11:36:30] <_joe_> this is me [11:36:32] <_joe_> ^^ [11:36:36] PROBLEM - cxserver endpoints health on scb2004 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200) [11:36:38] <_joe_> checking why is this happening [11:36:58] <_joe_> I just changed a url, something tells me it's not as simple as I hoped in those cases, heh [11:37:17] (03PS1) 10Giuseppe Lavagetto: Revert "realm: remove rb_site" [puppet] - 10https://gerrit.wikimedia.org/r/344931 [11:37:25] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Revert "realm: remove rb_site" [puppet] - 10https://gerrit.wikimedia.org/r/344931 (owner: 10Giuseppe Lavagetto) [11:39:36] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [11:40:26] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [11:40:36] RECOVERY - cxserver endpoints health on scb2004 is OK: All endpoints are healthy [11:44:57] 06Operations: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401#3133559 (10MoritzMuehlenhoff) These are fully rolled out: systemd libfcgi-perl nss-pam-ldapd nettle exim4 [11:45:52] (03CR) 10Hashar: [V: 031] "Verified and it is available now :]" [puppet] - 10https://gerrit.wikimedia.org/r/344613 (https://phabricator.wikimedia.org/T137112) (owner: 10Hashar) [11:46:25] (03PS1) 10Deskana: Updates and typo fixes to CirrusSearch-common.php [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/344933 [11:51:00] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1053" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344934 [11:51:56] PROBLEM - puppet last run on db1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:51:59] !log Deploy new index on db1040, s4 primary master table: commonswiki.image - T160415 [11:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:05] T160415: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415 [11:53:26] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1053" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344934 [11:58:36] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1053" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344934 (owner: 10Marostegui) [12:00:29] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1053" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344934 (owner: 10Marostegui) [12:00:39] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1053" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344934 (owner: 10Marostegui) [12:01:53] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1053 T160415 - T73563 (duration: 01m 07s) [12:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:01] T160415: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415 [12:02:01] T73563: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563 [12:02:35] <_joe_> !log experimenting with cxserver config on scb2004 [12:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:36] PROBLEM - cxserver endpoints health on scb2004 is CRITICAL: /v1/page/{language}/{title}{/revision} 
(Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200) [12:08:36] RECOVERY - cxserver endpoints health on scb2004 is OK: All endpoints are healthy [12:09:37] (03PS8) 10Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) [12:15:56] PROBLEM - puppet last run on bast1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:16:57] !log start of ladsgroup@tin:/srv/mediawiki-staging$ scap sync-file php-1.29.0-wmf.17/extensions/Wikidata/composer.lock 'Third try for Update Wikidata - fix term validation (T161263) Part I' [12:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:05] T161263: Wikidata does not accept characters ending in \x85 (Cyrillic х, Armenian Յ, Arabic م etc.) in labels/aliases/descriptions - https://phabricator.wikimedia.org/T161263 [12:17:33] !log ladsgroup@tin Synchronized php-1.29.0-wmf.17/extensions/Wikidata/composer.lock: Third try for Update Wikidata - fix term validation (T161263) Part I (duration: 00m 44s) [12:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:07] (03PS9) 10Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) [12:19:24] !log upgrade grafana to 4.2.0 on krypton T161193 [12:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:31] T161193: Upgrade to Grafana 4.2.0 - https://phabricator.wikimedia.org/T161193 [12:19:40] !log ladsgroup@tin Synchronized php-1.29.0-wmf.17/extensions/Wikidata/extensions/Wikibase/: Third try for Update Wikidata - fix term validation (T161263) Part II (duration: 01m 32s) [12:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:56] 
RECOVERY - puppet last run on db1024 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [12:20:49] (03CR) 10Gehel: [C: 031] "LGTM - thanks for the cleanup!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344933 (owner: 10Deskana) [12:21:09] !log ladsgroup@tin Synchronized php-1.29.0-wmf.17/extensions/Wikidata/vendor/composer/installed.json: Third try for Update Wikidata - fix term validation (T161263) Part III (duration: 00m 43s) [12:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:30] (03PS1) 10Alexandros Kosiaris: Defaults value for profile::docker::builder hiera [puppet] - 10https://gerrit.wikimedia.org/r/344938 [12:32:15] 06Operations, 06Performance-Team, 15User-fgiunchedi: Upgrade to Grafana 4.2.0 - https://phabricator.wikimedia.org/T161193#3133713 (10fgiunchedi) 05Open>03Resolved @Peter grafana has been upgraded, tentatively resolving! [12:33:46] (03PS11) 10Filippo Giunchedi: add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) [12:34:45] (03CR) 10Alexandros Kosiaris: [C: 032] Defaults value for profile::docker::builder hiera [puppet] - 10https://gerrit.wikimedia.org/r/344938 (owner: 10Alexandros Kosiaris) [12:35:07] PROBLEM - puppet last run on es1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:39:04] 06Operations, 06Performance-Team, 06Reading-Web-Backlog, 10Traffic, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3133748 (10Gilles) I take issue with the idea of delaying display of data once it's available, through animation or otherwise.... 
[12:44:51] !log deploying semi-sync replication to all hosts on codfw T161007 [12:44:56] RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [12:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:57] T161007: Decouple Mariadb semi-sync replication from $::mw_primary - https://phabricator.wikimedia.org/T161007 [12:47:01] !log Run pt-table-checksum for a couple of hundred small wikis in es2 - T161510 [12:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:08] T161510: Run pt-table-checksum on es2 - https://phabricator.wikimedia.org/T161510 [12:48:36] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:58:41] 06Operations, 10Citoid, 10Graphoid, 10VisualEditor, and 3 others: SCB services should not use a proxy for our domains - https://phabricator.wikimedia.org/T97530#1245380 (10Joe) This is now a blocker (sort-of) for the current work on using DNS for discovery: in fact as soon as I switched the parameter for t... [12:58:49] (03PS10) 10Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170327T1300). Please do the needful. [13:00:04] Urbanecm and dcausse: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:18] Hello, I can SWAT. [13:00:19] o/ [13:00:48] any favourite order for the merge dcausse? 
[13:01:16] Dereckson: the order in the wikitech page is fine to me [13:01:20] o/ [13:01:22] ok [13:01:26] (03PS2) 10Dereckson: Updates and typo fixes to CirrusSearch-common.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344933 (owner: 10Deskana) [13:01:51] (03CR) 10Dereckson: [C: 032] Updates and typo fixes to CirrusSearch-common.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344933 (owner: 10Deskana) [13:03:02] (03Merged) 10jenkins-bot: Updates and typo fixes to CirrusSearch-common.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344933 (owner: 10Deskana) [13:03:24] (03CR) 10jenkins-bot: Updates and typo fixes to CirrusSearch-common.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344933 (owner: 10Deskana) [13:04:01] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3133807 (10faidon) a:05Ottomata>03RobH [13:04:06] RECOVERY - puppet last run on es1019 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [13:05:22] dcausse: 344933 live on mwdebug1002 [13:05:29] Dereckson: testing [13:05:35] (03CR) 10Filippo Giunchedi: "> Can you tell if the http traffic is PUTs and/or GETs?" 
[puppet] - 10https://gerrit.wikimedia.org/r/343263 (https://phabricator.wikimedia.org/T160670) (owner: 10Gilles) [13:06:14] Dereckson: no errors, I suppose it's fine :) [13:08:50] !log dereckson@tin Synchronized wmf-config/CirrusSearch-common.php: Updates and typo fixes to CirrusSearch-common.php ([[gerrit:344933]]) (duration: 00m 43s) [13:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:38] (03PS6) 10Dereckson: [es5 upgrade] step 5: restore normal operations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342034 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [13:09:56] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342034 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [13:10:04] Dereckson: for this one: deploy order does not matter ^ [13:10:06] RECOVERY - Confd template for /var/lib/gdnsd/discovery-appservers-rw.state on radon is OK: No errors detected [13:10:14] ok [13:10:31] CS.php will affect translate, IS.php will affect cirrus and CirrusSearch-common is just a cleanup [13:11:16] 06Operations, 10hardware-requests: codfw/eqiad: 12x swift backend refresh - https://phabricator.wikimedia.org/T149336#2749097 (10faidon) @RobH @fgiunchedi aren't these T149582 & T155659? Can we resolve this? 
[13:12:19] (03Merged) 10jenkins-bot: [es5 upgrade] step 5: restore normal operations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342034 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [13:12:35] (03CR) 10jenkins-bot: [es5 upgrade] step 5: restore normal operations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342034 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [13:13:07] (03PS23) 10BBlack: [POC] DNS zones to puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/342887 [13:13:39] (03PS1) 10Gehel: archiva - do not expose the .indexer directory to gitfat [puppet] - 10https://gerrit.wikimedia.org/r/344940 [13:13:44] dcausse: live on mwdebug1002 [13:13:49] Dereckson: thanks, testing [13:15:56] PROBLEM - puppet last run on analytics1057 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:16:07] (03PS2) 10Dereckson: Add khw.wikipedia logos to static resources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343585 (https://phabricator.wikimedia.org/T160865) [13:16:15] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343585 (https://phabricator.wikimedia.org/T160865) (owner: 10Dereckson) [13:17:50] Dereckson: looks good to me, it's possible that we see a spike of pool counter errors when syncing IS.php (time to warmup) [13:18:02] ok, let's sync and see that [13:18:06] ok [13:18:15] First sync will be no-op (tests/ file) [13:18:22] ok [13:18:36] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [13:18:38] (03Merged) 10jenkins-bot: Add khw.wikipedia logos to static resources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343585 (https://phabricator.wikimedia.org/T160865) (owner: 10Dereckson) [13:18:54] !log dereckson@tin Synchronized tests/cirrusTest.php: [es5 upgrade] step 5: restore normal operations (T157479, 1/2) (duration: 00m 48s) [13:19:01] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:02] T157479: Put together a production migration plan for ES 2 -> ES 5 - https://phabricator.wikimedia.org/T157479 [13:19:27] (03CR) 10jenkins-bot: Add khw.wikipedia logos to static resources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343585 (https://phabricator.wikimedia.org/T160865) (owner: 10Dereckson) [13:19:44] !log dereckson@tin Synchronized wmf-config/: [es5 upgrade] step 5: restore normal operations (T157479, 2/2) (duration: 00m 49s) [13:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:44] !log dereckson@tin Synchronized static/images/project-logos/: Add khw.wikipedia logos to static resources (T160865) (duration: 00m 43s) [13:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:53] T160865: Create Wikipedia Khowar - https://phabricator.wikimedia.org/T160865 [13:21:14] (03PS2) 10Dereckson: [cirrus] enable more accurate regex timeout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344922 (https://phabricator.wikimedia.org/T161095) (owner: 10DCausse) [13:21:21] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344922 (https://phabricator.wikimedia.org/T161095) (owner: 10DCausse) [13:21:56] Urbanecm: ping? [13:22:07] !log repooled mw1261 (now that fix for lcfirst() issue from T161095 is deployed) [13:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:13] T161095: Uninitialized string offset warnings with HHVM 3.18 in LanguageAz.php and LanguageKk.php - https://phabricator.wikimedia.org/T161095 [13:22:33] (03CR) 10Gehel: [C: 04-1] "LGTM (minor comment inline). I don't know much about logstash plugins, but this looks reasonable." 
(031 comment) [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/344704 (owner: 10EBernhardson) [13:22:35] (03Merged) 10jenkins-bot: [cirrus] enable more accurate regex timeout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344922 (https://phabricator.wikimedia.org/T161095) (owner: 10DCausse) [13:22:43] (03CR) 10jenkins-bot: [cirrus] enable more accurate regex timeout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344922 (https://phabricator.wikimedia.org/T161095) (owner: 10DCausse) [13:23:31] dcausse: 344922 live on mwdebug1002 [13:23:38] Dereckson: testing [13:25:03] (03PS2) 10Dereckson: Fix wgLogoHD keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344796 (https://phabricator.wikimedia.org/T161416) [13:25:56] Dereckson: works as expected [13:26:00] ok [13:26:10] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344796 (https://phabricator.wikimedia.org/T161416) (owner: 10Dereckson) [13:26:44] !log dereckson@tin Synchronized wmf-config/CirrusSearch-common.php: [cirrus] enable more accurate regex timeout (T161095) (duration: 00m 44s) [13:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:06] Dereckson: thanks a ton for the deploys! [13:27:28] (03Merged) 10jenkins-bot: Fix wgLogoHD keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344796 (https://phabricator.wikimedia.org/T161416) (owner: 10Dereckson) [13:27:36] You're welcome dcausse [13:28:01] 344796 live on mwdebug1002 [13:29:39] 06Operations, 10Dumps-Generation: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#3133894 (10chasemp) > > **Proposed setup version 2 (revised from https://phabricator.wikimedia.org/T118154#3054146)** > > Phase 1: > > - introduce labstore1006 and labsto... 
[13:29:44] (03CR) 10jenkins-bot: Fix wgLogoHD keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344796 (https://phabricator.wikimedia.org/T161416) (owner: 10Dereckson) [13:29:58] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Fix wgLogoHD 2.5x key (T161416) (duration: 00m 43s) [13:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:04] T161416: Undefined index: 1.5x in ResourceLoaderSkinModule on ff.wikipedia - https://phabricator.wikimedia.org/T161416 [13:30:35] 13:27:37 Running "banana:all" (banana) task [13:30:35] 13:27:38 >> No metadata block in the ru messages file. [13:30:36] 13:27:38 Warning: Task "banana:all" failed. Use --force to continue. [13:30:39] will defer this one [13:31:07] Urbanecm: ping? [13:31:56] dcausse: 25 error: Timeout reached waiting for an available pooled curl connection! in /srv/mediawiki/php-1.29.0-wmf.17/extensions/CirrusSearch/includes/Elastica/ [13:32:00] PooledHttp.php on line 67 [13:32:03] (only 25x) [13:32:08] (and not increasing) [13:32:16] Dereckson: I think IS was just deployed on your last scap [13:32:25] actually activating eqiad [13:32:52] eqiad needs to warmup [13:32:54] that's plausible, sync wm-config/ doesn't alway sync IS [13:32:58] ok [13:33:20] I have 4 nodes struggling but it's getting better [13:34:11] Dereckson: looks good to me, warmup spike seems to be done [13:34:25] ok [13:37:44] 06Operations, 10MediaWiki-extensions-PageAssessments, 10Wikimedia-General-or-Unknown, 13Patch-For-Review: foreachwikiindblist regular cronspam - https://phabricator.wikimedia.org/T159438#3133902 (10jcrespo) root@ receives one of these emails every day: ``` The following extensions are required to be insta... [13:37:48] Dereckson: are you still on tin? 
I think same problem happened to CommonSettings.php [13:37:53] translate still points to codfw [13:38:18] mt bad should have double checked with mwscript [13:38:23] *my [13:39:08] PROBLEM - puppet last run on mw1175 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:39:16] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [13:39:41] dcausse: what do you wish we do? [13:39:45] resync CS? [13:39:48] Dereckson: yes [13:40:20] syncing [13:40:38] (I've touched it before to refresh timestamp) [13:40:41] ok [13:40:58] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: no-op, to force resync (duration: 00m 43s) [13:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:12] 06Operations, 10hardware-requests: codfw/eqiad: 12x swift backend refresh - https://phabricator.wikimedia.org/T149336#3133908 (10fgiunchedi) a:03RobH @faidon, I think so unless @RobH needs it [13:41:25] > print($wgTranslateTranslationDefaultService); [13:41:27] eqiad [13:41:33] Dereckson: perfect^, thanks! 
[13:41:37] you're welcome [13:42:16] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [13:42:35] (03PS2) 10Giuseppe Lavagetto: redis::monitoring: convert ores to nrpe, cleanup [puppet] - 10https://gerrit.wikimedia.org/r/344624 [13:42:58] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] redis::monitoring: convert ores to nrpe, cleanup [puppet] - 10https://gerrit.wikimedia.org/r/344624 (owner: 10Giuseppe Lavagetto) [13:43:35] 06Operations, 10LDAP-Access-Requests, 06WMDE-Analytics-Engineering, 10Wikidata, 15User-Addshore: Add goransm to ldap/wmde group - https://phabricator.wikimedia.org/T160924#3133912 (10GoranSMilovanovic) If this helps here, I have an WMDE e-mail account now: goran.milovanovic_ext@wikimedia.de [13:43:59] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3133913 (10GoranSMilovanovic) If this helps here, I have an WMDE e-mail now: goran.milovanovic_ext@wikimedia.de [13:44:02] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Add request URL to thumbor errors - https://phabricator.wikimedia.org/T151553#3133914 (10Gilles) [13:44:56] RECOVERY - puppet last run on analytics1057 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [13:48:07] (03PS4) 10Gehel: Update mwgrep for elasticsearch 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344044 (https://phabricator.wikimedia.org/T161055) (owner: 10EBernhardson) [13:48:13] 06Operations, 10MediaWiki-extensions-PageAssessments, 10Wikimedia-General-or-Unknown, 13Patch-For-Review: foreachwikiindblist regular cronspam - https://phabricator.wikimedia.org/T159438#3133928 (10Ladsgroup) Why me? I'm not a maintainer of PageAssessments and I highly doubt that ORES extension (or service... 
[13:49:42] (03CR) 10Gehel: [C: 032] Update mwgrep for elasticsearch 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344044 (https://phabricator.wikimedia.org/T161055) (owner: 10EBernhardson) [13:49:48] 06Operations, 10hardware-requests: codfw/eqiad: 12x swift backend refresh - https://phabricator.wikimedia.org/T149336#3133931 (10Cmjohnson) The 12 swift servers in eqiad are in....typically Mark waits for the paid invoice from finance before he resolves them (I think) [13:50:34] 06Operations, 10MediaWiki-extensions-PageAssessments, 10Wikimedia-General-or-Unknown, 13Patch-For-Review: foreachwikiindblist regular cronspam - https://phabricator.wikimedia.org/T159438#3133934 (10jcrespo) Ladscroup, my apologies- wrong person. I am sorry. I meant to ping @kaldari [13:54:20] (03PS4) 10BBlack: varnish: refactor all clusters for active/active [puppet] - 10https://gerrit.wikimedia.org/r/339667 (https://phabricator.wikimedia.org/T134404) [13:55:32] (03Abandoned) 10BBlack: varnish: switch all clusters to req_handling [WIP, 2/4] [puppet] - 10https://gerrit.wikimedia.org/r/339668 (https://phabricator.wikimedia.org/T134404) (owner: 10BBlack) [13:55:37] (03Abandoned) 10BBlack: varnish: per-app routing [WIP, 3/4] [puppet] - 10https://gerrit.wikimedia.org/r/339669 (https://phabricator.wikimedia.org/T134404) (owner: 10BBlack) [13:55:41] (03CR) 10Gilles: "The IRC alerting worked, but the team mailing list never received the email. 
I see that a lot of contact groups have manually-defined memb" [puppet] - 10https://gerrit.wikimedia.org/r/342431 (https://phabricator.wikimedia.org/T156245) (owner: 10Gilles) [13:55:43] (03Abandoned) 10BBlack: varnish: move applayer info back to hiera [WIP, 4/4] [puppet] - 10https://gerrit.wikimedia.org/r/339671 (https://phabricator.wikimedia.org/T134404) (owner: 10BBlack) [13:56:39] (03PS2) 10Gehel: archiva - do not expose the .indexer directory to gitfat [puppet] - 10https://gerrit.wikimedia.org/r/344940 [13:58:52] (03CR) 10DCausse: [C: 031] archiva - do not expose the .indexer directory to gitfat [puppet] - 10https://gerrit.wikimedia.org/r/344940 (owner: 10Gehel) [13:59:06] (03PS3) 10Gehel: archiva - do not expose the .indexer directory to gitfat [puppet] - 10https://gerrit.wikimedia.org/r/344940 [14:00:45] (03CR) 10Gehel: [C: 032] archiva - do not expose the .indexer directory to gitfat [puppet] - 10https://gerrit.wikimedia.org/r/344940 (owner: 10Gehel) [14:00:56] 06Operations, 06Performance-Team, 10Thumbor: Add request URL to thumbor errors - https://phabricator.wikimedia.org/T151553#2820915 (10Gilles) [14:02:09] PROBLEM - puppet last run on mw1261 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[hhvm] [14:03:24] Dereckson, I've totally forgot at the SWAT. Could you deploy the path now please? Thank you! 
[14:03:26] ^ mw1261 is me [14:05:45] Urbanecm: sure [14:05:48] Dereckson, thank you [14:06:07] (03PS3) 10Dereckson: Add autopatrolled group to svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344393 (https://phabricator.wikimedia.org/T161210) (owner: 10Urbanecm) [14:06:09] RECOVERY - puppet last run on mw1175 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [14:07:02] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344393 (https://phabricator.wikimedia.org/T161210) (owner: 10Urbanecm) [14:08:21] (03Merged) 10jenkins-bot: Add autopatrolled group to svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344393 (https://phabricator.wikimedia.org/T161210) (owner: 10Urbanecm) [14:08:30] (03CR) 10jenkins-bot: Add autopatrolled group to svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344393 (https://phabricator.wikimedia.org/T161210) (owner: 10Urbanecm) [14:09:16] (03CR) 10DCausse: "Damn copy/paster, wrong bug id here, sorry (should be: T152895)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344922 (https://phabricator.wikimedia.org/T161095) (owner: 10DCausse) [14:09:33] Urbanecm: live on mwdebug1002 [14:10:19] Working [14:12:39] (03PS12) 10Filippo Giunchedi: add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) [14:12:51] (03CR) 10Ottomata: [C: 031] "We can also remove the analytics1027 director, but we can do that in a later commit I suppose." 
[puppet] - 10https://gerrit.wikimedia.org/r/344916 (https://phabricator.wikimedia.org/T159527) (owner: 10Elukey) [14:13:40] Urbanecm: okay, syncing [14:14:26] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Add autopatrolled group to svwiki (T161210) (duration: 00m 50s) [14:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:36] T161210: Autopatrolled user group on Swedish Wikipedia - https://phabricator.wikimedia.org/T161210 [14:14:36] (03CR) 10Filippo Giunchedi: [C: 032] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [14:15:31] Dereckson, thank you [14:18:40] (03CR) 10Ema: [C: 031] Move hue.w.o's backend to thorium [puppet] - 10https://gerrit.wikimedia.org/r/344916 (https://phabricator.wikimedia.org/T159527) (owner: 10Elukey) [14:25:39] (03CR) 10Elukey: [C: 04-1] "Just found:" [puppet] - 10https://gerrit.wikimedia.org/r/344916 (https://phabricator.wikimedia.org/T159527) (owner: 10Elukey) [14:25:59] PROBLEM - puppet last run on iridium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:30:19] RECOVERY - puppet last run on mw1261 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [14:33:05] !log rebuilding ttmserver index in elastic@codfw from wasat [14:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:32] andrewbogott: o/ - /usr/local/bin/archive-instances on labmon1001.eqiad.wmnet could be owned by _graphite to fix the cronspam issue? [14:33:50] not really aware of what the config is in there [14:34:27] elukey: I'm not sure I know either... [14:34:52] elukey: what's an example subj: of the cronspam you're looking at? 
[14:35:12] andrewbogott: Cron <_graphite@labmon1001> /usr/local/bin/archive-instances [14:35:41] oh, hm, ok [14:35:44] I'll make a bug [14:36:43] thanks :) [14:43:39] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3134032 (10Papaul) Here is the result of the HW diagnostic. {F7038006} [14:49:28] 06Operations, 07HHVM: HHVM 3.18 crashes when running Cirrus maintenance script - https://phabricator.wikimedia.org/T161520#3134051 (10MoritzMuehlenhoff) [14:49:43] (03PS5) 10Gilles: Enable memcache-based Thumbor broken thumbnail throttling [puppet] - 10https://gerrit.wikimedia.org/r/342811 (https://phabricator.wikimedia.org/T151065) [14:49:51] 06Operations, 07HHVM: HHVM 3.18 crashes when running Cirrus maintenance script - https://phabricator.wikimedia.org/T161520#3134066 (10MoritzMuehlenhoff) [14:49:53] 06Operations, 07HHVM, 13Patch-For-Review, 07Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3134065 (10MoritzMuehlenhoff) [14:51:17] 06Operations, 07HHVM: HHVM 3.18 crashes when running Cirrus maintenance script - https://phabricator.wikimedia.org/T161520#3134067 (10dcausse) [14:51:53] (03CR) 10jerkins-bot: [V: 04-1] Enable memcache-based Thumbor broken thumbnail throttling [puppet] - 10https://gerrit.wikimedia.org/r/342811 (https://phabricator.wikimedia.org/T151065) (owner: 10Gilles) [14:53:59] RECOVERY - puppet last run on iridium is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [14:54:02] (03PS5) 10BBlack: varnish: refactor all clusters for active/active [puppet] - 10https://gerrit.wikimedia.org/r/339667 (https://phabricator.wikimedia.org/T134404) [14:55:35] (03PS1) 10Volans: Minor fixes for library integrations [switchdc] - 10https://gerrit.wikimedia.org/r/344951 (https://phabricator.wikimedia.org/T160178) [14:56:10] (03PS1) 10Elukey: Add simple proxy for hue.w.o's backend on thorium 
[puppet] - 10https://gerrit.wikimedia.org/r/344952 (https://phabricator.wikimedia.org/T159527) [14:57:34] (03PS2) 10Elukey: Add simple proxy for hue.w.o's backend on thorium [puppet] - 10https://gerrit.wikimedia.org/r/344952 (https://phabricator.wikimedia.org/T159527) [14:57:42] (03PS2) 10Hashar: swift: lower replication interval for beta [puppet] - 10https://gerrit.wikimedia.org/r/344387 (https://phabricator.wikimedia.org/T160990) [14:58:38] 06Operations, 07HHVM: HHVM 3.18 crashes when Cirrus tries to fetch another wiki config via maint script - https://phabricator.wikimedia.org/T161520#3134083 (10dcausse) [14:59:46] (03CR) 10Ottomata: [C: 031] Add simple proxy for hue.w.o's backend on thorium [puppet] - 10https://gerrit.wikimedia.org/r/344952 (https://phabricator.wikimedia.org/T159527) (owner: 10Elukey) [14:59:49] 06Operations, 10ops-codfw, 15User-fgiunchedi: Degraded RAID on ms-be2005 - https://phabricator.wikimedia.org/T161358#3134084 (10Papaul) a:05Papaul>03fgiunchedi Disk replacement complete [15:00:59] RECOVERY - MegaRAID on ms-be2005 is OK: OK: optimal, 13 logical, 13 physical [15:01:18] (03PS1) 10Filippo Giunchedi: prometheus: fix PDU detection and snmp_exporter config [puppet] - 10https://gerrit.wikimedia.org/r/344953 (https://phabricator.wikimedia.org/T148541) [15:02:06] (03PS6) 10Gilles: Enable memcache-based Thumbor broken thumbnail throttling [puppet] - 10https://gerrit.wikimedia.org/r/342811 (https://phabricator.wikimedia.org/T151065) [15:02:34] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: fix PDU detection and snmp_exporter config [puppet] - 10https://gerrit.wikimedia.org/r/344953 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [15:04:54] (03PS3) 10Elukey: Add simple proxy for hue.w.o's backend on thorium [puppet] - 10https://gerrit.wikimedia.org/r/344952 (https://phabricator.wikimedia.org/T159527) [15:05:30] 06Operations, 10hardware-requests: Rename labtestmetal2001 - https://phabricator.wikimedia.org/T161265#3134094 
(10Papaul) 05Open>03Resolved complete [15:06:38] (03PS3) 10Hashar: swift: lower replication interval for beta [puppet] - 10https://gerrit.wikimedia.org/r/344387 (https://phabricator.wikimedia.org/T160990) [15:07:31] 06Operations, 10Monitoring, 13Patch-For-Review, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Evaluate prometheus snmp_exporter for Torrus PDUs metrics use case - https://phabricator.wikimedia.org/T148541#3134098 (10fgiunchedi) Most pieces are in place now: left to do is allowing prometheus in codf... [15:08:08] (03CR) 10Elukey: "http://puppet-compiler.wmflabs.org/5923/thorium.eqiad.wmnet/ looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/344952 (https://phabricator.wikimedia.org/T159527) (owner: 10Elukey) [15:08:41] (03PS4) 10Elukey: Add simple proxy for hue.w.o's backend on thorium [puppet] - 10https://gerrit.wikimedia.org/r/344952 (https://phabricator.wikimedia.org/T159527) [15:10:12] (03CR) 10Elukey: [C: 032] Add simple proxy for hue.w.o's backend on thorium [puppet] - 10https://gerrit.wikimedia.org/r/344952 (https://phabricator.wikimedia.org/T159527) (owner: 10Elukey) [15:13:21] 06Operations, 10hardware-requests: Rename labtestmetal2001 - https://phabricator.wikimedia.org/T161265#3134105 (10Andrew) Thank you! [15:13:56] (03CR) 10Hashar: "PS2 had some ruby/erb failure" [puppet] - 10https://gerrit.wikimedia.org/r/344387 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [15:21:17] (03CR) 10Gilles: "PCC run: https://puppet-compiler.wmflabs.org/5925/" [puppet] - 10https://gerrit.wikimedia.org/r/342811 (https://phabricator.wikimedia.org/T151065) (owner: 10Gilles) [15:21:49] 06Operations, 10media-storage: refresh swift hardware in codfw/eqiad - https://phabricator.wikimedia.org/T148647#3134112 (10RobH) [15:21:51] 06Operations, 10hardware-requests: codfw/eqiad: 12x swift backend refresh - https://phabricator.wikimedia.org/T149336#3134110 (10RobH) 05Open>03Resolved This is just the hw-request, the #procurement stays open. Resolving! 
[15:23:12] (03CR) 10Elukey: [C: 031] "After https://gerrit.wikimedia.org/r/#/c/344952 there is an apache vhost on port 80 for hue.w.o, so we can proceed now!" [puppet] - 10https://gerrit.wikimedia.org/r/344916 (https://phabricator.wikimedia.org/T159527) (owner: 10Elukey) [15:23:18] (03PS2) 10Elukey: Move hue.w.o's backend to thorium [puppet] - 10https://gerrit.wikimedia.org/r/344916 (https://phabricator.wikimedia.org/T159527) [15:35:40] PROBLEM - puppet last run on ms-be1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:35:41] !log mobrovac@tin Started deploy [cxserver/deploy@40e86ad]: Add discovery.wmnet to no_proxy_list [15:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:09] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:36:59] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 10 minutes ago with 0 failures [15:38:20] !log mobrovac@tin Finished deploy [cxserver/deploy@40e86ad]: Add discovery.wmnet to no_proxy_list (duration: 02m 39s) [15:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:56] 06Operations, 07HHVM: HHVM 3.18 crashes when Cirrus tries to fetch another wiki config via maint script - https://phabricator.wikimedia.org/T161520#3134051 (10hashar) Jobs do shell out to `mwscript maintenance/getConfiguration.php` to get configuration for all the wiki projects. That has previously lead to ou... 
[15:42:08] !log mobrovac@tin Started deploy [mobileapps/deploy@aed916b]: Add discovery.wmnet to no_proxy_list [15:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:53] 06Operations, 06Performance-Team, 10Thumbor: Thumbor inexplicably 504s intermittently on files that render fine later - https://phabricator.wikimedia.org/T150746#3134255 (10Gilles) Things don't seem to have improved :( ``` gilles@ms-fe1005:/var/log/swift$ cat server.log.1 | grep "Mediawiki: 200 Thumbor: 504... [15:46:12] !log mobrovac@tin Finished deploy [mobileapps/deploy@aed916b]: Add discovery.wmnet to no_proxy_list (duration: 04m 05s) [15:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:58] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 13Patch-For-Review, 07Wikimedia-Multiple-active-datacenters: MediaWiki Datacenter Switchover automation - https://phabricator.wikimedia.org/T160178#3134270 (10jcrespo) [15:48:00] 06Operations, 10DBA, 05DC-Switchover-Prep-Q3-2016-17, 13Patch-For-Review, 07Wikimedia-Multiple-active-datacenters: Decouple Mariadb semi-sync replication from $::mw_primary - https://phabricator.wikimedia.org/T161007#3134267 (10jcrespo) 05Open>03Resolved Semi sync is deployed on all masters, independ... [15:51:48] (03PS1) 10Giuseppe Lavagetto: realm: remove rb_site [puppet] - 10https://gerrit.wikimedia.org/r/344960 [15:51:56] 06Operations, 10Citoid, 10Graphoid, 10VisualEditor, and 4 others: SCB services should not use a proxy for our domains - https://phabricator.wikimedia.org/T97530#3134331 (10mobrovac) After switching to Scap3 config deploys only the services that need the proxy to contact outside services use it. The except... 
[15:52:13] 06Operations, 07HHVM: HHVM 3.18 crashes when Cirrus tries to fetch another wiki config via maint script - https://phabricator.wikimedia.org/T161520#3134336 (10hashar) Sorry the suspected exec is: `/usr/bin/php /srv/mediawiki/multiversion/MWScript.php maintenance/getConfiguration.php` And `php` is HHVM on mw12... [15:53:18] 06Operations, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Upgrade mysqld_exporter to 0.10.0 - https://phabricator.wikimedia.org/T161296#3127748 (10jcrespo) Is this blocked on me to configure/upgrade the deployed exporter or has is the package/release not yet available? [15:53:46] 06Operations, 10DBA, 05DC-Switchover-Prep-Q3-2016-17, 13Patch-For-Review, 07Wikimedia-Multiple-active-datacenters: Decouple Mariadb semi-sync replication from $::mw_primary - https://phabricator.wikimedia.org/T161007#3134344 (10Marostegui) >>! In T161007#3134267, @jcrespo wrote: > > We could consider add... [15:53:48] (03CR) 10Elukey: [C: 031] "http://puppet-compiler.wmflabs.org/5926/cp3008.esams.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/344916 (https://phabricator.wikimedia.org/T159527) (owner: 10Elukey) [15:55:59] PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:02:34] (03PS3) 10EBernhardson: Update logstash plugins for 5.x [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/344704 [16:02:59] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [16:03:40] RECOVERY - puppet last run on ms-be1011 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [16:05:47] (03CR) 10EBernhardson: Update logstash plugins for 5.x (031 comment) [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/344704 (owner: 10EBernhardson) [16:06:56] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: db1057 does not react to powercycle/powerdown/powerup commands - https://phabricator.wikimedia.org/T160435#3134411 (10Cmjohnson) I am not sure which one you think we can pull from? All the db's that are being decom'd are different server types and olde... [16:07:58] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: db1057 does not react to powercycle/powerdown/powerup commands - https://phabricator.wikimedia.org/T160435#3134420 (10Marostegui) >>! In T160435#3134411, @Cmjohnson wrote: > I am not sure which one you think we can pull from? All the db's that are > be... 
[16:09:24] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: db1057 does not react to powercycle/powerdown/powerup commands - https://phabricator.wikimedia.org/T160435#3134427 (10Cmjohnson) We may be able to salvage data.....I can try moving the raid card and disks to a decom R510 [16:12:48] (03PS4) 10Hashar: (WIP) compile authdns::config [puppet] - 10https://gerrit.wikimedia.org/r/343747 [16:14:14] (03CR) 10jerkins-bot: [V: 04-1] (WIP) compile authdns::config [puppet] - 10https://gerrit.wikimedia.org/r/343747 (owner: 10Hashar) [16:14:30] (03PS1) 10Subramanya Sastry: Allow parsoid-vd-client service to be controlled outside systemd [puppet] - 10https://gerrit.wikimedia.org/r/344961 [16:19:23] 06Operations, 10Monitoring: incident 20170323-wikibase did not trigger Icinga paging - https://phabricator.wikimedia.org/T161528#3134440 (10Dzahn) [16:19:49] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3134452 (10MF-Warburg) [16:21:52] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3134474 (10MF-Warburg) [16:23:59] RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:26:57] (03PS1) 10EBernhardson: [WIP] Upgrade logstash to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) [16:26:59] (03PS1) 10EBernhardson: [WIP] Update elk stack to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344965 (https://phabricator.wikimedia.org/T154473) [16:28:19] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Upgrade logstash to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) (owner: 10EBernhardson) [16:29:59] PROBLEM - puppet last run on mc1022 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:30:39] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: rack and cable frdev1001 - https://phabricator.wikimedia.org/T159887#3134508 (10RobH) a:03Cmjohnson [16:30:45] (03PS2) 10EBernhardson: [WIP] Upgrade logstash to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) [16:30:47] (03PS2) 10EBernhardson: [WIP] Update elk stack to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344965 (https://phabricator.wikimedia.org/T154473) [16:34:48] <_joe_> !log cleaned the bc cache on mw1261, restarted hhvm and repooled [16:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:53] _joe_ (ignorant Luca asking) - How did you clean the bc cache on mw1261? [16:36:07] <_joe_> elukey: with a refined technique [16:36:29] <_joe_> elukey: killed hhvm, rm /var/cache/hhvm/*, started hhvm [16:36:40] <_joe_> that is for cleaning up a server you can depool [16:36:40] thanks :) [16:38:28] I hoped for something more magical via hhvmadm [16:38:34] seems to have solved it, crash happened again so far [16:39:01] 06Operations, 06Performance-Team, 10Thumbor: Thumbor inexplicably 504s intermittently on files that render fine later - https://phabricator.wikimedia.org/T150746#3134528 (10Gilles) Looking at the nginx logs on thumbor1001, I notice that some of the timeouts are for files that don't exist. Example: https://up... [16:40:13] !log restbase deploying f53bec41 [16:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:03] 06Operations, 06Performance-Team, 10Thumbor: Thumbor inexplicably 504s intermittently on files that render fine later - https://phabricator.wikimedia.org/T150746#3134552 (10Gilles) To verify my theory, I would have to be able to log requests to Thumbor when they come in. Since Thumbor is single-threaded I do... 
[16:55:31] 06Operations, 06Services, 10hardware-requests: Eqiad: (3) hardware access request for RESTBase Staging - https://phabricator.wikimedia.org/T161534#3134576 (10Eevans) [16:56:20] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [16:56:20] 06Operations, 06Performance-Team, 10Thumbor: Track nginx request time in Thumbor debug headers - https://phabricator.wikimedia.org/T161535#3134588 (10Gilles) [16:56:37] 06Operations, 06Performance-Team, 10Thumbor: Thumbor inexplicably 504s intermittently on files that render fine later - https://phabricator.wikimedia.org/T150746#2795266 (10Gilles) [16:57:59] RECOVERY - puppet last run on mc1022 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [16:58:29] (03PS1) 10Gilles: Set nginx request time as a header passed to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/344968 (https://phabricator.wikimedia.org/T161535) [17:00:04] gehel: Respected human, time to deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170327T1700). Please do the needful. [17:00:45] SMalyshev: deployment on wdq-beta done, looking good... [17:00:56] great [17:01:59] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [17:02:12] 06Operations, 10Monitoring, 10Traffic: Performance impact evaluation of enabling nginx-lua and nginx-lua-prometheus on tlsproxy - https://phabricator.wikimedia.org/T161101#3134623 (10ema) nginx-lua-prometheus uses a dictionary in [[ https://github.com/openresty/lua-nginx-module#lua_shared_dict | shared memo... 
[17:02:25] !log gehel@tin Started deploy [wdqs/wdqs@d07586c]: (no justification provided) [17:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:39] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/76f6c8bb2fb270287af23414fc4f5e17d3fe5402b1a03e73d3511b999ec24033/shm is not accessible: Permission denied [17:03:51] !log gehel@tin Finished deploy [wdqs/wdqs@d07586c]: (no justification provided) (duration: 01m 26s) [17:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:29] SMalyshev: deployment completed, tests are green... [17:04:42] gehel: thank you! [17:05:06] SMalyshev: my pleasure! [17:05:21] gehel: http://tinyurl.com/mdnsn3v works now [17:05:39] which means federation fixes work [17:05:45] cool [17:05:49] kool! That looks really interesting! [17:06:16] * gehel has to admit that he would not know what to do with wdqs federation, but he trusts that a few other people will enjoy this [17:07:39] 06Operations, 10Ops-Access-Requests, 10Gerrit: archiva-deploy password for Chad H. - https://phabricator.wikimedia.org/T161067#3134634 (10RobH) 05Open>03Resolved a:03RobH I've emailed the password for archiva-deploy to chad via gpg encryption. 
[17:11:39] RECOVERY - Disk space on copper is OK: DISK OK [17:13:49] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to deploy hosts for musikanimal - https://phabricator.wikimedia.org/T161181#3134668 (10Dzahn) a:03Dzahn [17:14:37] (03PS2) 10Dzahn: admins: add musikanimal to deployers [puppet] - 10https://gerrit.wikimedia.org/r/344734 (https://phabricator.wikimedia.org/T161181) [17:15:06] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to deploy hosts for musikanimal - https://phabricator.wikimedia.org/T161181#3123873 (10Dzahn) access request has been approved in today's ops meeting [17:16:51] (03CR) 10Dzahn: [C: 032] admins: add musikanimal to deployers [puppet] - 10https://gerrit.wikimedia.org/r/344734 (https://phabricator.wikimedia.org/T161181) (owner: 10Dzahn) [17:17:03] (03PS3) 10EBernhardson: [WIP] Upgrade logstash to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) [17:17:05] (03PS3) 10EBernhardson: [WIP] Update elk stack to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344965 (https://phabricator.wikimedia.org/T154473) [17:18:38] 06Operations, 10ops-codfw: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3134685 (10fgiunchedi) [17:19:14] !log tin/mira: welcome new mediawiki deployer 'musikanimal' (T161181) [17:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:20] T161181: Requesting access to deploy hosts for musikanimal - https://phabricator.wikimedia.org/T161181 [17:20:06] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: replace fluorine with mwlog servers (was: Upgrade fluorine to trusty/jessie) - https://phabricator.wikimedia.org/T123728#3134717 (10fgiunchedi) >>! In T123728#3133379, @fgiunchedi wrote: > I'm afraid we've lost the historical data for coal as both graphite... 
[17:20:09] Krinkle: ^ [17:21:06] (03CR) 10Mobrovac: [C: 04-1] "Unfortunately, this will not achieve what you want. the service Puppet define defaults to ensure => running, and it is not possible to tel" [puppet] - 10https://gerrit.wikimedia.org/r/344961 (owner: 10Subramanya Sastry) [17:21:08] (03PS4) 10EBernhardson: [WIP] Upgrade logstash to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) [17:21:10] (03PS4) 10EBernhardson: [WIP] Update elk stack to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344965 (https://phabricator.wikimedia.org/T154473) [17:21:21] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to deploy hosts for musikanimal - https://phabricator.wikimedia.org/T161181#3134722 (10Dzahn) @MusikAnimal Done. Welcome to the deployers group! ``` eqiad: [tin:~] $ id musikanimal uid=11106(musikanimal) gid=500(wikidev) groups=50... [17:21:39] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to deploy hosts for musikanimal - https://phabricator.wikimedia.org/T161181#3134723 (10Dzahn) 05Open>03Resolved [17:21:59] PROBLEM - puppet last run on copper is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:22:08] 06Operations, 10Ops-Access-Requests: Requesting access to deploy hosts for musikanimal - https://phabricator.wikimedia.org/T161181#3123873 (10Dzahn) [17:23:21] 06Operations, 10Ops-Access-Requests: Requesting access to deploy hosts for musikanimal - https://phabricator.wikimedia.org/T161181#3123873 (10Dzahn) You should be able to ssh to tin.eqiad.wmnet and mira.codfw.wmnet, via the bastions as you already did for other hosts, now. Let us know if any questions or probl... 
[17:23:57] (03PS5) 10EBernhardson: [WIP] Upgrade logstash to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) [17:23:59] (03PS5) 10EBernhardson: [WIP] Update elk stack to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344965 (https://phabricator.wikimedia.org/T154473) [17:25:14] 06Operations, 10Ops-Access-Requests, 06Performance-Team, 13Patch-For-Review: Grant perf-roots access to tungsten - https://phabricator.wikimedia.org/T161261#3134733 (10Dzahn) a:03Dzahn [17:25:19] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [17:29:19] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 48954.304139 Seconds [17:29:19] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 48954.306994 Seconds [17:29:59] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 48993.183568 Seconds [17:30:37] gehel: ^^^ [17:31:32] 06Operations, 10Ops-Access-Requests: Requesting access to maps servers for pnorman - https://phabricator.wikimedia.org/T161274#3134752 (10Dzahn) This access request has been approved in today's ops meeting. [17:31:41] 06Operations, 10Ops-Access-Requests: Requesting access to maps servers for pnorman - https://phabricator.wikimedia.org/T161274#3134753 (10Dzahn) a:03Dzahn [17:32:19] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [17:32:19] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [17:32:59] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [17:36:02] volans: thanks! Fixed itself ? [17:36:14] I didn't do anything :) [17:37:16] how frequent is the alarm? can accumulate 50ks delay between checks? 
if not, is the DB still behind and postgres has a similar issue to the mysql one that used to flap the replica delay between real value and zero? [17:38:18] * gehel is wondering how we can suddenly have 13 hours of lag... [17:39:06] 06Operations, 06Labs: Add monitoring for nfs-exportd on active labstore specifically - https://phabricator.wikimedia.org/T160838#3134759 (10chasemp) 05Open>03Resolved [17:39:07] * gehel needs to check if this is related to the OSM replication [17:39:29] basically, that database is written to only once a day (we should increase that to each hour) [17:40:43] if the alarm aligns to when you write to it and you write in a single big transaction... maybe it could somehow make sense :) [17:41:45] still, 13 hours looks suspicious. I do hope that we don't have a transaction running for 13 hours! [17:41:54] and that lag is below that... [17:43:03] and it does not even match with the update cron, which runs at 1:27 UTC [17:44:27] if it does something before starting to write to the DB... 1:27 + some work + 13h of writing == 13h of delay :D [17:45:08] nope, the update script runs for ~1h [17:46:35] 06Operations, 10ops-codfw: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3134685 (10RobH) A potential issue is when the new spare system is booted up, it will attempt to use the IP of the existing graphite2001 system, so the below checklist should be followed. Si... 
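The flapping discussed between 17:37 and 17:45 (lag jumping between 0 and roughly 13 hours on a database that is written to about once a day) would be consistent with a check that measures the age of the last replayed transaction. The Python sketch below is only an illustration. It assumes the check reports zero when the replica has replayed everything it received, and otherwise reports the current time minus the last replayed commit time. The function name, parameter names and positions are hypothetical and are not taken from the actual monitoring check.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def replication_lag_seconds(
    now: datetime,
    last_replay_commit: Optional[datetime],
    received_position: int,
    replayed_position: int,
) -> float:
    """Hypothetical lag estimate based on the age of the last replayed commit."""
    # Everything received has been replayed: treat the replica as caught up.
    if received_position == replayed_position:
        return 0.0
    # Nothing has ever been replayed, so there is no timestamp to compare with.
    if last_replay_commit is None:
        return 0.0
    # Otherwise report how old the last replayed commit is.
    return (now - last_replay_commit).total_seconds()


# Illustrative values only: a primary whose last commit was 13h36m ago.
now = datetime(2017, 3, 27, 17, 29, 19, tzinfo=timezone.utc)
last_commit = now - timedelta(hours=13, minutes=36)

idle = replication_lag_seconds(now, last_commit, 1000, 1000)
catching_up = replication_lag_seconds(now, last_commit, 1016, 1000)
print(idle, catching_up)
```

Under that assumption, a replica of a rarely written primary reports 0 while idle and briefly reports the full age of the last commit as soon as any new data arrives, which could look like a sudden 13-hour lag without any 13-hour transaction. Whether the real check behaves this way would need to be confirmed against its source.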
[17:46:49] 06Operations, 10ops-codfw: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3134770 (10RobH) [17:48:06] (03Abandoned) 10Reedy: hhvm.server.stat_cache = false [puppet] - 10https://gerrit.wikimedia.org/r/341916 (https://phabricator.wikimedia.org/T158176) (owner: 10Reedy) [17:52:47] (03PS1) 10Gehel: maps - keep planet sync logs for 30 days [puppet] - 10https://gerrit.wikimedia.org/r/344974 [17:53:23] (03PS2) 10Gehel: maps - keep planet sync logs for 30 days [puppet] - 10https://gerrit.wikimedia.org/r/344974 (https://phabricator.wikimedia.org/T161542) [17:54:07] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: upgrade netmon1001 to jessie - https://phabricator.wikimedia.org/T125020#3134814 (10Dzahn) checked that the backups now appear on helium (bacula server): how to restore: ``` [helium:~] $ sudo bconsole Connecting to Director helium.eqiad.wmnet:9101 1000... [17:57:01] (03PS6) 10EBernhardson: [WIP] Upgrade logstash to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) [17:57:03] (03PS6) 10EBernhardson: [WIP] Update elk stack to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344965 (https://phabricator.wikimedia.org/T154473) [17:58:26] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Upgrade logstash to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) (owner: 10EBernhardson) [18:00:04] 06Operations, 10Ops-Access-Requests: Requesting access to maps servers for pnorman - https://phabricator.wikimedia.org/T161274#3134834 (10RStallman-legalteam) @ Dzhan - I have a record of May 31, 2017 as the current contract expiration. However, legal did not finalize this contract. Depending on the arrangemen... 
[18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170327T1800). Please do the needful. [18:02:03] !log mobrovac@tin Started deploy [mobileapps/deploy@92f693c]: Remove the proxy from the config [18:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:49] (03PS7) 10EBernhardson: [WIP] Upgrade logstash to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) [18:02:51] (03PS7) 10EBernhardson: [WIP] Update elk stack to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344965 (https://phabricator.wikimedia.org/T154473) [18:04:14] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Upgrade logstash to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) (owner: 10EBernhardson) [18:05:33] !log mobrovac@tin Finished deploy [mobileapps/deploy@92f693c]: Remove the proxy from the config (duration: 03m 29s) [18:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:59] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: upgrade netmon1001 to jessie - https://phabricator.wikimedia.org/T125020#3134864 (10Dzahn) We were talking about maybe making a wrapper script for the above that makes this less involved for the most common cases. 10:47 I think the common use case... 
[18:07:46] !log mobrovac@tin Started deploy [mobileapps/deploy@92f693c]: Remove the proxy from the config, deploying to scb2004 [18:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:08] (03CR) 10Gehel: [WIP] Upgrade logstash to 5.x (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) (owner: 10EBernhardson) [18:08:29] !log mobrovac@tin Finished deploy [mobileapps/deploy@92f693c]: Remove the proxy from the config, deploying to scb2004 (duration: 00m 43s) [18:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:12] (03CR) 10MaxSem: [C: 031] maps - keep planet sync logs for 30 days [puppet] - 10https://gerrit.wikimedia.org/r/344974 (https://phabricator.wikimedia.org/T161542) (owner: 10Gehel) [18:23:29] (03PS2) 10Giuseppe Lavagetto: realm: remove rb_site [puppet] - 10https://gerrit.wikimedia.org/r/344960 [18:24:07] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] realm: remove rb_site [puppet] - 10https://gerrit.wikimedia.org/r/344960 (owner: 10Giuseppe Lavagetto) [18:29:10] (03PS1) 10Madhuvishy: nfs: Add functionality to create home and project dirs for new projects in misc [puppet] - 10https://gerrit.wikimedia.org/r/344982 (https://phabricator.wikimedia.org/T158883) [18:31:29] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [18:36:27] (03PS2) 10Dzahn: admins: create shell account for Goran S. Milovanovic [puppet] - 10https://gerrit.wikimedia.org/r/344735 (https://phabricator.wikimedia.org/T160980) [18:37:45] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3117127 (10Dzahn) Thanks. I have updated https://gerrit.wikimedia.org/r/#/c/344735/2/modules/admin/data/data.yaml to use... 
[18:40:13] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3134964 (10Dzahn) >>! In T160980#3128339, @RobH wrote: > Thanks! Please note we still have to have legal sign off on thi... [18:40:15] (03PS2) 10Madhuvishy: nfs: Add functionality to create home and project dirs for new projects in misc [puppet] - 10https://gerrit.wikimedia.org/r/344982 (https://phabricator.wikimedia.org/T158883) [18:42:08] (03PS1) 10Ottomata: Add cron to prune old druid indexer logs [puppet] - 10https://gerrit.wikimedia.org/r/344983 (https://phabricator.wikimedia.org/T155491) [18:42:11] (03PS4) 10EBernhardson: Update logstash plugins for 5.x [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/344704 [18:43:21] (03CR) 10jerkins-bot: [V: 04-1] Add cron to prune old druid indexer logs [puppet] - 10https://gerrit.wikimedia.org/r/344983 (https://phabricator.wikimedia.org/T155491) (owner: 10Ottomata) [18:46:36] (03PS2) 10Ottomata: Add cron to prune old druid indexer logs [puppet] - 10https://gerrit.wikimedia.org/r/344983 (https://phabricator.wikimedia.org/T155491) [18:48:32] (03CR) 10Ottomata: [C: 032] Add cron to prune old druid indexer logs [puppet] - 10https://gerrit.wikimedia.org/r/344983 (https://phabricator.wikimedia.org/T155491) (owner: 10Ottomata) [18:49:26] (03PS3) 10Madhuvishy: nfs: Add functionality to create home and project dirs for new projects in misc [puppet] - 10https://gerrit.wikimedia.org/r/344982 (https://phabricator.wikimedia.org/T158883) [18:50:17] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3134989 (10RStallman-legalteam) Yes, Goran's NDA is on file. Thanks! 
[18:50:27] (03PS1) 10Ottomata: Need minute => 0 on daily cron job [puppet] - 10https://gerrit.wikimedia.org/r/344985 [18:51:10] (03CR) 10Ottomata: [V: 032 C: 032] Need minute => 0 on daily cron job [puppet] - 10https://gerrit.wikimedia.org/r/344985 (owner: 10Ottomata) [18:55:43] (03PS3) 10Dzahn: admins: create shell account for Goran S. Milovanovic [puppet] - 10https://gerrit.wikimedia.org/r/344735 (https://phabricator.wikimedia.org/T160980) [18:58:07] (03PS4) 10Madhuvishy: nfs: Add functionality to create home and project dirs for new projects in misc [puppet] - 10https://gerrit.wikimedia.org/r/344982 (https://phabricator.wikimedia.org/T158883) [18:58:19] (03PS4) 10Dzahn: admins: create shell account for Goran S. Milovanovic [puppet] - 10https://gerrit.wikimedia.org/r/344735 (https://phabricator.wikimedia.org/T160980) [18:58:48] (03PS5) 10Dzahn: admins: create shell account for Goran S. Milovanovic [puppet] - 10https://gerrit.wikimedia.org/r/344735 (https://phabricator.wikimedia.org/T160980) [19:00:17] (03PS6) 10Dzahn: admins: create shell account for Goran S. Milovanovic [puppet] - 10https://gerrit.wikimedia.org/r/344735 (https://phabricator.wikimedia.org/T160980) [19:00:25] (03PS7) 10Dzahn: admins: create shell account for Goran S. Milovanovic [puppet] - 10https://gerrit.wikimedia.org/r/344735 (https://phabricator.wikimedia.org/T160980) [19:08:23] (03CR) 10Dzahn: [C: 032] admins: create shell account for Goran S. 
Milovanovic [puppet] - 10https://gerrit.wikimedia.org/r/344735 (https://phabricator.wikimedia.org/T160980) (owner: 10Dzahn) [19:09:20] (03PS1) 10Urbanecm: Allow eliminators and autoreviewers to move a file on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344989 (https://phabricator.wikimedia.org/T161532) [19:09:50] (03CR) 10EBernhardson: [WIP] Upgrade logstash to 5.x (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) (owner: 10EBernhardson) [19:10:04] (03PS8) 10EBernhardson: [WIP] Upgrade logstash to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) [19:10:06] (03PS8) 10EBernhardson: [WIP] Update elk stack to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344965 (https://phabricator.wikimedia.org/T154473) [19:10:50] (03CR) 10Dzahn: "user has been created on bastion host(s). you should already be able to ssh to bast1001.wikimedia.org. now we can add you to additional gr" [puppet] - 10https://gerrit.wikimedia.org/r/344735 (https://phabricator.wikimedia.org/T160980) (owner: 10Dzahn) [19:11:56] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Upgrade logstash to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) (owner: 10EBernhardson) [19:12:21] (03PS1) 10Ottomata: Prune old druid request logs [puppet] - 10https://gerrit.wikimedia.org/r/344990 (https://phabricator.wikimedia.org/T155491) [19:12:23] (03PS1) 10Urbanecm: [cleanup] Remove expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344991 (https://phabricator.wikimedia.org/T161530) [19:13:28] (03PS2) 10Ottomata: Prune old druid request logs [puppet] - 10https://gerrit.wikimedia.org/r/344990 (https://phabricator.wikimedia.org/T155491) [19:13:54] (03PS9) 10EBernhardson: [WIP] Upgrade logstash to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) [19:13:56] (03PS9) 10EBernhardson: [WIP] Update elk 
stack to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344965 (https://phabricator.wikimedia.org/T154473) [19:15:06] (03PS5) 10Rush: nfs: Add functionality to create home and project dirs for new projects in misc [puppet] - 10https://gerrit.wikimedia.org/r/344982 (https://phabricator.wikimedia.org/T158883) (owner: 10Madhuvishy) [19:15:14] (03CR) 10Rush: [C: 031] nfs: Add functionality to create home and project dirs for new projects in misc [puppet] - 10https://gerrit.wikimedia.org/r/344982 (https://phabricator.wikimedia.org/T158883) (owner: 10Madhuvishy) [19:15:24] (03CR) 10Ottomata: [C: 032] Prune old druid request logs [puppet] - 10https://gerrit.wikimedia.org/r/344990 (https://phabricator.wikimedia.org/T155491) (owner: 10Ottomata) [19:16:23] cmjohnson1: ping on https://phabricator.wikimedia.org/T159632, how about this week? :) [19:16:45] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3135083 (10Dzahn) Your user has been created on the [[ https://wikitech.wikimedia.org/wiki/Bastion | bastion host ]](s).... [19:19:00] (03PS6) 10Madhuvishy: nfs: Add functionality to create home and project dirs for new projects in misc [puppet] - 10https://gerrit.wikimedia.org/r/344982 (https://phabricator.wikimedia.org/T158883) [19:19:02] (03CR) 10Madhuvishy: [C: 032] nfs: Add functionality to create home and project dirs for new projects in misc [puppet] - 10https://gerrit.wikimedia.org/r/344982 (https://phabricator.wikimedia.org/T158883) (owner: 10Madhuvishy) [19:19:26] (03CR) 10Madhuvishy: [V: 032 C: 032] nfs: Add functionality to create home and project dirs for new projects in misc [puppet] - 10https://gerrit.wikimedia.org/r/344982 (https://phabricator.wikimedia.org/T158883) (owner: 10Madhuvishy) [19:20:48] ottamata: I will take care of it tomorrow morning. 
The swift servers are in but of course being HP they don't work unless you waste another 2 weeks trying to figure them out [19:20:58] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3135102 (10Dzahn) [19:22:40] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3117127 (10Dzahn) [19:25:39] (03PS1) 10Dzahn: admin: add goransm to researchers,analytics-wmde,analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/344992 (https://phabricator.wikimedia.org/T160980) [19:26:19] 06Operations, 06Performance-Team, 10Thumbor: Nginx timeouts on Thumbor - https://phabricator.wikimedia.org/T150746#3135105 (10Gilles) [19:27:29] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [19:27:44] (03PS2) 10Dzahn: admin: add goransm to researchers,analytics-wmde,analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/344992 (https://phabricator.wikimedia.org/T160980) [19:28:36] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3135115 (10Dzahn) a:03Dzahn [19:29:25] 06Operations, 06Labs, 10wikitech.wikimedia.org, 07HHVM: Move wikitech (silver) to HHVM - https://phabricator.wikimedia.org/T98813#3135116 (10greg) [19:30:01] 06Operations, 06Labs, 10wikitech.wikimedia.org, 07HHVM: Move wikitech (silver) to HHVM - https://phabricator.wikimedia.org/T98813#1278203 (10greg) Added T161553 as a subtask per above comments. 
[19:30:46] 06Operations, 06Performance-Team, 10Thumbor: Nginx timeouts on Thumbor - https://phabricator.wikimedia.org/T150746#3135122 (10Gilles) So, in theory our nginx config retries on the next upstream: ``` # fallback to the next upstream at most once, and no longer than 30s proxy_next_upstream_time... [19:31:38] (03PS1) 10Madhuvishy: nfs: Enable mounting /data/project from nfs on project twl [puppet] - 10https://gerrit.wikimedia.org/r/344993 (https://phabricator.wikimedia.org/T159407) [19:34:32] (03CR) 10Dzahn: [C: 032] admin: add goransm to researchers,analytics-wmde,analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/344992 (https://phabricator.wikimedia.org/T160980) (owner: 10Dzahn) [19:38:12] 06Operations, 06Performance-Team, 10Thumbor: Nginx timeouts on Thumbor - https://phabricator.wikimedia.org/T150746#3135132 (10Gilles) Ah, there is a default for that directive: ``` proxy_next_upstream error timeout; ``` But maybe that's not enough for the situation these 504s run into? [19:43:14] (03PS1) 10Mobrovac: service::node: Do not use the proxy by default [puppet] - 10https://gerrit.wikimedia.org/r/344996 (https://phabricator.wikimedia.org/T97530) [19:46:04] 06Operations, 06Performance-Team, 10Thumbor: Nginx timeouts on Thumbor - https://phabricator.wikimedia.org/T150746#3135139 (10Gilles) > One should bear in mind that passing a request to the next server is only possible if nothing has been sent to a client yet. That is, if an error or timeout occurs in the mi... 
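The retry rule quoted in the messages above (fall back to the next upstream at most once, for no longer than 30s, on `error timeout`, and only if nothing has been sent to the client yet) can be modelled in a few lines. This Python sketch is a simplified illustration of that rule as described in the ticket comments, not nginx's implementation. The function name, parameter names and failure labels are hypothetical.

```python
from typing import Tuple


def may_retry_on_next_upstream(
    failure: str,
    bytes_sent_to_client: int,
    attempts_so_far: int,
    elapsed_seconds: float,
    retry_on: Tuple[str, ...] = ("error", "timeout"),
    max_attempts: int = 2,       # first try plus at most one fallback
    max_seconds: float = 30.0,   # overall time budget for retries
) -> bool:
    """Return True if the request may be passed to the next upstream."""
    # Once part of the response has reached the client, a retry is impossible.
    if bytes_sent_to_client > 0:
        return False
    # Only failure kinds listed in retry_on trigger a fallback.
    if failure not in retry_on:
        return False
    # Respect the cap on the total number of attempts.
    if attempts_so_far >= max_attempts:
        return False
    # Respect the overall time budget.
    return elapsed_seconds < max_seconds


print(may_retry_on_next_upstream("timeout", 0, 1, 5.0))
print(may_retry_on_next_upstream("timeout", 512, 1, 5.0))
print(may_retry_on_next_upstream("http_504", 0, 1, 5.0))
```

In this model a timeout in the middle of a transfer, or a 504 status returned by the upstream itself, is not retried with the default `error timeout` setting, which may be relevant to the intermittent 504s being investigated.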
[19:48:35] (03PS1) 10Legoktm: Linter: whitelist parsoid canaries too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344998 (https://phabricator.wikimedia.org/T160573) [19:49:34] (03CR) 10Arlolra: [C: 031] Linter: whitelist parsoid canaries too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344998 (https://phabricator.wikimedia.org/T160573) (owner: 10Legoktm) [19:49:42] jouncebot: next [19:49:42] In 0 hour(s) and 10 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170327T2000) [19:49:59] PROBLEM - puppet last run on mw1279 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:50:16] (03PS2) 10Mobrovac: service::node: Do not use the proxy by default [puppet] - 10https://gerrit.wikimedia.org/r/344996 (https://phabricator.wikimedia.org/T97530) [19:50:46] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3135148 (10Dzahn) Hi @GoranSMilovanovic Your shell account has been created and you have the requested groups. I confir... [19:51:29] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3135150 (10Dzahn) 05Open>03Resolved [19:51:40] 06Operations, 10Ops-Access-Requests: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3117127 (10Dzahn) [19:51:49] (03CR) 10Legoktm: [C: 032] Linter: whitelist parsoid canaries too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344998 (https://phabricator.wikimedia.org/T160573) (owner: 10Legoktm) [19:52:59] PROBLEM - puppet last run on mx1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:53:08] (03PS2) 10Dzahn: misc-varnish/parsoid-tests: remove parsoid-tests backend [puppet] - 10https://gerrit.wikimedia.org/r/343948 [19:53:15] (03CR) 10Mobrovac: "PCC OK - https://puppet-compiler.wmflabs.org/5929/" [puppet] - 10https://gerrit.wikimedia.org/r/344996 (https://phabricator.wikimedia.org/T97530) (owner: 10Mobrovac) [19:53:19] PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:53:59] PROBLEM - Apache HTTP on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:54:09] PROBLEM - HHVM rendering on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:54:26] (03Merged) 10jenkins-bot: Linter: whitelist parsoid canaries too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344998 (https://phabricator.wikimedia.org/T160573) (owner: 10Legoktm) [19:54:44] (03CR) 10jenkins-bot: Linter: whitelist parsoid canaries too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344998 (https://phabricator.wikimedia.org/T160573) (owner: 10Legoktm) [19:55:13] (03CR) 10Dzahn: [C: 032] misc-varnish/parsoid-tests: remove parsoid-tests backend [puppet] - 10https://gerrit.wikimedia.org/r/343948 (owner: 10Dzahn) [19:56:12] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Linter: whitelist parsoid canaries too - https://gerrit.wikimedia.org/r/#/c/344998/ - T160573 (duration: 00m 44s) [19:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:18] T160573: Special:LintErrors page had not been updated more than 90 mins after a page has been edited to fix errors - https://phabricator.wikimedia.org/T160573 [19:57:03] !log ruthenium/varnish misc - remove parsoid-tests.wikimedia.org server_name / backend - replaced by parsoid-rt-test and parsoid-vd-tests [19:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:23] (03Abandoned) 10Dzahn: admins: add goransm to 
researchers, analytics-wmde, analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/344736 (https://phabricator.wikimedia.org/T160980) (owner: 10Dzahn) [19:59:36] (03PS1) 10Gilles: Improve Thumbor nginx timeout settings [puppet] - 10https://gerrit.wikimedia.org/r/344999 (https://phabricator.wikimedia.org/T150746) [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170327T2000). Please do the needful. [20:00:30] (03CR) 10Gilles: "I applied this to thumbor1001 and the timeout errors completely stopped, while they kept happening on thumbor1002 where the nginx config w" [puppet] - 10https://gerrit.wikimedia.org/r/344999 (https://phabricator.wikimedia.org/T150746) (owner: 10Gilles) [20:00:56] nothing for ORES today [20:01:02] (maybe tomorrow) [20:01:59] RECOVERY - Disk space on prometheus1001 is OK: DISK OK [20:07:19] PROBLEM - puppet last run on labvirt1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:12:27] (03CR) 10Dzahn: url_downloader: convert to profile/role (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/344729 (owner: 10Dzahn) [20:14:05] !log arlolra@tin Started deploy [parsoid/deploy@371ba4f]: Updating Parsoid to 6eaad376 [20:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:59] RECOVERY - puppet last run on mw1279 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [20:18:29] (03PS2) 10Madhuvishy: nfs: Enable mounting /data/project from nfs on project twl [puppet] - 10https://gerrit.wikimedia.org/r/344993 (https://phabricator.wikimedia.org/T159407) [20:19:00] (03CR) 10Madhuvishy: [V: 032 C: 032] nfs: Enable mounting /data/project from nfs on project twl [puppet] - 10https://gerrit.wikimedia.org/r/344993 (https://phabricator.wikimedia.org/T159407) (owner: 10Madhuvishy) [20:21:11] !log arlolra@tin Finished deploy [parsoid/deploy@371ba4f]: Updating Parsoid to 6eaad376 (duration: 07m 06s) [20:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:59] RECOVERY - puppet last run on mx1001 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [20:23:46] (03PS1) 10Madhuvishy: nfs-manage-binds: Pass mounts as keyword arg [puppet] - 10https://gerrit.wikimedia.org/r/345005 [20:26:46] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: replace fluorine with mwlog servers (was: Upgrade fluorine to trusty/jessie) - https://phabricator.wikimedia.org/T123728#3135227 (10Krinkle) >>! In T123728#3133379, @fgiunchedi wrote: >In terms of restoring the data, is there anywhere else the timing data... 
[20:28:49] (03CR) 10Madhuvishy: [C: 032] nfs-manage-binds: Pass mounts as keyword arg [puppet] - 10https://gerrit.wikimedia.org/r/345005 (owner: 10Madhuvishy) [20:29:26] !log Updated Parsoid to 6eaad376 (T160599, T161178, T133267) [20:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:34] T160599: Parsoid's linter doesn't know about thumbtime parameter - https://phabricator.wikimedia.org/T160599 [20:29:35] T133267: Invalid conversion for external links containing ] in the description - https://phabricator.wikimedia.org/T133267 [20:29:35] T161178: Investigate gallery-related rendering diffs in gallery HTML - https://phabricator.wikimedia.org/T161178 [20:30:49] PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:33:33] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: replace fluorine with mwlog servers (was: Upgrade fluorine to trusty/jessie) - https://phabricator.wikimedia.org/T123728#3135277 (10Ottomata) > Graphite have no concept of time for incoming data, everything is "now". I'm pretty sure graphite does have a c... [20:33:47] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3135278 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms... 
[20:36:19] RECOVERY - puppet last run on labvirt1003 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [20:36:58] (PS5) Hashar: (WIP) compile authdns::config [puppet] - https://gerrit.wikimedia.org/r/343747 [20:39:22] (CR) Hashar: "I have:" [puppet] - https://gerrit.wikimedia.org/r/343747 (owner: Hashar) [20:39:49] (CR) jerkins-bot: [V: -1] (WIP) compile authdns::config [puppet] - https://gerrit.wikimedia.org/r/343747 (owner: Hashar) [20:59:49] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [21:00:04] dapatrick, bawolff, and Reedy: Dear anthropoid, the time has come. Please deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170327T2100). [21:00:31] (PS3) Andrew Bogott: Nova scheduler: Use relative cpu percentages when scheduling. [puppet] - https://gerrit.wikimedia.org/r/344689 (https://phabricator.wikimedia.org/T161006) [21:02:31] Operations, MediaWiki-extensions-PageAssessments, Wikimedia-General-or-Unknown, Patch-For-Review: foreachwikiindblist regular cronspam - https://phabricator.wikimedia.org/T159438#3135327 (kaldari) What a helpful email :P Unfortunately, there are no dblists that correspond to the wikis that PageAs... [21:04:49] RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.061 second response time [21:05:19] RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.617 second response time [21:05:59] RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 76113 bytes in 0.116 second response time [21:10:45] (PS4) Andrew Bogott: Nova scheduler: Use relative cpu percentages when scheduling. 
[puppet] - https://gerrit.wikimedia.org/r/344689 (https://phabricator.wikimedia.org/T161006) [21:12:36] (PS5) Andrew Bogott: Nova scheduler: Use relative cpu percentages when scheduling. [puppet] - https://gerrit.wikimedia.org/r/344689 (https://phabricator.wikimedia.org/T161006) [21:13:31] (PS1) Halfak: Adds hunspell-ko to ores:base [puppet] - https://gerrit.wikimedia.org/r/345016 [21:19:09] (PS6) Hashar: (WIP) compile authdns::config [puppet] - https://gerrit.wikimedia.org/r/343747 [21:19:19] Operations, MediaWiki-extensions-PageAssessments, Wikimedia-General-or-Unknown, Patch-For-Review: foreachwikiindblist regular cronspam - https://phabricator.wikimedia.org/T159438#3135403 (jcrespo) @kaldari You are thinking as a mediawiki developer. The way to solve such problems as an ops is to r... [21:20:45] (CR) jerkins-bot: [V: -1] (WIP) compile authdns::config [puppet] - https://gerrit.wikimedia.org/r/343747 (owner: Hashar) [21:27:16] (CR) Rush: [C: 1] "seems to work in labtest" [puppet] - https://gerrit.wikimedia.org/r/344689 (https://phabricator.wikimedia.org/T161006) (owner: Andrew Bogott) [21:27:58] !log disabling puppet on labvirt* and labcontrol* to stagger roll out of https://gerrit.wikimedia.org/r/#/c/344689/ [21:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:20] (CR) Hashar: "Under Jenkins: fixed copy of build files to the log directory." [puppet] - https://gerrit.wikimedia.org/r/343747 (owner: Hashar) [21:29:01] (CR) Andrew Bogott: [C: 2] Nova scheduler: Use relative cpu percentages when scheduling. 
[puppet] - https://gerrit.wikimedia.org/r/344689 (https://phabricator.wikimedia.org/T161006) (owner: Andrew Bogott) [21:30:43] (Abandoned) Andrew Bogott: Nova scheduler: Prefer virthosts with lower CPU usage [puppet] - https://gerrit.wikimedia.org/r/344051 (https://phabricator.wikimedia.org/T161006) (owner: Andrew Bogott) [21:37:57] (PS1) Volans: Add task to update Tendril [switchdc] - https://gerrit.wikimedia.org/r/345045 (https://phabricator.wikimedia.org/T160178) [21:41:06] (PS6) BBlack: varnish: refactor all clusters for active/active [puppet] - https://gerrit.wikimedia.org/r/339667 (https://phabricator.wikimedia.org/T134404) [21:45:49] (PS2) Volans: Add task to update Tendril [switchdc] - https://gerrit.wikimedia.org/r/345045 (https://phabricator.wikimedia.org/T160178) [21:52:23] (PS7) BBlack: varnish: refactor all clusters for active/active [puppet] - https://gerrit.wikimedia.org/r/339667 (https://phabricator.wikimedia.org/T134404) [21:55:17] Operations, Discovery, Elasticsearch, Wikimedia-Logstash, and 2 others: Update logstash on wikimedia to 5.x - https://phabricator.wikimedia.org/T154473#3135483 (EBernhardson) Logstash has been upgraded to 5.x on the beta cluster. Everything is prepped for elasticsearch and kibana to upgrade as we... 
[21:57:19] PROBLEM - Nginx local proxy to apache on mw1242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:57:29] PROBLEM - HHVM rendering on mw1242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:57:49] PROBLEM - Apache HTTP on mw1242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:00:01] (PS8) BBlack: varnish: refactor all clusters for active/active [puppet] - https://gerrit.wikimedia.org/r/339667 (https://phabricator.wikimedia.org/T134404) [22:00:48] !log deployed patch T151735 [22:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:29] (CR) BBlack: "PS8 is cherry-picked onto deployment-prep for upload + text cluster cases (misc+maps don't exist there). Found a couple simple fixups in " [puppet] - https://gerrit.wikimedia.org/r/339667 (https://phabricator.wikimedia.org/T134404) (owner: BBlack) [22:12:20] (CR) EBernhardson: [C: 1] [mwgrep] enable more accurate regex timeout [puppet] - https://gerrit.wikimedia.org/r/344925 (owner: DCausse) [22:16:27] (Abandoned) Halfak: Adds hunspell-ko to ores:base [puppet] - https://gerrit.wikimedia.org/r/345016 (owner: Halfak) [22:30:11] Operations: add support to offboard-user to support mailman list removal - https://phabricator.wikimedia.org/T161566#3135539 (RobH) [22:48:51] (PS1) Dzahn: admin: create shell account for Paul Norman [puppet] - https://gerrit.wikimedia.org/r/345066 (https://phabricator.wikimedia.org/T161274) [22:49:55] (PS2) Dzahn: admin: create shell account for Paul Norman [puppet] - https://gerrit.wikimedia.org/r/345066 (https://phabricator.wikimedia.org/T161274) [22:52:24] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to maps servers for pnorman - https://phabricator.wikimedia.org/T161274#3135609 (Dzahn) [22:53:47] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to maps servers for pnorman - 
https://phabricator.wikimedia.org/T161274#3127143 (Dzahn) Thank you @RStallman-legalteam (and also for the other approval earlier). I have added hr@wikimedia.org as contact for now. [22:54:09] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:57:44] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to maps servers for pnorman - https://phabricator.wikimedia.org/T161274#3135626 (Dzahn) >>! In T161274#3128369, @RobH wrote: > We need a few things for this to be granted: > > [x] - please determine (perhaps with @gehel) exactly wh... [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170327T2300). Please do the needful. [23:00:04] ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[23:00:55] looks like it's just me, will deploy [23:03:42] (CR) Pnorman: [C: 1] admin: create shell account for Paul Norman [puppet] - https://gerrit.wikimedia.org/r/345066 (https://phabricator.wikimedia.org/T161274) (owner: Dzahn) [23:06:10] !log ebernhardson@tin Synchronized php-1.29.0-wmf.17/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: SWAT T160006, turning off cirrussearch AB test for sistersearch (duration: 00m 44s) [23:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:17] T160006: turn off A/B test for displaying sister project search results (test #2) - https://phabricator.wikimedia.org/T160006 [23:22:00] Operations, Ops-Access-Requests, Performance-Team, Patch-For-Review: Grant perf-roots access to tungsten - https://phabricator.wikimedia.org/T161261#3135722 (Dzahn) @Robh @Krinkle @Muehlenhoff Ok, so i researched the history of this a bit and this access got lost over time. In https://gerrit.w... [23:23:09] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [23:26:05] (PS2) Dzahn: Add admin group perf-roots to role xhgui. [puppet] - https://gerrit.wikimedia.org/r/344531 (https://phabricator.wikimedia.org/T161261) (owner: Krinkle) [23:28:20] (CR) Krinkle: [C: 1] "Thanks!" 
[puppet] - https://gerrit.wikimedia.org/r/344531 (https://phabricator.wikimedia.org/T161261) (owner: Krinkle) [23:28:44] Operations, Ops-Access-Requests, Performance-Team, Patch-For-Review: Restore perf-roots access to xhgui (tungsten) - https://phabricator.wikimedia.org/T161261#3135743 (Krinkle) [23:29:59] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.495 second response time [23:34:59] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.847 second response time [23:37:39] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:42:59] PROBLEM - puppet last run on mc1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:45:19] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Traffic, and 2 others: Redo /beacon/impression system (formerly Special:RecordImpression) to remove extra round trips on all FR impressions (title was: S:RI should pyroperish) - https://phabricator.wikimedia.org/T45250#3135775 (K...