[00:19:33] PROBLEM - puppet last run on restbase-dev1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:21:03] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479
[00:22:03] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3435435 keys, up 7 days 8 hours - replication_delay is 0
[00:35:23] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:36:13] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[00:39:23] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:41:13] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.003 second response time
[00:48:33] RECOVERY - puppet last run on restbase-dev1002 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[00:56:11] (PS1) Dzahn: new profile/role for IRC server using charybdis (WIP) [puppet] - https://gerrit.wikimedia.org/r/345791 (https://phabricator.wikimedia.org/T134271)
[01:15:19] Operations, Wikimedia-IRC-RC-Server, Patch-For-Review: Replace ircd-ratbox with something newer/maintained - https://phabricator.wikimedia.org/T134271#3145647 (Dzahn) So our ratbox has this custom patch to disallow people creating channels unless they are oper. This https://github.com/wikimedia/oper...
[01:22:59] Operations, Wikimedia-IRC-RC-Server, Patch-For-Review: Replace ircd-ratbox with something newer/maintained - https://phabricator.wikimedia.org/T134271#3145653 (Dzahn) 18:24 < amdj> for what you want, you'll be best served by loading extensions/createoperonly, putting +g in default umodes, and putting...
[01:23:43] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:26:33] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 1.483 second response time
[01:29:43] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 83560.737166 Seconds
[01:29:43] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 83635.155794 Seconds
[01:29:43] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 83635.165194 Seconds
[01:29:43] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 83635.196244 Seconds
[01:29:53] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 83568.299499 Seconds
[01:29:53] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 83571.83622 Seconds
[01:33:24] PROBLEM - puppet last run on wtp1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:33:24] PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:33:53] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[01:34:43] PROBLEM - Disk space on thumbor1002 is CRITICAL: DISK CRITICAL - free space: / 1728 MB (3% inode=97%)
[01:34:43] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[01:36:53] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 83988.467006 Seconds
[01:37:43] PROBLEM - Disk space on thumbor1002 is CRITICAL: DISK CRITICAL - free space: / 1591 MB (3% inode=97%)
[01:37:43] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 84040.768862 Seconds
[01:39:53] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[01:42:53] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 84348.386493 Seconds
[01:48:46] (CR) Krinkle: [C: 1] noc: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345553 (owner: Faidon Liambotis)
[01:48:50] :q
[01:49:59] (CR) Krinkle: [C: 1] vagrant: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345563 (owner: Faidon Liambotis)
[01:56:43] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 26.977383 Seconds
[01:56:43] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 25.762525 Seconds
[01:56:43] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 25.792339 Seconds
[01:56:44] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 25.835469 Seconds
[01:56:53] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 34.554992 Seconds
[01:56:53] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 37.787918 Seconds
[02:02:23] RECOVERY - puppet last run on wtp1019 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[02:02:23] RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[02:24:07] (CR) Dzahn: [C: 2] noc: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345553 (owner: Faidon Liambotis)
[02:25:57] (CR) Dzahn: [C: 1] ci: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345557 (owner: Faidon Liambotis)
[02:26:57] (CR) Dzahn: [C: 1] ganglia: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345551 (owner: Faidon Liambotis)
[02:27:57] (PS2) Dzahn: ganglia: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345551 (owner: Faidon Liambotis)
[02:29:46] (CR) Dzahn: [C: 2] ganglia: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345551 (owner: Faidon Liambotis)
[02:29:54] (PS3) Dzahn: ganglia: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345551 (owner: Faidon Liambotis)
[02:35:23] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:35:43] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:35:46] (PS2) Dzahn: noc: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345553 (owner: Faidon Liambotis)
[02:36:23] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 6.588 second response time
[02:36:33] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.003 second response time
[02:37:42] (CR) Dzahn: [C: 1] puppetmaster: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345554 (owner: Faidon Liambotis)
[02:38:41] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.18) (duration: 15m 54s)
[02:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:44:11] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Mar 31 02:44:11 UTC 2017 (duration 5m 30s)
[02:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:47:23] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:48:43] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:49:13] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[02:49:33] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.003 second response time
[02:54:23] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:54:43] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:56:13] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[02:56:33] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[02:56:43] (CR) Faidon Liambotis: [C: -1] "The sysctl part is wrong. It needs to be "ubuntu >= trusty", as I mentioned before as it's still a Ubuntu-specific change." [puppet] - https://gerrit.wikimedia.org/r/345366 (owner: Dzahn)
[03:05:23] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:07:43] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:08:23] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 9.966 second response time
[03:10:23] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:10:33] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 5.283 second response time
[03:12:43] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:13:33] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[03:14:13] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[03:28:43] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:31:23] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:31:43] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:32:23] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:34:13] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[03:41:23] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:42:13] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[03:47:13] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[03:47:33] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[03:57:43] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[03:57:53] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:08:23] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:08:43] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:13:13] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[04:13:33] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 3.382 second response time
[04:21:34] (Draft2) Felipe L. Ewald: Capitalizing the words dumps and images. [puppet] - https://gerrit.wikimedia.org/r/345795
[04:24:53] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[04:27:18] (PS3) Dzahn: base: sysctl/check_puppetrun: remove precise remnants [puppet] - https://gerrit.wikimedia.org/r/345366
[04:27:23] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:28:13] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[04:28:32] (CR) jerkins-bot: [V: -1] base: sysctl/check_puppetrun: remove precise remnants [puppet] - https://gerrit.wikimedia.org/r/345366 (owner: Dzahn)
[04:31:02] (PS4) Dzahn: base: sysctl/check_puppetrun: remove precise remnants [puppet] - https://gerrit.wikimedia.org/r/345366
[04:31:04] (PS3) Dzahn: noc: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345553 (owner: Faidon Liambotis)
[04:32:38] (PS3) Dzahn: dataset: Capitalizing the words dumps and images. [puppet] - https://gerrit.wikimedia.org/r/345795 (owner: Felipe L. Ewald)
[04:32:51] (CR) Dzahn: [C: 1] dataset: Capitalizing the words dumps and images. [puppet] - https://gerrit.wikimedia.org/r/345795 (owner: Felipe L. Ewald)
[04:37:23] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:37:43] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:38:33] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 4.161 second response time
[04:38:45] (CR) Dzahn: [C: 1] hhvm: kill a precise reference [puppet] - https://gerrit.wikimedia.org/r/345547 (owner: Faidon Liambotis)
[04:40:28] (CR) Dzahn: [C: 1] elasticsearch: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345552 (owner: Faidon Liambotis)
[04:42:43] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:50:58] (CR) Krinkle: [C: 1] ci: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345557 (owner: Faidon Liambotis)
[04:51:13] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[04:51:33] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[05:50:22] (CR) BryanDavis: [C: 1] Keystone 2fa: Use the wikitech API rather than checking the db directly. [puppet] - https://gerrit.wikimedia.org/r/345231 (owner: Andrew Bogott)
[06:01:53] PROBLEM - puppet last run on mw1183 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:04:23] PROBLEM - puppet last run on notebook1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:13:54] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[06:16:33] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[06:18:37] <_joe_> this seems to be ^^ mostly 500s on the API
[06:19:33] <_joe_> and a spike we just passed
[06:20:56] (PS2) Marostegui: mysql-predump.erb: Reduce the number of jobs [puppet] - https://gerrit.wikimedia.org/r/345372
[06:23:09] (CR) Marostegui: [C: 2] mysql-predump.erb: Reduce the number of jobs [puppet] - https://gerrit.wikimedia.org/r/345372 (owner: Marostegui)
[06:26:53] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[06:29:53] RECOVERY - puppet last run on mw1183 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:32:23] RECOVERY - puppet last run on notebook1001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:32:43] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 13997.236222 Seconds
[06:32:44] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 13997.264261 Seconds
[06:32:44] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 13997.266008 Seconds
[06:35:43] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds
[06:35:43] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds
[06:35:43] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds
[06:37:33] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:39:53] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:45:33] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[06:45:53] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[06:50:33] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:50:53] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:00:03] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave]
[07:00:03] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave]
[07:02:22] (CR) Muehlenhoff: [C: 1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/345550 (owner: Faidon Liambotis)
[07:02:56] (CR) Muehlenhoff: [C: 1] hhvm: kill a precise reference [puppet] - https://gerrit.wikimedia.org/r/345547 (owner: Faidon Liambotis)
[07:03:57] (CR) Muehlenhoff: [C: 1] mediawiki: kill more precise references [puppet] - https://gerrit.wikimedia.org/r/345546 (owner: Faidon Liambotis)
[07:06:22] (CR) Muehlenhoff: [C: 1] apache: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345548 (owner: Faidon Liambotis)
[07:10:03] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[07:10:04] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0
[07:21:36] Operations, Discovery: Recurrent Postgres replication lag for codfw maps hosts - https://phabricator.wikimedia.org/T161870#3145779 (elukey)
[07:28:42] <_joe_> Amir1: around?
[07:28:59] _joe_: hey, yes
[07:29:02] <_joe_> I wanted your opinion on the renewed https://gerrit.wikimedia.org/r/#/c/316317/
[07:29:29] !log Start pt-table-checksum on s5 wikidatawiki - T161294
[07:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:37] T161294: run pt-tablechecksum on s5 - https://phabricator.wikimedia.org/T161294
[07:31:43] _joe_: It looks good but are you sure of https://gerrit.wikimedia.org/r/#/c/316317/3/wmf-config/CommonSettings-labs.php,unified ?
[07:32:22] <_joe_> Amir1: I'll recheck, but I am pretty sure it works
[07:32:36] <_joe_> as we already define that variable in commonsettings.php
[07:32:44] <_joe_> but let me re-check all that just to be sure
[07:33:29] Thanks. If that's okay, feel free to merge :)
[07:33:45] Operations, ops-eqiad, Analytics, Analytics-Cluster, User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#3145801 (elukey) I checked today mcelog and the thermal errors stopped right after Chris applied the thermal paste! New list of affect...
[07:34:11] (CR) Hashar: [C: 1] ci: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345557 (owner: Faidon Liambotis)
[07:34:15] <_joe_> Amir1: yes, in labs LabsServices.php is included at line 103 of commonsettings.php
[07:34:32] <_joe_> so when it defines the ORES variables later, they are already correct
[07:34:35] Awesome
[07:35:54] Operations, ops-eqiad: Heating alerts for mw servers in eqiad - https://phabricator.wikimedia.org/T149287#3145815 (elukey) In T132256 Chris applied the thermal paste to analytics1039 and the alerts stopped, so it seems the right way to go. Since the number of appservers is big we might need to schedule i...
[07:37:37] (CR) Jcrespo: "I would love to delete all stuff, but mforns didn't let me here: https://phabricator.wikimedia.org/T108850#2463961 (see full ticket)." [puppet] - https://gerrit.wikimedia.org/r/345646 (https://phabricator.wikimedia.org/T124307) (owner: Ottomata)
[07:38:42] (CR) Hashar: [C: 1] "Upstream has already approved/merged Paladox change in their repo." [puppet] - https://gerrit.wikimedia.org/r/345583 (owner: Paladox)
[07:41:58] (CR) Hashar: [C: 1] "Nice!" [puppet] - https://gerrit.wikimedia.org/r/345561 (owner: Faidon Liambotis)
[07:50:19] (PS1) Hashar: package_builder: remove Precise images [puppet] - https://gerrit.wikimedia.org/r/345801
[07:51:09] Operations, ops-eqiad, Analytics, Analytics-Cluster, User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#3145858 (elukey)
[07:51:13] Operations, Monitoring: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3145857 (elukey)
[07:52:53] PROBLEM - puppet last run on dbproxy1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:55:57] (CR) Hashar: [C: 1] "Puppet compilation against copper.eqiad.wmnet https://puppet-compiler.wmflabs.org/5980/copper.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/345801 (owner: Hashar)
[07:58:40] (PS2) Hashar: package_builder: remove Precise images [puppet] - https://gerrit.wikimedia.org/r/345801
[08:10:40] (CR) Elukey: "rebase" [puppet] - https://gerrit.wikimedia.org/r/333880 (owner: Elukey)
[08:10:45] (PS15) Elukey: role::memcached: refactor in multiple profiles [puppet] - https://gerrit.wikimedia.org/r/333880
[08:12:46] _joe_ if you are ok I'd merge --^
[08:12:55] (disabling puppet beforehand just in case)
[08:13:01] <_joe_> elukey: yeah do it :)
[08:13:26] <_joe_> elukey: cumin 'R:class = role::memcached' 'puppet agent --disable "change ongoing --elukey"'
[08:14:15] (PS1) Giuseppe Lavagetto: parsoid: cleanup hiera defs in codfw, eqiad [puppet] - https://gerrit.wikimedia.org/r/345803
[08:14:17] (PS1) Giuseppe Lavagetto: swift: use discovery url for thumb server [puppet] - https://gerrit.wikimedia.org/r/345804
[08:14:19] (PS1) Giuseppe Lavagetto: authdns: add discovery record for restbase-async [puppet] - https://gerrit.wikimedia.org/r/345805
[08:14:22] (PS1) Giuseppe Lavagetto: changeprop: switch to discovery url [puppet] - https://gerrit.wikimedia.org/r/345806
[08:14:53] ah nice!
[08:15:01] I was doing it with salt but this is better
[08:15:03] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave]
[08:15:03] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave]
[08:15:23] <_joe_> elukey: once you're done merging that, I'll fix things so that puppet doesn't fail on mc1019
[08:16:17] (CR) Elukey: [C: 2] role::memcached: refactor in multiple profiles [puppet] - https://gerrit.wikimedia.org/r/333880 (owner: Elukey)
[08:17:12] Operations, Traffic, Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#3145896 (ema) This has started happening on cache_text as well. First occurrance I'm aware of on text was on 2017-03-30 between [[https://grafana.wikimedia.or...
[08:17:51] (CR) Alexandros Kosiaris: [C: 2] base: sysctl/check_puppetrun: remove precise remnants [puppet] - https://gerrit.wikimedia.org/r/345366 (owner: Dzahn)
[08:17:57] (PS5) Alexandros Kosiaris: base: sysctl/check_puppetrun: remove precise remnants [puppet] - https://gerrit.wikimedia.org/r/345366 (owner: Dzahn)
[08:17:59] (CR) Alexandros Kosiaris: [V: 2 C: 2] base: sysctl/check_puppetrun: remove precise remnants [puppet] - https://gerrit.wikimedia.org/r/345366 (owner: Dzahn)
[08:19:03] PROBLEM - puppet last run on mc1019 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 39 seconds ago with 1 failures. Failed resources (up to 3 shown): File[/etc/redis/replica/]
[08:20:05] this one is not due to my changes, was already there --^
[08:20:26] mc2019 went fine (only ferm naming changes), proceeding with codfw
[08:21:33] <_joe_> elukey: yeah I'll fix that as soon as it's done
[08:21:53] RECOVERY - puppet last run on dbproxy1006 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[08:24:05] Operations, Monitoring: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3145907 (hashar) I gave it a try on `contint1001.wikimedia.org` and it alarmed out due to some entity not being present. That can be seen in very verbose output (-vv): ``` $ /usr/local/lib/nagios/plugins/c...
[08:28:52] !log repooled mw1261 for more HHVM 3.18 debugging
[08:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:03] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[08:33:04] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0
[08:46:27] (PS1) Jcrespo: Pool db1034 (with the right partit. & indexes) as the db-special-s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/345808 (https://phabricator.wikimedia.org/T159319)
[08:47:10] (CR) Marostegui: [C: 1] Pool db1034 (with the right partit. & indexes) as the db-special-s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/345808 (https://phabricator.wikimedia.org/T159319) (owner: Jcrespo)
[08:50:03] _joe_ all good! patch merged (took a bit since I played with Cumin and annoyed Riccardo :P)
[08:51:12] <_joe_> elukey: annoying riccardo is one of the goals we discussed for this FY, right?
[08:51:19] <_joe_> so you're doing good
[08:51:29] was not part of the 3y plan? :-P
[08:51:34] ah yes! We'll keep raising the standards then :P
[08:55:05] _joe_ I think that the puppet issue on mc1019 is also present in deployment-prep (deployment-memc04 for example)
[08:56:09] <_joe_> now i gotta make another one to finaly fix it
[08:56:19] (PS2) Gehel: Default /etc/elasticsearch to 0755 [puppet] - https://gerrit.wikimedia.org/r/345633 (owner: EBernhardson)
[08:56:24] it says Error 400 on SERVER: undefined method `each' for nil:NilClass at /etc/puppet/modules/profile/manifests/redis/multidc.pp:12
[08:57:01] ah different one
[08:57:11] from mc1019
[08:57:58] (CR) Gehel: [C: 2] Default /etc/elasticsearch to 0755 [puppet] - https://gerrit.wikimedia.org/r/345633 (owner: EBernhardson)
[09:05:13] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:05:32] (PS2) Giuseppe Lavagetto: parsoid: cleanup hiera defs in codfw, eqiad [puppet] - https://gerrit.wikimedia.org/r/345803
[09:05:33] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:05:48] (CR) Alexandros Kosiaris: "limn1 is now shuted down per https://phabricator.wikimedia.org/T143349 (it is in fact scheduled to be deleted today), I am removing my -2." [puppet] - https://gerrit.wikimedia.org/r/334155 (owner: Alexandros Kosiaris)
[09:05:54] (PS3) Alexandros Kosiaris: realm.pp: Remove the pre 3.5 puppet handling code [puppet] - https://gerrit.wikimedia.org/r/334155
[09:06:03] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient
[09:06:23] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:09:32] (CR) Giuseppe Lavagetto: [C: 2] parsoid: cleanup hiera defs in codfw, eqiad [puppet] - https://gerrit.wikimedia.org/r/345803 (owner: Giuseppe Lavagetto)
[09:11:01] (CR) Alexandros Kosiaris: [C: -1] Replace $main_ipaddress by $::interface_primary (3 comments) [puppet] - https://gerrit.wikimedia.org/r/345569 (owner: Faidon Liambotis)
[09:15:23] PROBLEM - puppet last run on mw1263 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:20:43] RECOVERY - Disk space on thumbor1002 is OK: DISK OK
[09:21:34] !log delete stray nginx error log with debug logging on thumbor1002
[09:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:30] (CR) Alexandros Kosiaris: "this is technically correct, but I would lie if I said I love it. As elukey points out we bind puppet to discovery which is not great." [puppet] - https://gerrit.wikimedia.org/r/345531 (owner: Giuseppe Lavagetto)
[09:32:47] (CR) Volans: "What elukey and akosiaris said. A more general solution IMHO would be to have the jobrunners be able to know if they should be active or p" [puppet] - https://gerrit.wikimedia.org/r/345531 (owner: Giuseppe Lavagetto)
[09:33:14] (CR) Jcrespo: [C: 2] Pool db1034 (with the right partit. & indexes) as the db-special-s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/345808 (https://phabricator.wikimedia.org/T159319) (owner: Jcrespo)
[09:34:38] (Merged) jenkins-bot: Pool db1034 (with the right partit. & indexes) as the db-special-s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/345808 (https://phabricator.wikimedia.org/T159319) (owner: Jcrespo)
[09:34:51] (CR) jenkins-bot: Pool db1034 (with the right partit. & indexes) as the db-special-s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/345808 (https://phabricator.wikimedia.org/T159319) (owner: Jcrespo)
[09:35:07] (PS2) Volans: Swift: use discovery record for the imagescalers in codfw [puppet] - https://gerrit.wikimedia.org/r/345656 (https://phabricator.wikimedia.org/T160178)
[09:35:21] (PS3) Volans: Swift: use discovery record for the imagescalers in codfw [puppet] - https://gerrit.wikimedia.org/r/345656 (https://phabricator.wikimedia.org/T160178)
[09:35:24] !log fix long-standing swift-account-server REPLICATE backtrace error on ms-be1022 - https://bugs.launchpad.net/swift/+bug/1424108
[09:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:58] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1062 (duration: 00m 45s)
[09:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:08] (PS1) Giuseppe Lavagetto: role::memcached: rationalize naming, split redis for sessions [puppet] - https://gerrit.wikimedia.org/r/345810
[09:40:33] PROBLEM - Apache HTTP on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:40:43] PROBLEM - Nginx local proxy to apache on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:41:03] PROBLEM - HHVM rendering on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:42:24] elukey ^^^ same of mw1191?
[09:43:04] (CR) Giuseppe Lavagetto: "> this is technically correct, but I would lie if I said I love it." [puppet] - https://gerrit.wikimedia.org/r/345531 (owner: Giuseppe Lavagetto)
[09:43:15] (PS1) Jcrespo: Depool db1066 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/345811 (https://phabricator.wikimedia.org/T159319)
[09:43:40] (CR) Marostegui: [C: 1] Depool db1066 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/345811 (https://phabricator.wikimedia.org/T159319) (owner: Jcrespo)
[09:44:23] RECOVERY - puppet last run on mw1263 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[09:44:29] volans: checking
[09:46:11] (CR) Jcrespo: [C: 2] Depool db1066 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/345811 (https://phabricator.wikimedia.org/T159319) (owner: Jcrespo)
[09:46:15] (CR) Giuseppe Lavagetto: "Also, please note we have a couple of other things depending on mw_primary. If we really think a puppet commit is cleaner than relying on " [puppet] - https://gerrit.wikimedia.org/r/345531 (owner: Giuseppe Lavagetto)
[09:46:37] <_joe_> akosiaris, volans, elukey please see my responses there ^^
[09:46:58] <_joe_> I honestly don't think that making puppet query etcd is worse than having a manual switch in hiera
[09:47:13] !log restart hhvm on mw1197 - hhvm dump debug in /tmp/hhvm.14540.bt. - threads stuck in Treadmill::getAgeOldestRequest (HHVM 3.12)
[09:47:17] <_joe_> I mean we are relying on that system for all of production atm.
[09:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:24] (03Merged) 10jenkins-bot: Depool db1066 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345811 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo) [09:47:32] (03CR) 10jenkins-bot: Depool db1066 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345811 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo) [09:48:23] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.045 second response time [09:48:33] RECOVERY - Nginx local proxy to apache on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.030 second response time [09:48:34] <_joe_> so I basically agree with what volans said (we should control the behaviour outside of puppet), but not with "a variable in hiera is better than quering the discovery system" [09:48:44] <_joe_> elukey: uhm it's happening a bit too often [09:48:53] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 79000 bytes in 0.544 second response time [09:48:57] <_joe_> elukey: are those still on 3.12 [09:48:58] <_joe_> ? [09:49:27] mw1197 yes [09:49:41] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1066 for maintenance (duration: 00m 44s) [09:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:48] I noticed that too that these alarms have increased [09:49:51] (03PS11) 10Gehel: maps - cleartables osm replication [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613) [09:49:52] yeah, only mw1261 is on 3.18 [09:50:13] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:50:21] <_joe_> more traffic? 
[09:51:04] I'm currently rebuilding our HHVM extensions against the deb package provided by Facebook to collect additional data points [09:51:36] _joe_: I'm not sure it is worth, is basically moving the technical debt from a hiera value to a "rather obscure" matching of the PTR of the active discovery DC with the IP of the appservers.svc.$dc.wmnet names. What is the advantage? [09:51:52] (03PS12) 10Gehel: maps - cleartables osm replication [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613) [09:52:08] <_joe_> volans: that you have one single knob to configure the active mediawiki dc [09:52:08] !log Adding rev_timestamp index to revision page db1066 (s1) - T132416 [09:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:15] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416 [09:54:14] _joe_: agree in theory but for the real usage we're disabling the jobrunner ahead of time and re-enabling them later, so if the use case is only for this I'm not really convinced is a real advantage. The jobrunners would need a much bigger refactor to really switch at the turn of knob ;) [09:56:16] _joe_, elukey for the crashes yesterday there was a release IIRC, might be related? [09:56:55] !log set pooled=yes mw210[56789], mw2260 and mw2213 (and cleaned up old /srv/mediawiki dirs that were causing rsync spam in scap pull) [09:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:26] volans: I have been seeing them during the past 3/4 days IIRC [09:57:36] elukey: ok [09:57:37] thanks [09:59:26] <_joe_> elukey: uh? [09:59:30] <_joe_> elukey: what was that? 
[09:59:43] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw2244.codfw.wmnet [09:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:53] <_joe_> also, thanks :) [10:00:27] _joe_ it is weird, scap pull says cannot delete non-empty directory: php-1.29.0-wmf.8 [10:01:08] <_joe_> elukey: that's true everywhere [10:01:21] <_joe_> elukey: you shouldn't do such things without asking releasers [10:01:35] <_joe_> nothing horrible, but notify them [10:01:48] <_joe_> (just noticed testing scap pull on mwdebug1001) [10:01:57] elukey: that is known https://phabricator.wikimedia.org/T157030 [10:02:19] the workaround is to: ( cd /srv/mediawiki-staging && scap clean 1.29.0-wmf.8 ) [10:02:38] that cleans up tin and the targets [10:03:04] PROBLEM - puppet last run on ms-be1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:03:23] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161836#3146209 (10Aklapper) More cases brought up: * https://commons.wikimedia.org/wiki/File:Fawiki500k_celebration_by_Behdad_Abedi_(180).jpg mentioned in https://co... [10:04:16] (03PS2) 10Giuseppe Lavagetto: role::memcached: rationalize naming, split redis for sessions [puppet] - 10https://gerrit.wikimedia.org/r/345810 [10:04:18] _joe_ true, I thought it was old stuff and just removed :/ [10:04:23] thanks hashar [10:04:25] :) [10:06:05] godog: when you have a minute I'd like you to have a look at https://gerrit.wikimedia.org/r/#/c/345656/ [10:06:09] (03CR) 10Giuseppe Lavagetto: [C: 032] role::memcached: rationalize naming, split redis for sessions [puppet] - 10https://gerrit.wikimedia.org/r/345810 (owner: 10Giuseppe Lavagetto) [10:06:12] (1 line ) [10:06:21] <_joe_> volans: I think he has his hands full atm [10:06:32] I know, I said when... 
not now ;) [10:06:47] <_joe_> volans: oh I re-did that patch this morning :P [10:07:03] <_joe_> but I made it global [10:07:16] <_joe_> we have 95% of prod relying on discovery hosts [10:07:28] I was following https://phabricator.wikimedia.org/T160178#3136906 ;) [10:07:32] yeah I'm looking at what's up with T161836 then I'll look at those patches too [10:07:33] T161836: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161836 [10:07:37] what are those discovery host? Is that gdnsd load balancing between datacenters? [10:07:49] (eg to replace eqiad/cofwd with 'discovery' ?) [10:07:56] <_joe_> hashar: doing geoip lookup [10:08:09] <_joe_> and we can flip on/off a dc via conftool [10:08:12] but with private ip and dc aware ? [10:08:20] <_joe_> yes, more or less [10:08:33] ahh and conftool feed gdnsd i guess? [10:09:12] hashar: conftool controls if a DC is up/down for gdnsd [10:09:19] for a given record [10:09:24] <_joe_> we still need to write the docs [10:09:32] <_joe_> but this is all very very fresh [10:10:11] _joe_ nice naming in https://gerrit.wikimedia.org/r/345810, finally we have something meaningful! [10:10:57] <_joe_> not really great, tbh, but you know, I'm not great at naming things [10:11:21] kudos on that system really [10:13:03] PROBLEM - puppet last run on mc1019 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/redis/replica/] [10:14:03] RECOVERY - puppet last run on mc1019 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [10:14:36] <_joe_> so this is now fixed :) [10:14:55] \o/ [10:15:10] and deployment-memc04 is also fixed [10:15:22] <_joe_> yeah expected [10:15:43] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:15:55] <_joe_> oh, happening again, good [10:16:00] <_joe_> I can take a look [10:16:33] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [10:16:33] "good" [10:16:35] :D [10:16:36] lol [10:16:59] <_joe_> elukey: the alternative was us not being able to debug this [10:17:06] the jobrunners are scared by Giuseppe [10:17:14] I know I know was kidding :) [10:17:58] <_joe_> oblivian@mw1168:~$ hhvmadm check-health [10:17:58] <_joe_> { [10:17:58] <_joe_> "load":20 [10:18:13] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [10:18:29] <_joe_> it's just a bit overwhelmed with requests [10:18:36] <_joe_> I think I know how to "fix" this [10:19:43] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:20:01] <_joe_> hhvmadm check-health [10:20:01] <_joe_> { [10:20:01] <_joe_> "load":20 [10:20:01] <_joe_> , "queued":9 [10:20:16] <_joe_> so there is an unbalance between jobrunner threads and hhvm threads [10:20:23] <_joe_> I'll fix this. 
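Note on the check-health readings above: the pasted output shows "load":20 with "queued":9, which _joe_ reads as the jobrunner sending more concurrent requests than HHVM has threads to serve. A minimal sketch of that reading follows. It assumes that "load" counts busy worker threads and that the thread limit is 20; the helper name is_saturated and the sample JSON are hypothetical, rebuilt only from the two fields visible in the paste.

```python
import json


def is_saturated(health: dict, max_threads: int) -> bool:
    """Return True when every worker thread is busy and requests are waiting.

    Assumption: "load" is the number of busy threads and "queued" the number
    of requests waiting for a free thread.
    """
    return health.get("load", 0) >= max_threads and health.get("queued", 0) > 0


# Hypothetical sample: only the two fields shown in the log are included;
# the real hhvmadm output is truncated in the paste.
sample = json.loads('{"load": 20, "queued": 9}')
print(is_saturated(sample, max_threads=20))
```

With a queue building while all threads are busy, new requests (including the health probe) wait, which would be consistent with the 10 second socket timeouts reported by the check.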
[10:20:23] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:20:39] <_joe_> apparently we have a shitton of uploads currently [10:21:20] 06Operations, 07HHVM: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ) - https://phabricator.wikimedia.org/T161684#3146223 (10MoritzMuehlenhoff) I'm building our internal HHVM extensions against the HHVM packages provided by Facebook, to gather an additional data point [10:21:33] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [10:22:23] (03PS1) 10Filippo Giunchedi: swift: increase max_connections for object server rsync [puppet] - 10https://gerrit.wikimedia.org/r/345816 (https://phabricator.wikimedia.org/T160640) [10:23:13] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [10:23:15] so more hhvm threads ? [10:23:25] _joe_: yeah our users are creative. We have a bot importing the whole Panoramio to commons or roughly 3 millions pictures [10:23:50] <_joe_> elukey: not necessarily [10:24:42] okok [10:24:53] I was trying to understand the proposed fix :) [10:26:34] (03CR) 10Filippo Giunchedi: [C: 032] swift: increase max_connections for object server rsync [puppet] - 10https://gerrit.wikimedia.org/r/345816 (https://phabricator.wikimedia.org/T160640) (owner: 10Filippo Giunchedi) [10:31:31] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161836#3144779 (10Ymblanter) One more case here: https://commons.wikimedia.org/w/index.php?title=Commons:%D0%A4%D0%BE%D1%80%D1%83%D0%BC§ion=8#File:50_.D0.B4.D0.B... 
[10:32:03] RECOVERY - puppet last run on ms-be1031 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [10:32:36] (03PS1) 10Giuseppe Lavagetto: role::mediawiki::videoscaler: deduce parameters from number of cpus [puppet] - 10https://gerrit.wikimedia.org/r/345817 [10:32:44] <_joe_> elukey: ^^ [10:32:50] <_joe_> not sure it works [10:33:15] <_joe_> but it makes it so that total jobs will be equal to the number of hhvm threads running [10:36:01] interesting [10:36:28] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:36:48] _joe_ so the idea, if I got it correctly, is to start with hhvm threads == 70% of the number of cpus [10:37:18] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 5.213 second response time [10:37:28] then then number of runners and prioritized runners together is the same, but divided in two proportions [10:37:50] (is the same == is the same 70% number got for hhvm threads) [10:38:04] <_joe_> elukey: yeah alas it doesn't work [10:38:28] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:38:56] ah yes https://puppet-compiler.wmflabs.org/5985/mw1168.eqiad.wmnet/ [10:38:58] <_joe_> I'll move that from hiera to the class declaration [10:39:28] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:39:51] <_joe_> also, the patch was conceptually wrong [10:39:58] <_joe_> heh, way to go, joe :P [10:40:18] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 3.982 second response time [10:40:19] (03PS1) 10Alexandros Kosiaris: Fix calls to strings.Replace in admission_controller [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/345818 [10:41:42] (03CR) 10Alexandros Kosiaris: "Build completed successfully patch in https://gerrit.wikimedia.org/r/#/c/345818/" [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/343797 (owner: 10Yuvipanda) 
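Note on the sizing idea discussed above (hhvm threads at 70% of the CPU count, with the transcode and transcode_prioritized runners together adding up to that same number): the arithmetic can be sketched as below. This is not the Puppet patch itself. The function name size_runners and the 60/40 split between prioritized and regular runners are assumptions chosen for illustration; the log only says the total is divided "in two proportions".

```python
def size_runners(cpus: int, thread_pct: int = 70, prioritized_pct: int = 60) -> dict:
    """Derive HHVM thread and runner counts from the CPU count.

    Integer percentages avoid floating point rounding surprises.
    prioritized_pct is a hypothetical split, not a value from the log.
    """
    # HHVM threads: 70% of the CPUs, never fewer than one.
    hhvm_threads = max(1, cpus * thread_pct // 100)
    # Split the same total between the two runner types so that
    # runners never outnumber the threads available to serve them.
    prioritized = hhvm_threads * prioritized_pct // 100
    regular = hhvm_threads - prioritized
    return {
        "hhvm_threads": hhvm_threads,
        "transcode_prioritized": prioritized,
        "transcode": regular,
    }


sizes = size_runners(32)
print(sizes)
```

Computing the regular runner count as the remainder keeps the two runner counts summing exactly to the thread count, whatever the rounding of the split.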
[10:42:18] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.003 second response time [10:45:09] (03PS1) 10Volans: CLI: set dry-run mode when no commands are specified [software/cumin] - 10https://gerrit.wikimedia.org/r/345820 (https://phabricator.wikimedia.org/T161887) [10:47:25] !log uploaded jessie-wikimedia kubernetes_1.4.6-4 on apt.wikimedia.org/jessie-wikimedia [10:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:58] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161836#3144779 (10fgiunchedi) @Ladsgroup not related to thumbor no since thumbor isn't in production yet (though "thumbor in production" is a Q4 goal now) The timel... [10:50:52] (03CR) 10Giuseppe Lavagetto: [C: 031] "Wonderful, I was about to ask for this!" [software/cumin] - 10https://gerrit.wikimedia.org/r/345820 (https://phabricator.wikimedia.org/T161887) (owner: 10Volans) [10:51:43] (03Abandoned) 10Alexandros Kosiaris: profile::docker::builder: Conditionalize hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/320985 (owner: 10Alexandros Kosiaris) [10:52:29] (03Abandoned) 10Alexandros Kosiaris: profile::docker::builder: Make the hiera calls parameters [puppet] - 10https://gerrit.wikimedia.org/r/320794 (owner: 10Alexandros Kosiaris) [10:53:04] (03CR) 10Volans: [C: 032] CLI: set dry-run mode when no commands are specified [software/cumin] - 10https://gerrit.wikimedia.org/r/345820 (https://phabricator.wikimedia.org/T161887) (owner: 10Volans) [10:53:38] (03Merged) 10jenkins-bot: CLI: set dry-run mode when no commands are specified [software/cumin] - 10https://gerrit.wikimedia.org/r/345820 (https://phabricator.wikimedia.org/T161887) (owner: 10Volans) [10:58:40] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files e.g. 
djvu and jpg - https://phabricator.wikimedia.org/T161836#3146297 (10zhuyifei1999) >>! In T161836#3146271, @fgiunchedi wrote: > At the time of T111838 a script was published in https://gerrit.wikimedia.org/r/#/c/2494... [11:00:48] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:01:44] (03CR) 10Alexandros Kosiaris: "Package uploaded to apt.wikimedia.org/jessie-wikimedia" [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/343797 (owner: 10Yuvipanda) [11:01:50] (03CR) 10Alexandros Kosiaris: [C: 032] Fix calls to strings.Replace in admission_controller [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/345818 (owner: 10Alexandros Kosiaris) [11:02:56] (03PS2) 10Giuseppe Lavagetto: role::mediawiki::videoscaler: deduce parameters from number of cpus [puppet] - 10https://gerrit.wikimedia.org/r/345817 [11:04:21] (03Abandoned) 10Alexandros Kosiaris: ldaplist: Allow searching for more than attribute [puppet] - 10https://gerrit.wikimedia.org/r/295475 (owner: 10Alexandros Kosiaris) [11:08:25] <_joe_> uhm I don't like this patch a bit [11:09:34] <_joe_> and I think I know why: it's introducing a difference between videoscalers and normal jobrunners at a logical level [11:09:56] <_joe_> which is ok, just it would need a more deep puppet restructuring to be "clean" [11:10:01] <_joe_> well, w/e [11:10:26] <_joe_> let's see if this works at all, then I can think of improvements [11:10:48] <_joe_> it's mostly a hiera limitation at this point [11:12:56] (03PS3) 10Giuseppe Lavagetto: role::mediawiki::videoscaler: deduce parameters from number of cpus [puppet] - 10https://gerrit.wikimedia.org/r/345817 [11:17:01] (03PS1) 10Jcrespo: Revert "Depool db1066 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345821 [11:18:52] (03CR) 10Marostegui: [C: 031] Revert "Depool db1066 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345821 (owner: 10Jcrespo) 
[11:19:24] (03CR) 10Jcrespo: [C: 032] Revert "Depool db1066 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345821 (owner: 10Jcrespo) [11:19:47] (03PS4) 10Giuseppe Lavagetto: role::mediawiki::videoscaler: deduce parameters from number of cpus [puppet] - 10https://gerrit.wikimedia.org/r/345817 [11:20:30] (03Merged) 10jenkins-bot: Revert "Depool db1066 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345821 (owner: 10Jcrespo) [11:20:39] (03CR) 10jenkins-bot: Revert "Depool db1066 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345821 (owner: 10Jcrespo) [11:22:13] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1066 after maintenance (duration: 00m 49s) [11:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:58] PROBLEM - puppet last run on db1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:28:20] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "For some reason the $hhvm_threads variable doesn't get interpolated in hiera, will look at this later." [puppet] - 10https://gerrit.wikimedia.org/r/345817 (owner: 10Giuseppe Lavagetto) [11:28:48] RECOVERY - puppet last run on labsdb1007 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [11:29:43] (03Abandoned) 10Giuseppe Lavagetto: realm: remove most references to mwprimary where dns discovery should be enough. 
[puppet] - 10https://gerrit.wikimedia.org/r/340999 (owner: 10Giuseppe Lavagetto) [11:30:43] (03Abandoned) 10Giuseppe Lavagetto: discovery: remove app_routes, switch mwprimary [puppet] - 10https://gerrit.wikimedia.org/r/341000 (owner: 10Giuseppe Lavagetto) [11:33:16] (03PS2) 10Giuseppe Lavagetto: profile::cumin::target: add parameters/tags to allow easy selection [puppet] - 10https://gerrit.wikimedia.org/r/343008 [11:37:28] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:38:18] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 6.564 second response time [11:46:40] 06Operations: Wikipedia MonoBook Bug - Alerts and Notices halfway down page - https://phabricator.wikimedia.org/T161892#3146391 (10Hammersfan) [11:50:12] (03PS1) 10Alexandros Kosiaris: changeprop: Add an ores_uris parameter [puppet] - 10https://gerrit.wikimedia.org/r/345826 (https://phabricator.wikimedia.org/T159615) [11:50:14] (03PS1) 10Alexandros Kosiaris: changeprop: Remove the ores_uri parameter [puppet] - 10https://gerrit.wikimedia.org/r/345827 (https://phabricator.wikimedia.org/T159615) [11:55:58] RECOVERY - puppet last run on db1045 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [12:01:15] (03PS3) 10Giuseppe Lavagetto: profile::cumin::target: add parameters/tags to allow easy selection [puppet] - 10https://gerrit.wikimedia.org/r/343008 [12:03:42] <_joe_> akosiaris: you can find the list of datacenters where ores is active from $lvs::configuration :P [12:04:52] hm, that would be useful indeed [12:05:09] !log rebooting pybal-test* to Linux 4.9 [12:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:13] <_joe_> akosiaris: use that as default and allow a hiera override if needed [12:07:23] hmm and I just found we never added the icinga lvs alert from ores in codfw [12:07:25] fixing [12:08:57] (03PS1) 10Alexandros Kosiaris: Add icinga LVS alerts for ORES in 
codfw [puppet] - 10https://gerrit.wikimedia.org/r/345828 [12:09:14] _joe_: hmm $lvs::configuration::lvs_services does not have the URLs in an easy format [12:09:43] <_joe_> akosiaris: alternative: ganglia_clusters has the datacenters defined for every service [12:10:15] <_joe_> but yeah you could extract just the hostname maybe? [12:10:16] <_joe_> sigh [12:10:33] !log rebooting mwdebug* to Linux 4.9 [12:10:34] that's my point.. the hostname is practically nowhere [12:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:28] <_joe_> akosiaris: uhm ok do what you can :P [12:12:40] (03CR) 10Alexandros Kosiaris: [C: 032] Add icinga LVS alerts for ORES in codfw [puppet] - 10https://gerrit.wikimedia.org/r/345828 (owner: 10Alexandros Kosiaris) [12:12:46] (03PS2) 10Alexandros Kosiaris: Add icinga LVS alerts for ORES in codfw [puppet] - 10https://gerrit.wikimedia.org/r/345828 [12:13:00] _joe_: hmm one thing that completely skipped my mind [12:13:20] how are we going to make misc varnish switch from eqiad to codfw ? 
[12:13:32] for ORES that is [12:13:41] looking at the config, ores.svc.eqiad.wmnet is hardcoded [12:13:57] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add icinga LVS alerts for ORES in codfw [puppet] - 10https://gerrit.wikimedia.org/r/345828 (owner: 10Alexandros Kosiaris) [12:14:03] <_joe_> akosiaris: ask ema/bblack [12:14:06] <_joe_> it's their realm [12:14:15] ema: bblack ^ [12:14:26] <_joe_> I proposed to use discovery urls there but I think that doesn't really work for misc [12:14:45] <_joe_> I don't recall the details atm, sorry [12:15:26] (03PS4) 10Giuseppe Lavagetto: profile::cumin::target: add parameters/tags to allow easy selection [puppet] - 10https://gerrit.wikimedia.org/r/343008 [12:16:01] I was running a compiler on the previous one _joe_ :) [12:16:15] <_joe_> volans: that would fail everywhere [12:16:20] <_joe_> I hope not everywhere [12:16:24] actually is stuck :D [12:16:28] https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/5992/console [12:16:35] in the failure [12:16:58] <_joe_> volans: abort it and rebuild [12:17:29] ofc [12:17:48] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:18:15] <_joe_> volans: einsteinium should fail there [12:21:28] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:23:18] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.003 second response time [12:25:58] _joe_ since it seems recurring only for mw116[89], what if we lower down transcode_prioritized and transcode ? 
[12:26:48] maybe the overall transcode count less than the hhvm thread count [12:37:11] or maybe it is a specific job that is heavier than the avg that causes this mess [12:40:54] <_joe_> elukey: yeah that's the solution on the short term ofc [12:41:00] :( [12:42:33] root@mw1168:/var/log/apache2# wc -l jobqueue-access.log [12:42:34] 4039 jobqueue-access.log [12:42:37] 06Operations, 07Puppet, 06Discovery, 06Maps, 03Interactive-Sprint: Puppet fails with "Could not find init script for 'postgresql@9.4-main'" on maps / labs server - https://phabricator.wikimedia.org/T161893#3146449 (10Gehel) [12:42:41] root@mw1168:/var/log/apache2# grep -i prioritized jobqueue-access.log -c [12:42:44] 3609 [12:43:42] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 3 others: Puppet changes required for elasticsearch 5.x upgrade - https://phabricator.wikimedia.org/T155578#3146455 (10Deskana) 05Open>03Resolved a:03Deskana [12:43:53] so maybe we could lower down transcode_prioritized: 12 to 8 and leave transcode to 8? Total of 16 (out of 20 hhvm threads [12:44:14] 06Operations, 06Discovery, 06Discovery-Search (Current work), 13Patch-For-Review: Add elasticsearch 5 .deb to reprepro experimental repository - https://phabricator.wikimedia.org/T159168#3146462 (10Deskana) 05Open>03Resolved a:03Deskana [12:44:44] PROBLEM - puppet last run on db1044 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:45:44] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [12:49:24] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: [epic] System level upgrade for cirrus / elasticsearch - https://phabricator.wikimedia.org/T151324#3146506 (10Deskana) [12:51:02] (03PS1) 10Elukey: Lower down transcode_prioritized for mw116[89] [puppet] - 10https://gerrit.wikimedia.org/r/345829 [12:52:49] tentatively submitted --^ [12:59:25] <_joe_> elukey: can you look at my change (that does basically the same thing) and try to understand why the hell it doesn't work for one specific hiera lookup [13:00:22] <_joe_> but yeah, we can do that for now [13:00:59] ahhh okok sorry I thought you wanted to do discard it [13:01:31] <_joe_> no I didn't abandon it [13:01:45] <_joe_> but yeah let's do this for now [13:02:21] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Let's get it to 10, not 8" [puppet] - 10https://gerrit.wikimedia.org/r/345829 (owner: 10Elukey) [13:02:43] (03CR) 10BBlack: "I never merged this because it seemed like we could do even better. I have yet to find a formula I'm completely happy with. Should defin" [puppet] - 10https://gerrit.wikimedia.org/r/324230 (owner: 10BBlack) [13:05:50] (03PS2) 10Giuseppe Lavagetto: authdns: add discovery record for restbase-async [puppet] - 10https://gerrit.wikimedia.org/r/345805 [13:05:57] _joe_ ok so merging with 10 and then work on your patch later on? 
[13:06:04] (03PS2) 10Elukey: Lower down transcode_prioritized for mw116[89] [puppet] - 10https://gerrit.wikimedia.org/r/345829 [13:06:06] <_joe_> elukey: on a proper patch, yes [13:07:15] <_joe_> bblack: ^^ I'm adding a separate discovery entry that for now will point to the same rb LVSs, for now, to be consumed by changeprop [13:07:37] <_joe_> in the future (I hope not-so-distant) it will be a separate LVS pool [13:08:42] <_joe_> this allows us to flip rb traffic over as we prefer [13:09:27] !log rebooting bromine to Linux 4.9 [13:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:05] (03Abandoned) 10Faidon Liambotis: memcached: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345555 (owner: 10Faidon Liambotis) [13:12:44] RECOVERY - puppet last run on db1044 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [13:13:09] (03CR) 10Elukey: [C: 032] Lower down transcode_prioritized for mw116[89] [puppet] - 10https://gerrit.wikimedia.org/r/345829 (owner: 10Elukey) [13:15:01] _joe_: ok [13:16:24] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:17:43] mmm [13:18:24] PROBLEM - puppet last run on mw1261 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:18:53] <_joe_> lol [13:19:17] elukey@mw1168:~$ hhvmadm check-health [13:19:17] { "load":20 [13:19:17] , "queued":11 [13:19:23] !log rolling restart of maps-test cluster for kernel upgrade [13:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:24] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:20:51] you guys are so kind [13:21:12] probably you didn't like the jobrunner restart right? 
[13:21:38] (03PS1) 10Giuseppe Lavagetto: Add entry for restbase-async [dns] - 10https://gerrit.wikimedia.org/r/345832 [13:21:42] <_joe_> elukey: yeah you need to also kill and restart hhvm [13:21:44] <_joe_> I guess [13:21:52] I hoped to avoid it.. sigh [13:21:57] <_joe_> why [13:22:08] <_joe_> when the jobrunner is stopped, you can safely do it [13:22:14] <_joe_> without interruption of service [13:22:44] PROBLEM - tileratorui on maps-test2002 is CRITICAL: connect to address 10.192.0.129 and port 6535: Connection refused [13:23:04] PROBLEM - Postgres Replication Lag on maps-test2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:23:14] !log restart hhvm on mw116[89] after https://gerrit.wikimedia.org/r/345829 [13:23:14] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.027 second response time [13:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:44] maps is me, too fast for icinga, sorry [13:25:14] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.010 second response time [13:25:15] akosiaris: https://wikitech.wikimedia.org/wiki/Global_traffic_routing#Cache-to-application_routing [13:26:27] <_joe_> I'd say we should make ores active-active [13:27:01] ema: nice doc! [13:27:24] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:27:35] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161836#3146547 (10fgiunchedi) >>! In T161836#3146297, @zhuyifei1999 wrote: >>>! In T161836#3146271, @fgiunchedi wrote: >> At the time of T111838 a script was publish... [13:27:37] elukey: credit to bblack! 
:) [13:27:46] (03CR) 10Giuseppe Lavagetto: [C: 032] authdns: add discovery record for restbase-async [puppet] - 10https://gerrit.wikimedia.org/r/345805 (owner: 10Giuseppe Lavagetto) [13:27:52] (03PS3) 10Giuseppe Lavagetto: authdns: add discovery record for restbase-async [puppet] - 10https://gerrit.wikimedia.org/r/345805 [13:27:59] _joe_: When is ETA for active/active dc (for everything)? [13:28:14] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [13:28:15] <_joe_> for everything ? [13:28:18] <_joe_> what do you mean? [13:28:27] <_joe_> like mediawiki? [13:28:31] yeah [13:28:47] <_joe_> everything else is already active/active capable or actively active/active [13:28:55] <_joe_> mediawiki, I have no idea [13:29:16] <_joe_> I think it's part of next year's plan, and I'm not really the person to give you an ETA on that [13:29:32] _joe_: manyana [13:29:34] okay, if we are going to do it soon, maybe ores should get a head start [13:29:52] Thanks [13:30:29] Amir1: FWIW ores is practically ready to be active/active [13:30:34] PROBLEM - Check systemd state on maps-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:30:44] RECOVERY - tileratorui on maps-test2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.090 second response time [13:30:45] the only issue is cache warmup [13:30:55] PROBLEM - puppet last run on maps-test2001 is CRITICAL: CRITICAL: Puppet has 24 failures. Last run 5 minutes ago with 24 failures. 
Failed resources (up to 3 shown): Service[postgresql@9.4-main],Exec[create_user-kartotherian],Exec[create_user-monitoring@maps2004],Exec[create_user-replication@maps2004] [13:31:12] Amir1: and we may very well try it active/active during the 2 weeks of the switchover [13:31:18] I saw your patches in change prop [13:31:51] it would be great to have a doubled up capacity (even more I guess, I saw scb2005) [13:32:37] (03CR) 10Giuseppe Lavagetto: [C: 032] Add entry for restbase-async [dns] - 10https://gerrit.wikimedia.org/r/345832 (owner: 10Giuseppe Lavagetto) [13:32:37] Amir1: I did not mean it that way. I meant mirrored [13:32:56] well, in a way it would be doubled now that I think about it [13:33:03] as in fewer req/s for each DC [13:33:19] yeah [13:33:23] <_joe_> don't think in those terms [13:33:32] but req/s are unimportant for ORES tbh [13:33:33] <_joe_> one dc should be able to withstand all of our traffic [13:33:34] PROBLEM - Check systemd state on maps-test2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:33:34] PROBLEM - Check systemd state on maps-test2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:33:34] PROBLEM - Check systemd state on maps-test2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:33:37] it's the workers that are important [13:33:48] yes that's true [13:33:54] PROBLEM - puppet last run on maps-test2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[postgresql@9.4-main] [13:33:56] PROBLEM - puppet last run on maps-test2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[postgresql@9.4-main] [13:33:56] PROBLEM - puppet last run on maps-test2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): Service[postgresql@9.4-main] [13:34:32] <_joe_> we work so that if we lose one dc, we should allow all the traffic to the rest [13:34:45] Amir1: FWIW I know next Q there is going to be procurement for ORES which is projected to meet the needs for up to 2018 at the least [13:34:47] <_joe_> so if we had 3 dcs, we should be able to withstand the traffic with 2 of them [13:34:49] 06/2018 that is [13:35:04] PROBLEM - Check systemd state on pybal-test2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:35:19] ema I am having a hard time figuring something out in https://wikitech.wikimedia.org/wiki/Global_traffic_routing#Cache-to-application_routing [13:35:22] <_joe_> akosiaris: can you check ^^ [13:35:38] <_joe_> what is failing in pybal-test2003 I mean [13:35:45] I was about to ask [13:35:51] I think moritzm was on it [13:36:08] Thanks for the explanation, noted [13:36:15] redis-instance-tcp_6371.service [13:36:16] lol [13:36:18] _joe_: ^ [13:36:27] I have no idea why they have a redis installed [13:36:37] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161836#3146555 (10fgiunchedi) I can successfully see https://commons.wikimedia.org/wiki/File:Fawiki500k_celebration_by_Behdad_Abedi_(180).jpg now, ditto for https://... [13:36:40] maybe I should just purge them [13:36:50] ema: I am having a hard time understanding something [13:37:00] ema: in https://wikitech.wikimedia.org/wiki/Global_traffic_routing#Cache-to-application_routing [13:37:09] Within each backends stanza, the primary site listed on the left names the site where the traffic would exit the cache layer, and the hostname on the right is the applayer hostname it will contact to do so. The code which operates on this data doesn't actually care whether the hostname on the right is actually within the site named on the left. 
This allows for interesting operational possibilities such as: [13:37:11] ok [13:37:19] and then [13:37:21] This would cause inter-cache routing to behave like an active/active service (dropping from the cache to the applayer directly at both primary sites), but both site's caches will contact only the eqiad applayer service [13:37:42] it's the latter I don't get [13:37:53] (03PS2) 10Faidon Liambotis: mediawiki: kill more precise references [puppet] - 10https://gerrit.wikimedia.org/r/345546 [13:37:55] (03PS2) 10Faidon Liambotis: hhvm: kill a precise reference [puppet] - 10https://gerrit.wikimedia.org/r/345547 [13:37:57] (03PS2) 10Faidon Liambotis: apache: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345548 [13:37:59] (03PS2) 10Faidon Liambotis: installserver: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345549 [13:38:01] (03PS2) 10Faidon Liambotis: aptrepo: remove precise-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/345550 [13:38:02] aaah I think I just did [13:38:03] (03PS2) 10Faidon Liambotis: elasticsearch: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345552 [13:38:05] (03PS2) 10Faidon Liambotis: puppetmaster: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345554 [13:38:07] (03PS2) 10Faidon Liambotis: openstack/nova: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345556 [13:38:09] (03PS1) 10Faidon Liambotis: ssh: update comments to remove precise mentions [puppet] - 10https://gerrit.wikimedia.org/r/345834 [13:38:11] (03PS1) 10Faidon Liambotis: puppet: remove fail() guard for precise [puppet] - 10https://gerrit.wikimedia.org/r/345835 [13:38:19] (03PS2) 10Faidon Liambotis: ci: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345557 [13:38:21] (03PS2) 10Faidon Liambotis: interface: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345564 [13:38:22] yeah I 'll rewrite it then a bit to make it less confusing [13:38:23] 
(03PS2) 10Faidon Liambotis: aptly: remove special case to remove multiarch support on precise [puppet] - 10https://gerrit.wikimedia.org/r/345371 (https://phabricator.wikimedia.org/T111760) (owner: 10Dzahn) [13:38:25] (03PS2) 10Faidon Liambotis: lxc: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345558 [13:38:27] (03PS2) 10Faidon Liambotis: toollabs: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345559 [13:38:29] (03PS2) 10Faidon Liambotis: vagrant: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345563 [13:38:31] (03PS1) 10Faidon Liambotis: package_builder: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345836 [13:38:33] (03PS1) 10Faidon Liambotis: postgresql: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345837 [13:38:35] (03PS1) 10Faidon Liambotis: releases: remove the precise suite [puppet] - 10https://gerrit.wikimedia.org/r/345838 [13:38:37] (03PS1) 10Faidon Liambotis: zookeeper: remove precise's package version [puppet] - 10https://gerrit.wikimedia.org/r/345839 [13:38:39] (03PS1) 10Faidon Liambotis: labs_vagrant: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345840 [13:39:33] akosiaris: with that example, both eqiad and codfw entries point to appservers.svc.eqiad.wmnet [13:40:04] yes but I failed to parse in the beginning the fact that both eqiad/codfw had an eqiad hostname and after that nothing made sense [13:40:04] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [13:40:05] (03PS1) 10Jcrespo: mariadb: Deprecate precise repository usage [puppet] - 10https://gerrit.wikimedia.org/r/345841 [13:40:19] I am adding a note about that [13:40:24] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:26] akosiaris: thanks! 
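[editor's note] The routing point discussed above (both the eqiad and codfw entries pointing at appservers.svc.eqiad.wmnet) can be sketched as a small model. This is only an illustration under stated assumptions: the dictionary layout, the function names, and the codfw hostname in the second mapping are hypothetical and are not taken from the actual cache configuration. Only the eqiad hostname appears in the conversation.

```python
from typing import Dict, List

# Hypothetical model of a "backends" stanza: the key is the cache site where
# traffic exits the cache layer, the value is the applayer hostname contacted.
# Both sites point at the eqiad hostname, as in the example ema describes.
BOTH_TO_EQIAD: Dict[str, str] = {
    "eqiad": "appservers.svc.eqiad.wmnet",
    "codfw": "appservers.svc.eqiad.wmnet",
}

# Assumed contrast case: each site contacts an applayer in its own DC.
# The codfw hostname here is an illustrative guess, not from the log.
EACH_TO_LOCAL: Dict[str, str] = {
    "eqiad": "appservers.svc.eqiad.wmnet",
    "codfw": "appservers.svc.codfw.wmnet",
}


def exit_sites(backends: Dict[str, str]) -> List[str]:
    """Sites where requests drop from the cache layer to the applayer."""
    return sorted(backends)


def applayer_sites(backends: Dict[str, str]) -> List[str]:
    """Sites that actually serve applayer traffic, read from the hostname.

    Assumes hostnames shaped like <service>.svc.<site>.wmnet.
    """
    return sorted({host.split(".")[-2] for host in backends.values()})


def describe(backends: Dict[str, str]) -> str:
    """Summarise how the cache layer and the applayer behave for a mapping."""
    caches = "active/active" if len(exit_sites(backends)) > 1 else "single-site"
    apps = "active/active" if len(applayer_sites(backends)) > 1 else "single-site"
    return f"cache exit: {caches}; applayer: {apps}"


if __name__ == "__main__":
    print(describe(BOTH_TO_EQIAD))
    print(describe(EACH_TO_LOCAL))
```

In the first mapping the cache layer exits at both sites while only one applayer is ever contacted, which is the part akosiaris says was easy to misread.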
[13:40:50] thanks mw1169 [13:40:57] always a pleasure [13:41:00] ema: btw, this construct is supported for misc as well, right? [13:41:19] cause then it would make it very easy for ORES to effectively become active/active at all layers [13:41:49] mw1169: hhvm load 20 queued 2 [13:42:14] akosiaris: yes, supported on all clusters [13:42:33] Amir1: btw, we should revisit https://phabricator.wikimedia.org/T148999, I'd like to understand better why ORES isn't cacheable/cached currently [13:42:37] ema: great, thanks! [13:42:53] akosiaris, I got postgres replication kickstarted [13:43:16] jynus: \o/ [13:43:22] but I think there are changes needed on the puppetization [13:43:24] (03PS5) 10Giuseppe Lavagetto: profile::cumin::target: add parameters/tags to allow easy selection [puppet] - 10https://gerrit.wikimedia.org/r/343008 [13:43:36] s/all/replication/ [13:43:38] <_joe_> volans: merging ^^ [13:43:43] I will see if I can see where to send a patch [13:43:46] let me see the last diff [13:43:54] <_joe_> yeah please do [13:44:03] so removed the undef stuff? [13:44:08] <_joe_> yeah [13:44:13] jynus: please do [13:44:17] <_joe_> I realized that happens for DBs too [13:44:18] I guess we have cases where it is not true :D [13:44:20] <_joe_> so it's useless [13:44:39] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/343008 (owner: 10Giuseppe Lavagetto) [13:44:53] <_joe_> the only use would be for me to have an easy way to determine how many systems have no role() in site.pp [13:44:54] what happens to DBs?
[13:45:08] <_joe_> jynus: nothing happens TO dbs [13:45:18] <_joe_> something happens in puppet FOR DBs [13:45:24] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:40] <_joe_> and nothing worrying; just that since you don't use role() in site.pp [13:45:46] 06Operations, 07Documentation: Improve SSH access information in onboarding documentation - https://phabricator.wikimedia.org/T160941#3146572 (10Milimetric) [13:45:54] yeah I thought the same, that it would have been an easy way to check it, but we should be able to do it just by negating the ~ "role::.*" [13:46:00] (03CR) 10jerkins-bot: [V: 04-1] releases: remove the precise suite [puppet] - 10https://gerrit.wikimedia.org/r/345838 (owner: 10Faidon Liambotis) [13:46:25] 06Operations, 07Documentation: Improve SSH access information in onboarding documentation - https://phabricator.wikimedia.org/T160941#3115695 (10Milimetric) @Tbayer, I copied Zareen's notes in the description, please add anything else that you think is painful about SSH access. [13:46:44] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::cumin::target: add parameters/tags to allow easy selection [puppet] - 10https://gerrit.wikimedia.org/r/343008 (owner: 10Giuseppe Lavagetto) [13:46:54] <_joe_> come on jenkins [13:46:58] * _joe_ blames paladox [13:47:00] ok, for? [13:47:01] I want to help if I can with orchestration [13:47:03] <_joe_> uh sorry paladox [13:47:09] * _joe_ blames paravoid [13:47:17] lol [13:47:20] tab completion gone bad [13:47:21] <_joe_> jynus: targeting in cumin [13:47:27] <_joe_> akosiaris: yeah twice in 20 minutes [13:47:37] (03PS1) 10Alexandros Kosiaris: Remove redis from pybal-test2003 [puppet] - 10https://gerrit.wikimedia.org/r/345842 [13:47:42] uh?
[13:47:46] I will be using role soon [13:47:54] <_joe_> paravoid: you submitted 20 patches or so [13:48:01] <_joe_> so the queue in CI is enormous [13:48:01] I have some blockers on refactoring [13:48:35] I need some hiera entries, isolating some classes, a proper common class [13:48:40] it takes some time [13:48:47] in fact there is this weird condition I have seen like once or twice in CI [13:48:55] where if you submit a long series of patches [13:49:04] zuul or something decides to commit harakiri [13:49:13] jynus: no worries, this is unrelated [13:49:24] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 7.775 second response time [13:49:30] <_joe_> akosiaris: ahah [13:49:35] at least that was my impression that one time I noticed it [13:49:38] <_joe_> elukey: are you looking at mw1168? [13:49:53] <_joe_> seems something didn't go right with the restart after the change [13:49:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] [13:49:57] oh [13:50:10] ah, there you go :P [13:50:12] _joe_: it recovered right after my restart but then failed again [13:50:15] yeah well, that's CI's problem [13:50:22] yup [13:50:35] it's not like I queued gcc builds [13:51:24] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:58] (03PS2) 10Alexandros Kosiaris: Remove redis from pybal-test2003 [puppet] - 10https://gerrit.wikimedia.org/r/345842 [13:52:05] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Remove redis from pybal-test2003 [puppet] - 10https://gerrit.wikimedia.org/r/345842 (owner: 10Alexandros Kosiaris) [13:52:40] (03PS2) 10Jcrespo: mariadb: Deprecate precise repository usage [puppet] - 10https://gerrit.wikimedia.org/r/345841 [13:53:02] (03CR) 10Muehlenhoff: [C: 031] package_builder: remove precise support [puppet] - 
10https://gerrit.wikimedia.org/r/345836 (owner: 10Faidon Liambotis) [13:54:14] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [13:54:14] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [13:54:21] the contint1001 alarm should self resolve [13:54:24] it is a one time spike [13:54:51] (03CR) 10Alexandros Kosiaris: [C: 032] "Tests in PCC seemed fine and T143349 is resolved now. Merging" [puppet] - 10https://gerrit.wikimedia.org/r/334155 (owner: 10Alexandros Kosiaris) [13:55:01] (03PS4) 10Alexandros Kosiaris: realm.pp: Remove the pre 3.5 puppet handling code [puppet] - 10https://gerrit.wikimedia.org/r/334155 [13:55:25] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] realm.pp: Remove the pre 3.5 puppet handling code [puppet] - 10https://gerrit.wikimedia.org/r/334155 (owner: 10Alexandros Kosiaris) [13:56:02] (03CR) 10Ottomata: "But there's https://gerrit.wikimedia.org/r/#/c/320197 too, right/ Do we need both of these?" 
[puppet] - 10https://gerrit.wikimedia.org/r/319071 (owner: 10Muehlenhoff) [13:56:48] (03PS3) 10Elukey: Prepare analytics1027 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/345117 (https://phabricator.wikimedia.org/T161597) [13:57:00] (03PS1) 10Gehel: maps - tuning options not available before postgresql 9.5 [puppet] - 10https://gerrit.wikimedia.org/r/345843 [13:57:11] (03PS1) 10Giuseppe Lavagetto: conftool-data: add entries for restbase-async [puppet] - 10https://gerrit.wikimedia.org/r/345844 [13:57:19] (03CR) 10Ottomata: [C: 031] role::analytics_cluster::hadoop::standby: Enable base::firewall in the role [puppet] - 10https://gerrit.wikimedia.org/r/341292 (owner: 10Muehlenhoff) [13:57:24] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:57:29] (03CR) 10Marostegui: [C: 031] mariadb: Deprecate precise repository usage [puppet] - 10https://gerrit.wikimedia.org/r/345841 (owner: 10Jcrespo) [13:57:32] (03PS2) 10Giuseppe Lavagetto: conftool-data: add entries for restbase-async [puppet] - 10https://gerrit.wikimedia.org/r/345844 [13:57:34] (03PS2) 10Gehel: maps - tuning options not available before postgresql 9.5 [puppet] - 10https://gerrit.wikimedia.org/r/345843 [13:57:43] (03PS3) 10Marostegui: site.pp,linux-host-entries.ttyS1: Remove db1057 [puppet] - 10https://gerrit.wikimedia.org/r/345545 (https://phabricator.wikimedia.org/T160435) [13:58:12] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [13:58:15] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] conftool-data: add entries for restbase-async [puppet] - 10https://gerrit.wikimedia.org/r/345844 (owner: 10Giuseppe Lavagetto) [13:58:35] (03PS3) 10Andrew Bogott: aptly: remove special case to remove multiarch support on precise [puppet] - 10https://gerrit.wikimedia.org/r/345371 (https://phabricator.wikimedia.org/T111760) (owner: 10Dzahn) [13:58:54] (03CR) 10Ottomata: [C: 031] Check the Request Authorization 
Header for '%u' [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/317825 (owner: 10R4q3NWnUx2CEhVyr) [13:59:25] (03PS4) 10Andrew Bogott: aptly: remove special case to remove multiarch support on precise [puppet] - 10https://gerrit.wikimedia.org/r/345371 (https://phabricator.wikimedia.org/T111760) (owner: 10Dzahn) [13:59:32] (03CR) 10Andrew Bogott: [C: 032] aptly: remove special case to remove multiarch support on precise [puppet] - 10https://gerrit.wikimedia.org/r/345371 (https://phabricator.wikimedia.org/T111760) (owner: 10Dzahn) [13:59:41] (03PS3) 10Gehel: maps - tuning options not available before postgresql 9.5 [puppet] - 10https://gerrit.wikimedia.org/r/345843 [14:00:04] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=restbase-async,name=codfw [14:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:02] (03CR) 10Gehel: [C: 032] maps - tuning options not available before postgresql 9.5 [puppet] - 10https://gerrit.wikimedia.org/r/345843 (owner: 10Gehel) [14:01:27] (03PS5) 10Andrew Bogott: aptly: remove special case to remove multiarch support on precise [puppet] - 10https://gerrit.wikimedia.org/r/345371 (https://phabricator.wikimedia.org/T111760) (owner: 10Dzahn) [14:01:56] (03PS2) 10Giuseppe Lavagetto: changeprop: switch to discovery url [puppet] - 10https://gerrit.wikimedia.org/r/345806 [14:02:16] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3146611 (10jcrespo) a:03jcrespo I am kickstarting the replication right now- this required different pupetization of the repliation "grants". I wi... 
[14:02:32] RECOVERY - Check systemd state on maps-test2001 is OK: OK - running: The system is fully operational [14:03:02] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] [14:03:02] RECOVERY - puppet last run on maps-test2001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [14:03:27] (03CR) 10Giuseppe Lavagetto: [C: 032] changeprop: switch to discovery url [puppet] - 10https://gerrit.wikimedia.org/r/345806 (owner: 10Giuseppe Lavagetto) [14:03:53] (03PS3) 10Giuseppe Lavagetto: changeprop: switch to discovery url [puppet] - 10https://gerrit.wikimedia.org/r/345806 [14:03:54] <_joe_> wat [14:04:02] RECOVERY - puppet last run on maps-test2002 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [14:04:02] RECOVERY - Check systemd state on maps-test2003 is OK: OK - running: The system is fully operational [14:04:10] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] changeprop: switch to discovery url [puppet] - 10https://gerrit.wikimedia.org/r/345806 (owner: 10Giuseppe Lavagetto) [14:04:32] RECOVERY - Check systemd state on maps-test2002 is OK: OK - running: The system is fully operational [14:04:32] RECOVERY - Check systemd state on maps-test2004 is OK: OK - running: The system is fully operational [14:05:02] RECOVERY - Postgres Replication Lag on maps-test2004 is OK: OK - Rep Delay is: 48.80375 Seconds [14:05:03] RECOVERY - puppet last run on maps-test2003 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [14:05:03] RECOVERY - puppet last run on maps-test2004 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [14:05:16] (03PS3) 10Andrew Bogott: openstack/nova: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345556 (owner: 10Faidon Liambotis) [14:05:32] (03PS1) 10Alexandros Kosiaris: elasticsearch: Fix ERB 
instance variable notation [puppet] - 10https://gerrit.wikimedia.org/r/345845 [14:07:05] (03CR) 10Andrew Bogott: [C: 032] openstack/nova: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345556 (owner: 10Faidon Liambotis) [14:08:02] RECOVERY - Check systemd state on pybal-test2003 is OK: OK - running: The system is fully operational [14:08:09] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161836#3146624 (10fgiunchedi) p:05Unbreak!>03High Since some files linked here seem to 200 now (instead of 404) I'm lowering to "high", I'll keep looking at what... [14:08:57] (03CR) 10Muehlenhoff: "Yeah, both are required. 319071 needs to be merged before 320197 can happen." [puppet] - 10https://gerrit.wikimedia.org/r/319071 (owner: 10Muehlenhoff) [14:09:40] <_joe_> !log performing a rolling restart of changeprop after puppet runs on scb* [14:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:58] (03CR) 10Ema: [C: 031] "LGTM and to pcc https://puppet-compiler.wmflabs.org/5996/" [puppet] - 10https://gerrit.wikimedia.org/r/345117 (https://phabricator.wikimedia.org/T161597) (owner: 10Elukey) [14:11:02] (03PS4) 10Elukey: Prepare analytics1027 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/345117 (https://phabricator.wikimedia.org/T161597) [14:11:38] (03PS3) 10Andrew Bogott: toollabs: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345559 (owner: 10Faidon Liambotis) [14:11:40] (03CR) 10Hashar: [C: 031] toollabs: drop precise-related monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/337207 (https://phabricator.wikimedia.org/T143349) (owner: 10Dzahn) [14:12:12] (03CR) 10Hashar: [C: 031] hhvm: kill a precise reference [puppet] - 10https://gerrit.wikimedia.org/r/345547 (owner: 10Faidon Liambotis) [14:13:02] PROBLEM - Check systemd state on pybal-test2003 is CRITICAL: CRITICAL - degraded: The 
system is operational but one or more units failed. [14:14:04] (03Abandoned) 10Hashar: package_builder: remove Precise images [puppet] - 10https://gerrit.wikimedia.org/r/345801 (owner: 10Hashar) [14:14:49] (03CR) 10Hashar: "On copper.eqiad.wmnet, one will have to do some manual clean up and remove:" [puppet] - 10https://gerrit.wikimedia.org/r/345836 (owner: 10Faidon Liambotis) [14:14:59] (03CR) 10Andrew Bogott: [C: 032] toollabs: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345559 (owner: 10Faidon Liambotis) [14:15:07] (03CR) 10Hashar: [C: 031] package_builder: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345836 (owner: 10Faidon Liambotis) [14:15:47] (03CR) 10Elukey: [C: 032] Prepare analytics1027 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/345117 (https://phabricator.wikimedia.org/T161597) (owner: 10Elukey) [14:15:53] (03PS5) 10Elukey: Prepare analytics1027 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/345117 (https://phabricator.wikimedia.org/T161597) [14:16:01] (03CR) 10Elukey: [V: 032 C: 032] Prepare analytics1027 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/345117 (https://phabricator.wikimedia.org/T161597) (owner: 10Elukey) [14:17:22] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:17:29] (03PS1) 10Jcrespo: osm-labs: Fix replication credentials [puppet] - 10https://gerrit.wikimedia.org/r/345847 (https://phabricator.wikimedia.org/T157359) [14:18:43] (03PS2) 10Jcrespo: osm-labs: Fix replication credentials [puppet] - 10https://gerrit.wikimedia.org/r/345847 (https://phabricator.wikimedia.org/T157359) [14:18:49] (03PS3) 10Andrew Bogott: toollabs: drop precise-related monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/337207 (https://phabricator.wikimedia.org/T143349) (owner: 10Dzahn) [14:19:08] (03CR) 10Jcrespo: "Let's wait for the script to finish to see if this worked." [puppet] - 10https://gerrit.wikimedia.org/r/345847 (https://phabricator.wikimedia.org/T157359) (owner: 10Jcrespo) [14:19:27] (03PS1) 10Hashar: tests: no more ignore postgresql spec [puppet] - 10https://gerrit.wikimedia.org/r/345849 [14:20:22] PROBLEM - Hue Server on analytics1027 is CRITICAL: PROCS CRITICAL: 0 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue [14:21:10] fixing icinga --^ [14:21:23] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [14:21:27] (03Abandoned) 10Andrew Bogott: toollabs: drop precise-related monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/337207 (https://phabricator.wikimedia.org/T143349) (owner: 10Dzahn) [14:25:02] RECOVERY - Check systemd state on pybal-test2003 is OK: OK - running: The system is fully operational [14:26:51] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3146647 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1047.eqiad.wmnet', 'analytics1048.eq... 
[14:30:21] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3146654 (10Marostegui) Hi! How's the process to decommission db1047 going? [14:31:19] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM https://puppet-compiler.wmflabs.org/5997/ms-fe2005.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/345656 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:32:18] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 06DC-Ops, 13Patch-For-Review: Decom/Reclaim analytics1027 - https://phabricator.wikimedia.org/T161597#3146682 (10elukey) [14:33:01] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 06DC-Ops, 13Patch-For-Review: Decom/Reclaim analytics1027 - https://phabricator.wikimedia.org/T161597#3136552 (10elukey) a:05elukey>03Cmjohnson [14:33:55] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 06DC-Ops, 13Patch-For-Review: Decom/Reclaim analytics1027 - https://phabricator.wikimedia.org/T161597#3136552 (10elukey) Chris the host is atm set as role::spare, but whenever you are ready I can cleanup the rest of the puppet entries (DHCP, etc..). Let me k... [14:34:19] _joe_ Oh, sorry just got back. Was out all day. [14:34:37] <_joe_> paladox: I am sorry, I pinged you in error :P [14:34:52] Yep, just read :) [14:34:58] <_joe_> eheh :) [14:39:54] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161836#3146717 (10Wieralee) The book https://commons.wikimedia.org/wiki/File:PL_J%C3%B3zef_Ignacy_Kraszewski-Poezye_tom_2.djvu is transcribed at Polish Wikisource: n... 
[14:49:52] (03PS4) 10Volans: Swift: use discovery record for the imagescalers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/345656 (https://phabricator.wikimedia.org/T160178) [14:52:04] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3146723 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1047.eqiad.wmnet', 'analytics1048.eqiad.wmnet'] ``` and were **ALL** successful. [14:54:27] (03CR) 10Volans: [C: 032] Swift: use discovery record for the imagescalers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/345656 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:55:08] !log deploying the use of discovery URL to swift-proxy hosts in codfw T160178#3136906 [14:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:17] T160178: MediaWiki Datacenter Switchover automation - https://phabricator.wikimedia.org/T160178 [14:55:25] <_joe_> !log reducing ttl on the restbase-async discovery record, then flipping eqiad to active [14:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:26] !log oblivian@puppetmaster1001 conftool action : set/ttl=10; selector: dnsdisc=restbase-async [14:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:31] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=restbase-async,name=eqiad [14:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:00] !log oblivian@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=restbase-async,name=eqiad [15:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:13] !log oblivian@puppetmaster1001 conftool action : set/ttl=300; selector: dnsdisc=restbase-async [15:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:39] 
(03PS3) 10Jcrespo: mariadb: Deprecate precise repository usage [puppet] - 10https://gerrit.wikimedia.org/r/345841 [15:01:44] (03CR) 10Gehel: [C: 031] "LGTM - if we have tests, we should run them!" [puppet] - 10https://gerrit.wikimedia.org/r/345849 (owner: 10Hashar) [15:02:43] !log volans@puppetmaster1001 conftool action : set/pooled=no; selector: dc=codfw,name=ms-fe2005.codfw.wmnet [15:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:56] (03CR) 10Gehel: [C: 031] "LGTM - this template will be removed once upgrade of logstash is completed, but it does no hurt to fix template anyway..." [puppet] - 10https://gerrit.wikimedia.org/r/345845 (owner: 10Alexandros Kosiaris) [15:05:22] (03Abandoned) 10Muehlenhoff: Remove a number of obsolete conditionals from mediawiki classes [puppet] - 10https://gerrit.wikimedia.org/r/345502 (owner: 10Muehlenhoff) [15:11:20] (03CR) 10Jcrespo: [C: 032] mariadb: Deprecate precise repository usage [puppet] - 10https://gerrit.wikimedia.org/r/345841 (owner: 10Jcrespo) [15:14:26] (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Remove db1057 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345856 (https://phabricator.wikimedia.org/T160435) [15:14:43] (03CR) 10Marostegui: [C: 04-1] "Not to be deployed until Monday" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345856 (https://phabricator.wikimedia.org/T160435) (owner: 10Marostegui) [15:16:15] (03CR) 10Marostegui: [C: 04-1] "Not to be deployed until Monday" [puppet] - 10https://gerrit.wikimedia.org/r/345545 (https://phabricator.wikimedia.org/T160435) (owner: 10Marostegui) [15:17:08] !log volans@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=imagescaler-rw,name=eqiad [15:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:03] 06Operations, 06Labs: Investigate ceasing new Trusty instance creation in Labs - https://phabricator.wikimedia.org/T161899#3146739 (10chasemp) [15:21:43] 06Operations, 
06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3146752 (10Aklapper) [15:22:01] !log volans@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=imagescaler-rw,name=eqiad [15:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:44] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3144779 (10Superzerocool) Hi, I'll add another one: https://commons.wikimedia.org/wiki/File:MyanmarChin.png https://upload.wik... [15:26:46] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#3146763 (10jcrespo) [15:26:50] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#3146758 (10jcrespo) 05Open>03Resolved a:03jcrespo This is done. 
Some small follups (not related to jessie) at: T157359 [15:29:47] (03CR) 10Jcrespo: [C: 031] db-codfw,db-eqiad.php: Remove db1057 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345856 (https://phabricator.wikimedia.org/T160435) (owner: 10Marostegui) [15:37:08] (03CR) 10Muehlenhoff: [C: 031] admin: add pmiazga to researchers group [puppet] - 10https://gerrit.wikimedia.org/r/345469 (https://phabricator.wikimedia.org/T161658) (owner: 10Dzahn) [15:38:49] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1899 bytes in 0.155 second response time [15:43:49] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1880 bytes in 0.137 second response time [15:44:13] !log volans@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,name=ms-fe2005.codfw.wmnet [15:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:19] !log mobrovac@tin Started deploy [trending-edits/deploy@26b5eb4]: Config change: lower min_edits to 15 T160127 [15:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:26] T160127: Re-evaluate purging strategies - https://phabricator.wikimedia.org/T160127 [15:52:05] 06Operations, 10ops-ulsfo, 10fundraising-tech-ops: upgrade backup4001 hard disk array - https://phabricator.wikimedia.org/T157473#3146836 (10Jgreen) 05Open>03Resolved resolving this task because the hardware is full of fail, we're procuring a new system for codfw instead [15:52:14] (03CR) 10Nuria: "@jcrespo: We cannot delete all tables after 90 days per agreement with our users so if this scrip was to delete data has to do so selectiv" [puppet] - 10https://gerrit.wikimedia.org/r/345646 (https://phabricator.wikimedia.org/T124307) (owner: 10Ottomata) [15:52:58] !log volans@puppetmaster1001 conftool action : set/pooled=no; selector: dc=codfw,name=ms-fe2006.codfw.wmnet [15:53:03] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:41] (03CR) 10Faidon Liambotis: [C: 032] Install jessie systems with Linux 4.9 by default [puppet] - 10https://gerrit.wikimedia.org/r/345314 (https://phabricator.wikimedia.org/T154934) (owner: 10Muehlenhoff) [15:55:19] !log volans@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,name=ms-fe2006.codfw.wmnet [15:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:49] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:55:56] !log mobrovac@tin Finished deploy [trending-edits/deploy@26b5eb4]: Config change: lower min_edits to 15 T160127 (duration: 06m 37s) [15:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:03] T160127: Re-evaluate purging strategies - https://phabricator.wikimedia.org/T160127 [15:57:15] !log volans@puppetmaster1001 conftool action : set/pooled=no; selector: dc=codfw,name=ms-fe2007.codfw.wmnet [15:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:12] !log volans@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,name=ms-fe2007.codfw.wmnet [15:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:35] 06Operations, 10ops-ulsfo: decommission backup4001 - https://phabricator.wikimedia.org/T161904#3146871 (10Jgreen) [16:00:20] !log volans@puppetmaster1001 conftool action : set/pooled=no; selector: dc=codfw,name=ms-fe2008.codfw.wmnet [16:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:46] !log volans@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,name=ms-fe2008.codfw.wmnet [16:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:29] 06Operations, 10Traffic: Removing support for DES-CBC3-SHA TLS cipher 
(drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#3146894 (10BBlack) We've been stalling on this a bit too long now. I'd like to start kicking off this process and getting in touch with Community as well. I've kinda bac... [16:06:02] (03PS1) 10Volans: Swift-proxy: use discovery everywhere for rewrites [puppet] - 10https://gerrit.wikimedia.org/r/345860 (https://phabricator.wikimedia.org/T160178) [16:14:27] (03PS1) 10Gehel: osm - remove dead code [puppet] - 10https://gerrit.wikimedia.org/r/345861 [16:17:25] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3146925 (10jcrespo) Ok, now puppet works, but either it puppet needs more work or it fails silently-this needs more researech. Replication is not wo... [16:24:49] RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:37:29] PROBLEM - puppet last run on maps1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:39:29] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. 
djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3147007 (10Reedy) [16:41:23] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 13Patch-For-Review: rack and cable frdev1001 - https://phabricator.wikimedia.org/T159887#3147024 (10faidon) [16:41:25] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and cable frdb1002 - https://phabricator.wikimedia.org/T159886#3147025 (10faidon) [16:47:41] (03PS2) 10Dzahn: admin: add pmiazga to researchers group [puppet] - 10https://gerrit.wikimedia.org/r/345469 (https://phabricator.wikimedia.org/T161658) [16:49:03] (03CR) 10Dzahn: [C: 032] admin: add pmiazga to researchers group [puppet] - 10https://gerrit.wikimedia.org/r/345469 (https://phabricator.wikimedia.org/T161658) (owner: 10Dzahn) [16:52:52] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3147062 (10Aklapper) Frpm T161910: https://commons.wikimedia.org/wiki/File:Vladimir_Frolochkin.JPG [16:56:29] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:57:29] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:58:19] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [17:00:29] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:03:03] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Production shell access (request for researchers for pmiazga) - https://phabricator.wikimedia.org/T161658#3147078 (10Dzahn) Hi @pmiazga i went ahead and added you to the "researchers" group. I ran puppet on netmon1001 and i saw it created your user. So... 
[17:03:27] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Production shell access (request for researchers for pmiazga) - https://phabricator.wikimedia.org/T161658#3147079 (10Dzahn) 05Open>03Resolved [17:05:01] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3147080 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1049.eqiad.wmnet', 'analytics1050.eq... [17:06:06] 06Operations, 10Analytics-EventLogging, 06Analytics-Kanban, 10DBA, 13Patch-For-Review: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#3147093 (10Ottomata) ping @Marostegui, in case you didn't see it: https://gerrit.wikimedia.org/r/345646 [17:06:29] RECOVERY - puppet last run on maps1001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [17:07:41] (03PS1) 10Gehel: osm - trying to fix tests [puppet] - 10https://gerrit.wikimedia.org/r/345866 [17:08:34] (03CR) 10Ottomata: "Ja, let's worry about auto-purging later. This script is just replication improvements. If we did decide to script auto purging, we sho" [puppet] - 10https://gerrit.wikimedia.org/r/345646 (https://phabricator.wikimedia.org/T124307) (owner: 10Ottomata) [17:08:51] (03CR) 10Ottomata: "jcrespo, any other comments on this change?" [puppet] - 10https://gerrit.wikimedia.org/r/345646 (https://phabricator.wikimedia.org/T124307) (owner: 10Ottomata) [17:11:11] (03CR) 10Gehel: [C: 04-1] "There are still errors which I don't really understand yet. If anyone has an idea..." 
[puppet] - 10https://gerrit.wikimedia.org/r/345866 (owner: 10Gehel) [17:12:19] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.004 second response time [17:14:29] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:37] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to hive for joewalsh - https://phabricator.wikimedia.org/T161663#3147110 (10Dzahn) 05Open>03stalled [17:15:19] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 5.238 second response time [17:16:09] _joe_: ^ there it is again? [17:17:29] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:31] yea, load around 20 and lots of ffmpeg2theora but it's running and moving [17:18:44] oohh. [17:18:45] mw1168 mcelog: Processor 13 heated above trip temperature. Throttling enabled. [17:18:59] "Please check your system cooling. Performance will be impacted" [17:19:19] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:19:31] but also Processor 10 below trip temperature. Throttling disabled [17:20:04] (03CR) 10Madhuvishy: nfs-mounts: per cluster definitions for mounts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/345631 (https://phabricator.wikimedia.org/T158883) (owner: 10Rush) [17:26:39] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:28:42] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3147156 (10Aklapper) From T161916: https://commons.wikimedia.org/wiki/File:Vladimir_Frolochkin.JPG [17:28:59] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. 
djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3147161 (10Aklapper) [17:29:05] 06Operations: videoscalers - high load / overheating - https://phabricator.wikimedia.org/T161918#3147165 (10Dzahn) [17:29:13] made a ticket for that https://phabricator.wikimedia.org/T161918 [17:29:22] for the mw1168 that is [17:29:52] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3147181 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1049.eqiad.wmnet', 'analytics1050.eqiad.wmnet'] ``` and were **ALL** successful. [17:31:00] 06Operations: videoscalers (mw1168) - high load / overheating - https://phabricator.wikimedia.org/T161918#3147183 (10Dzahn) [17:32:22] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 2.266 second response time [17:34:19] (03Abandoned) 10Yuvipanda: devpi: Add module + role [puppet] - 10https://gerrit.wikimedia.org/r/282102 (owner: 10Yuvipanda) [17:34:22] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:34:34] (03Abandoned) 10Yuvipanda: paws_internal: Install statistics related packages [puppet] - 10https://gerrit.wikimedia.org/r/319650 (https://phabricator.wikimedia.org/T149543) (owner: 10Yuvipanda) [17:34:38] (03Abandoned) 10Yuvipanda: toollabs: Move exec_environ package list to hiera [puppet] - 10https://gerrit.wikimedia.org/r/324699 (https://phabricator.wikimedia.org/T152089) (owner: 10Yuvipanda) [17:34:42] (03Abandoned) 10Yuvipanda: [WIP] Add a dump of all packages from exec_environ [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/324709 (owner: 10Yuvipanda) [17:36:18] (03CR) 10Dzahn: [C: 032] ci: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345557 (owner: 10Faidon Liambotis) [17:37:23] (03Abandoned) 10Yuvipanda: labspuppetbackend: Use sqlalchemy for connection pooling [puppet] - 
10https://gerrit.wikimedia.org/r/296830 (https://phabricator.wikimedia.org/T133412) (owner: 10Yuvipanda) [17:37:39] (03Abandoned) 10Yuvipanda: notebook: Provision researcher acceounts on notebook servers [puppet] - 10https://gerrit.wikimedia.org/r/315329 (owner: 10Yuvipanda) [17:37:43] (03PS3) 10Dzahn: ci: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345557 (owner: 10Faidon Liambotis) [17:39:51] (03PS4) 10Dzahn: ci: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345557 (owner: 10Faidon Liambotis) [17:40:08] (03PS3) 10Yuvipanda: k8s: Use packages everywhere [puppet] - 10https://gerrit.wikimedia.org/r/345441 [17:43:13] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [17:43:42] if k8s tools check fails that's me [17:44:22] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 4.275 second response time [17:46:23] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:23] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:16] mutante hmm it's doing it on 1169 too [17:47:30] paladox: yea, they are both videoscaler.. same effect [17:47:38] yep [17:47:48] so it's not a random hardware failure on one of them [17:47:54] ok [17:47:58] both are getting warm [17:48:09] oh [17:48:12] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [17:49:59] 06Operations: videoscalers (mw1168, mw1169) - high load / overheating - https://phabricator.wikimedia.org/T161918#3147230 (10Dzahn) [17:50:31] 06Operations: videoscalers (mw1168, mw1169) - high load / overheating - https://phabricator.wikimedia.org/T161918#3147165 (10Dzahn) mw1169 is behaving the same way. It also has the throttling log entries due to temperature. Not randomly just on one of them. 
[17:51:56] (03PS1) 10Volans: Fix typo in discovery name [switchdc] - 10https://gerrit.wikimedia.org/r/345868 (https://phabricator.wikimedia.org/T160178) [17:52:09] (03PS4) 10Yuvipanda: k8s: Use packages everywhere [puppet] - 10https://gerrit.wikimedia.org/r/345441 [17:52:12] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [17:54:23] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:42] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:56:12] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.003 second response time [17:56:12] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.003 second response time [17:58:21] (03PS5) 10Yuvipanda: k8s: Use packages everywhere [puppet] - 10https://gerrit.wikimedia.org/r/345441 [17:58:31] (03CR) 10Yuvipanda: [V: 032 C: 032] k8s: Use packages everywhere [puppet] - 10https://gerrit.wikimedia.org/r/345441 (owner: 10Yuvipanda) [18:12:14] (03PS7) 10Rush: nfs-mounts: per cluster definitions for mounts [puppet] - 10https://gerrit.wikimedia.org/r/345631 (https://phabricator.wikimedia.org/T158883) [18:12:25] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:12:25] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:14:15] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.004 second response time [18:14:25] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 9.448 second response time [18:14:32] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. 
djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3147264 (10Paladox) [18:14:47] !log ruthenium mounting /dev/mapper/ruthenium--vg-tank which wasnt used at all.. bam.. over 477GB of free space [18:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:06] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3144779 (10Paladox) Can we set the status back to unbreak? Too many duplicate tasks are being created. [18:17:16] (03PS8) 10Rush: nfs-mounts: per cluster definitions for mounts [puppet] - 10https://gerrit.wikimedia.org/r/345631 (https://phabricator.wikimedia.org/T158883) [18:17:26] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:17:26] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:19:15] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.009 second response time [18:19:15] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.009 second response time [18:27:15] 06Operations: logrotate for ruthenium - https://phabricator.wikimedia.org/T161920#3147276 (10Dzahn) [18:27:19] !log ruthenium mounting /dev/mapper/ruthenium--vg-tank into /srv/visualdiff/pngs | deleted "mysql" and "dumps" data that was on previously unmounted partition , subbu checked that wasn't needed anymore, we still need logrotate (T161920) [18:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:27] T161920: logrotate for ruthenium - https://phabricator.wikimedia.org/T161920 [18:34:11] 06Operations: logrotate for ruthenium - https://phabricator.wikimedia.org/T161920#3147321 (10Dzahn) correction: logrotate for /var/log/parsoid ! 
desired rules: rotate after log size hits 500mb; gzip after rotation; delete all files older than 24 hours. and since we have more space: alright,... [18:37:56] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:54:25] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:03:50] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3147369 (10Ottomata) > Messages like the above one were logged frequently in the Yarn NodeManager logs. A daemon restart fixed the issue, but we didn't find any go... [19:05:55] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [19:16:16] !log ruthenium also deleting ancient "htmldumper" data, gwicke confirmed it's not needed anymore [19:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:05] (03PS3) 10Andrew Bogott: Keystone 2fa: Use the wikitech API rather than checking the db directly. [puppet] - 10https://gerrit.wikimedia.org/r/345231 [19:23:25] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [19:25:06] 06Operations: videoscalers (mw1168, mw1169) - high load / overheating - https://phabricator.wikimedia.org/T161918#3147165 (10elukey) We tried to lower down transcode_prioritized to 10 with https://gerrit.wikimedia.org/r/#/c/345829/ but it probably didn't help much. The issue appears only on mw116[89] and not on... 
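Note on the logrotate rules requested for /var/log/parsoid in T161920 (rotate once a log reaches 500mb, gzip after rotation, delete files older than 24 hours): the decision logic can be sketched as below. This is only an illustration of the stated rules, not the logrotate configuration that was deployed. The names `LogFile` and `plan_action` are hypothetical, and reading "500mb" as 500 MiB is an assumption.

```python
from dataclasses import dataclass

# Thresholds taken from the rules quoted in the ticket comment.
# Assumption: "500mb" is read as 500 MiB.
MAX_SIZE_BYTES = 500 * 1024 * 1024
MAX_AGE_SECONDS = 24 * 60 * 60


@dataclass
class LogFile:
    """Hypothetical description of one file in the log directory."""
    name: str
    size_bytes: int
    age_seconds: float
    rotated: bool  # True for files that were already rotated out


def plan_action(log: LogFile) -> str:
    """Return 'rotate', 'gzip', 'delete' or 'keep' for a single file."""
    if log.rotated:
        # Old rotated files are removed first, whether compressed or not.
        if log.age_seconds > MAX_AGE_SECONDS:
            return "delete"
        # Rotated but not yet compressed: gzip after rotation.
        if not log.name.endswith(".gz"):
            return "gzip"
        return "keep"
    # The live log is rotated only once it hits the size limit.
    if log.size_bytes >= MAX_SIZE_BYTES:
        return "rotate"
    return "keep"
```

In logrotate itself these rules would roughly correspond to the `size`, `compress` and `maxage` directives. As far as this log shows, the parsoid-testing change that was proposed later rotated on a schedule and did not check a maximum size.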
[19:31:25] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:31:27] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:15] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.008 second response time [19:34:15] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.006 second response time [19:35:00] brion: o/ - any idea what is causing --^ ? [19:36:27] elukey: Overheating and throttling? [19:36:33] Because they're busy [19:36:39] Because people are queuing many jobs [19:38:11] nono I don't think it is related to thermal issues, we have been seeing them in many nodes [19:42:59] !log reedy@tin Synchronized php-1.29.0-wmf.18/extensions/Echo: Stop badge hacks from messing up the entire page on IE 11 on MonoBook T161689 (duration: 00m 50s) [19:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:05] T161689: VisualEditor 1.23 is 404 - https://phabricator.wikimedia.org/T161689 [19:43:30] damn it, swapped numbers [19:44:02] !log Stop badge hacks from messing up the entire page on IE 11 on MonoBook T161869 [19:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:08] T161869: Notifications overlap large area of content in Monobook skin with Internet Explorer - https://phabricator.wikimedia.org/T161869 [19:45:47] elukey: load on videoscalers is lower now though. it's like 11 instead of 20 [19:45:51] thanks Reedy. looks fixed [19:46:13] mutante: o/ [19:46:29] elukey: so should we paste "hhvmadmin check-health" output when we see it ? 
[19:46:38] hhvmadm [19:47:33] mutante: I think we can check also https://grafana-admin.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=videoscaler&var-instance=mw1168 (and mw1169) [19:47:44] sometimes it gets to 20, and queue grows [19:48:19] I am tempted to lower down transcodes again on mw116[89] [19:48:55] from 8 to 6? [19:49:21] also transcode_prioritized from 10 to 8 [19:49:31] eh, 10 to 8 [19:49:35] right [19:50:16] but why all of a sudden? Probably surge in jobs submitted, but mw1260/59 are fine [19:50:16] (03PS1) 10GWicke: WIP: Add cache-control option that allows for short term client caching [puppet] - 10https://gerrit.wikimedia.org/r/345877 (https://phabricator.wikimedia.org/T161284) [19:51:26] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:51:26] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:52:16] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [19:52:16] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.006 second response time [19:56:04] yea.. the ones that have way less RAM are the ones that are not affected [19:56:44] the number of running processes looks similar [20:00:37] so either a bit more threads for hhvm or less transcodes [20:01:57] (03PS4) 10Yuvipanda: tools: Automount useful credentials onto containers [puppet] - 10https://gerrit.wikimedia.org/r/327235 [20:04:52] (03CR) 10Yuvipanda: "Boo, even with new package it fails because the first character is now "/", hence it gets translated to "-". Meh." [puppet] - 10https://gerrit.wikimedia.org/r/327235 (owner: 10Yuvipanda) [20:05:38] elukey: i wonder how large the backlog of videos to be scaled is. 
i guess less transcodes [20:06:25] so if https://commons.wikimedia.org/wiki/Special:TimedMediaHandler is reliable a lot :) [20:07:14] or from terbium [20:07:27] webVideoTranscode: 6414 queued; 152 claimed (106 active, 46 abandoned); 0 delayed [20:07:30] webVideoTranscodePrioritized: 0 queued; 62 claimed (3 active, 59 abandoned); 0 delayed [20:07:36] (03PS1) 10Yuvipanda: Fix HostPathAutomounter completely failing [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/345879 [20:08:15] maybe we should add just 2 more videoscalers [20:08:25] if we can take them from regular appserver pool [20:08:32] or have the hardware [20:09:11] that would be good, but in theory videoscalers should keep processing [20:09:30] without reaching max load and start queueing in hhvm [20:09:44] I mean we should only think about adding scalers to speed up the queue consumption [20:09:52] yea, why do we never see the issue with mw1259 [20:10:31] it has less RAM and uses swap , yet the beefy one does this [20:11:34] 06Operations, 10hardware-requests, 10netops: MX480 & QFX5100 for esams (April 2017) - https://phabricator.wikimedia.org/T161930#3147522 (10faidon) [20:11:42] ah, well, they have transcode_prioritized: 4 only [20:11:56] from role/common/ [20:12:39] the total amount of transcoders is 10 (out of 15 hhvm threads, the 70%) [20:15:12] (03PS1) 10Elukey: Lower down total transcodes for mw116[89] [puppet] - 10https://gerrit.wikimedia.org/r/345881 [20:15:45] mutante: --^ [20:16:06] (03PS2) 10Dzahn: Lower down total transcodes for mw116[89] [puppet] - 10https://gerrit.wikimedia.org/r/345881 (https://phabricator.wikimedia.org/T161918) (owner: 10Elukey) [20:16:08] so the idea is that we lower down prioritized since there seems to be no queueing in there (at least from terbium) [20:16:11] ahahhaah [20:16:12] (03CR) 10Dzahn: [C: 031] Lower down total transcodes for mw116[89] [puppet] - 10https://gerrit.wikimedia.org/r/345881 (https://phabricator.wikimedia.org/T161918) (owner: 10Elukey) [20:16:40] 
sorry for a moment I thought that you filed the same code review as I did :P [20:16:53] heh, no, i just wanted to link to the ticket :) [20:17:01] all i edited was the commit message [20:17:16] thanks! [20:17:26] Probably Giuseppe will kill me for this merge [20:17:41] but let's try it :D [20:17:52] heh, ok :o [20:17:57] (03CR) 10Elukey: [C: 032] Lower down total transcodes for mw116[89] [puppet] - 10https://gerrit.wikimedia.org/r/345881 (https://phabricator.wikimedia.org/T161918) (owner: 10Elukey) [20:18:50] !log stopping jobrunners on mw116[89] and restarting hhvm after https://gerrit.wikimedia.org/r/345881 [20:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:25] done! [20:25:43] elukey: thanks [20:25:52] hope that it will make some difference [20:26:14] theoretically now more or less we have the same proportion between hhvm threads and transcodes on the scalers [20:26:33] ok, that seems reasoanble yea [20:27:15] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:48:31] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Production shell access (request for researchers for pmiazga) - https://phabricator.wikimedia.org/T161658#3147627 (10pmiazga) @Dzahn - everything works now, thank you. [20:52:34] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Production shell access (request for researchers for pmiazga) - https://phabricator.wikimedia.org/T161658#3147628 (10Dzahn) @pmiazga Great! Thanks for confirming. Have a nice weekend. 
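Note on the videoscaler discussion above: the two queue status lines pasted from terbium, and the "10 out of 15 hhvm threads" figure, can be checked with a small sketch. This is an illustration only. The line format is inferred from the two pasted lines, and `QueueStatus`, `parse_queue_line` and `transcode_share` are hypothetical names, not part of MediaWiki or any Wikimedia tooling.

```python
import re
from typing import NamedTuple


class QueueStatus(NamedTuple):
    """Counters from one job queue status line."""
    name: str
    queued: int
    claimed: int
    active: int
    abandoned: int
    delayed: int


# Format inferred from the pasted lines, for example:
# "webVideoTranscode: 6414 queued; 152 claimed (106 active, 46 abandoned); 0 delayed"
LINE_RE = re.compile(
    r"^(?P<name>\w+): (?P<queued>\d+) queued; (?P<claimed>\d+) claimed "
    r"\((?P<active>\d+) active, (?P<abandoned>\d+) abandoned\); "
    r"(?P<delayed>\d+) delayed$"
)


def parse_queue_line(line: str) -> QueueStatus:
    """Parse one status line; raise ValueError if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised queue status line: {line!r}")
    counters = ("queued", "claimed", "active", "abandoned", "delayed")
    return QueueStatus(match["name"], *(int(match[key]) for key in counters))


def transcode_share(transcodes: int, hhvm_threads: int) -> float:
    """Fraction of HHVM threads that transcodes may occupy."""
    if hhvm_threads <= 0:
        raise ValueError("hhvm_threads must be positive")
    return transcodes / hhvm_threads
```

With the numbers from the chat, `transcode_share(10, 15)` is about 0.67, which matches the "70%" quoted above only after rounding. In both pasted lines the active and abandoned counts add up to the claimed count.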
[20:56:15] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [20:56:27] (03PS1) 10Andrew Bogott: labs-ip-alias-dump.py: Use keystone api v3 [puppet] - 10https://gerrit.wikimedia.org/r/345886 (https://phabricator.wikimedia.org/T158650) [20:56:29] (03PS1) 10Andrew Bogott: labs-ip-alias-dump.py: Use novaobserver creds rather than novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/345887 (https://phabricator.wikimedia.org/T158650) [20:58:42] (03CR) 10Andrew Bogott: [C: 032] labs-ip-alias-dump.py: Use keystone api v3 [puppet] - 10https://gerrit.wikimedia.org/r/345886 (https://phabricator.wikimedia.org/T158650) (owner: 10Andrew Bogott) [20:58:44] (03CR) 10jerkins-bot: [V: 04-1] labs-ip-alias-dump.py: Use novaobserver creds rather than novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/345887 (https://phabricator.wikimedia.org/T158650) (owner: 10Andrew Bogott) [21:03:42] (03PS1) 10Dzahn: parsoid-testing: add logrotate [puppet] - 10https://gerrit.wikimedia.org/r/345888 (https://phabricator.wikimedia.org/T161920) [21:04:29] (03PS2) 10Andrew Bogott: labs-ip-alias-dump.py: Use novaobserver creds rather than novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/345887 (https://phabricator.wikimedia.org/T158650) [21:05:23] (03CR) 10Subramanya Sastry: parsoid-testing: add logrotate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/345888 (https://phabricator.wikimedia.org/T161920) (owner: 10Dzahn) [21:06:18] (03CR) 10Andrew Bogott: [C: 032] labs-ip-alias-dump.py: Use novaobserver creds rather than novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/345887 (https://phabricator.wikimedia.org/T158650) (owner: 10Andrew Bogott) [21:08:18] (03PS5) 10Yuvipanda: tools: Automount useful credentials onto containers [puppet] - 10https://gerrit.wikimedia.org/r/327235 [21:08:20] (03CR) 10Yuvipanda: [C: 032] tools: Automount useful credentials onto containers [puppet] - 10https://gerrit.wikimedia.org/r/327235 
(owner: 10Yuvipanda) [21:08:32] (03CR) 10Yuvipanda: [V: 032 C: 032] "Fixed with https://gerrit.wikimedia.org/r/#/c/345879/" [puppet] - 10https://gerrit.wikimedia.org/r/327235 (owner: 10Yuvipanda) [21:08:45] (03CR) 10Yuvipanda: [C: 032] Fix HostPathAutomounter completely failing [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/345879 (owner: 10Yuvipanda) [21:09:36] I think that was my last merge into ops puppet for a while [21:12:22] yuvipanda: :/ one sad and one happy eye? [21:12:32] :) [21:12:35] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [21:17:48] (03PS1) 10Andrew Bogott: Exclude 'admin' project from labs-ip-alias-dump.py [puppet] - 10https://gerrit.wikimedia.org/r/345890 [21:19:48] (03CR) 10Andrew Bogott: [C: 032] Exclude 'admin' project from labs-ip-alias-dump.py [puppet] - 10https://gerrit.wikimedia.org/r/345890 (owner: 10Andrew Bogott) [21:21:29] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to hive for joewalsh - https://phabricator.wikimedia.org/T161663#3147690 (10Fjalapeno) Approved [21:21:44] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to hive for joewalsh - https://phabricator.wikimedia.org/T161663#3147691 (10Fjalapeno) a:05Fjalapeno>03Dzahn [21:22:35] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [21:28:17] (03PS1) 10Yuvipanda: Get rid of nonsensical default [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/345909 [21:28:31] mutante: I was wrong :P something broke in the meantime [21:38:35] (03PS1) 10Yuvipanda: admin: Remove my old key [puppet] - 10https://gerrit.wikimedia.org/r/345910 [21:38:55] (03PS2) 10Yuvipanda: admin: Remove my old key [puppet] - 10https://gerrit.wikimedia.org/r/345910 [21:39:03] (03CR) 10Yuvipanda: [V: 032 C: 032] admin: 
Remove my old key [puppet] - 10https://gerrit.wikimedia.org/r/345910 (owner: 10Yuvipanda) [22:09:05] PROBLEM - HHVM jobrunner on mw1260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:09:55] RECOVERY - HHVM jobrunner on mw1260 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [22:12:05] PROBLEM - HHVM jobrunner on mw1260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:12:55] RECOVERY - HHVM jobrunner on mw1260 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [22:13:13] yuvipanda: hehe, happens [22:15:45] 06Operations, 10RESTBase, 10ArchCom-RfC (ArchCom-Approved), 07Availability, and 5 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#3147738 (10Krinkle) [22:18:54] (03CR) 10Dzahn: parsoid-testing: add logrotate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/345888 (https://phabricator.wikimedia.org/T161920) (owner: 10Dzahn) [22:19:46] (03PS2) 10Dzahn: parsoid-testing: add logrotate [puppet] - 10https://gerrit.wikimedia.org/r/345888 (https://phabricator.wikimedia.org/T161920) [22:21:13] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/5999/ruthenium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/345888 (https://phabricator.wikimedia.org/T161920) (owner: 10Dzahn) [22:22:25] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:23:25] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 2.456 second response time [22:26:05] (03CR) 10Yuvipanda: [C: 032] Get rid of nonsensical default [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/345909 (owner: 10Yuvipanda) [22:34:44] (03CR) 10Dzahn: [C: 031] "so this will fulfill 2 of the requirements "after a week" and "gzip", just not looking at a max size. 
this is a pretty default combination" [puppet] - 10https://gerrit.wikimedia.org/r/345888 (https://phabricator.wikimedia.org/T161920) (owner: 10Dzahn) [22:36:56] mutante: that seems to be my last patch for real now :) [22:37:02] am out now! Have fun everyone! [22:37:11] yuvipanda: have a good .. [22:37:23] there he goes [22:40:19] 06Operations, 13Patch-For-Review: videoscalers (mw1168, mw1169) - high load / overheating - https://phabricator.wikimedia.org/T161918#3147165 (10brion) Note there are probably a lot of jobs in the non-prioritized queue still backed up last I looked; Note I also have a fill-in batch job on terbium that's thrott... [23:02:25] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:03:02] 13:12 < mutante> yea, why do we never see the issue with mw1259 [23:03:03] 16:05 < icinga-wm> PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:03:05] :p [23:03:14] elukey: ^ [23:03:15] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [23:03:53] (03CR) 10Dzahn: [C: 032] parsoid-testing: add logrotate [puppet] - 10https://gerrit.wikimedia.org/r/345888 (https://phabricator.wikimedia.org/T161920) (owner: 10Dzahn) [23:04:08] (03PS3) 10Dzahn: parsoid-testing: add logrotate [puppet] - 10https://gerrit.wikimedia.org/r/345888 (https://phabricator.wikimedia.org/T161920) [23:05:44] 06Operations, 13Patch-For-Review: videoscalers (mw1168, mw1169) - high load / overheating - https://phabricator.wikimedia.org/T161918#3147765 (10Dzahn) earlier we thought that mw1259 (videoscaler but other hardware) was not affected, but... 13:12 < mutante> yea, why do we never see the issue with mw1259 .. 16... 
[23:06:21] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to hive for joewalsh - https://phabricator.wikimedia.org/T161663#3147766 (10Dzahn) 05stalled>03Open [23:07:24] (03PS2) 10Dzahn: admin: add joewalsh to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/345398 (https://phabricator.wikimedia.org/T161663) [23:11:35] PROBLEM - puppet last run on ms-be1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:11:56] !log ruthenium: logrotate --force /etc/logrotate.d/parsoid (note this is existing file "parsoid" not new file "parsoid_testing") (T161920) [23:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:06] T161920: logrotate for ruthenium - https://phabricator.wikimedia.org/T161920 [23:13:39] 06Operations, 13Patch-For-Review: logrotate for ruthenium - https://phabricator.wikimedia.org/T161920#3147773 (10Dzahn) p:05Triage>03Normal [23:18:51] 06Operations, 13Patch-For-Review: logrotate for ruthenium - https://phabricator.wikimedia.org/T161920#3147775 (10Dzahn) @ssastry Right after adding our new logrotate snippet for parsoid-testing on ruthenium, i noticed there is also an existing logrotate config for parsoid. /etc/logrotate.d/parsoid ``` # log... 
[23:19:27] (PS1) Dzahn: Revert "parsoid-testing: add logrotate" [puppet] - https://gerrit.wikimedia.org/r/345944
[23:20:25] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:20:35] (CR) Dzahn: [C: 2] "reasoning and approval on linked ticket" [puppet] - https://gerrit.wikimedia.org/r/345398 (https://phabricator.wikimedia.org/T161663) (owner: Dzahn)
[23:21:25] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 4.432 second response time
[23:21:46] (CR) Dzahn: [C: 2] "there is existing logrotate config for parsoid handling the same file(s)" [puppet] - https://gerrit.wikimedia.org/r/345944 (owner: Dzahn)
[23:23:18] (PS2) Dzahn: Revert "parsoid-testing: add logrotate" [puppet] - https://gerrit.wikimedia.org/r/345944
[23:24:24] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to hive for joewalsh - https://phabricator.wikimedia.org/T161663#3147794 (Dzahn) @Fjalapeno Thnk you! @JoeWalsh Merged and i ran puppet on `stat1002.eqiad.wmnet`, confirming your user has been added to the group requested: This sho...
[23:24:51] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to hive for joewalsh - https://phabricator.wikimedia.org/T161663#3147795 (Dzahn) Open>Resolved
[23:33:36] (CR) Dzahn: [C: 1] Gerrit: Add log4j.logger.org.apache.sshd.common.keyprovider.FileKeyPairProvider=INFO to log4j [puppet] - https://gerrit.wikimedia.org/r/345583 (owner: Paladox)
[23:36:06] (CR) Dzahn: [C: 2] dataset: Capitalizing the words dumps and images. [puppet] - https://gerrit.wikimedia.org/r/345795 (owner: Felipe L. Ewald)
[23:36:11] (PS4) Dzahn: dataset: Capitalizing the words dumps and images. [puppet] - https://gerrit.wikimedia.org/r/345795 (owner: Felipe L.
Ewald)
[23:40:35] RECOVERY - puppet last run on ms-be1038 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[23:48:34] (PS3) Dzahn: puppetmaster: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345554 (owner: Faidon Liambotis)
[23:51:27] Operations, Ops-Access-Requests: Requesting access to hive for joewalsh - https://phabricator.wikimedia.org/T161663#3147797 (Dzahn)
[23:51:31] (PS4) Dzahn: puppetmaster: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345554 (owner: Faidon Liambotis)
[23:55:18] (CR) Dzahn: [C: 2] puppetmaster: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345554 (owner: Faidon Liambotis)
[23:56:06] (PS3) Dzahn: elasticsearch: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345552 (owner: Faidon Liambotis)
[23:56:08] (PS4) Dzahn: elasticsearch: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345552 (owner: Faidon Liambotis)
[23:56:42] (CR) Dzahn: [C: 2] elasticsearch: remove precise support [puppet] - https://gerrit.wikimedia.org/r/345552 (owner: Faidon Liambotis)