[00:00:11] PROBLEM - check_puppetrun on pay-lvs2001 is CRITICAL: CRITICAL: puppet fail
[00:00:11] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail
[00:00:11] PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:04:23] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:04:23] PROBLEM - check_puppetrun on payments2001 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:04:23] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:04:23] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:05:11] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail
[00:05:12] PROBLEM - check_puppetrun on pay-lvs2001 is CRITICAL: CRITICAL: puppet fail
[00:05:12] PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:09:22] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:09:23] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:09:23] PROBLEM - check_puppetrun on payments2001 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:09:23] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:10:11] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail
[00:10:11] PROBLEM - check_puppetrun on pay-lvs2001 is CRITICAL: CRITICAL: puppet fail
[00:10:11] PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:14:22] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:14:23] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:14:23] PROBLEM - check_puppetrun on payments2001 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:14:23] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:15:11] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail
[00:15:11] PROBLEM - check_puppetrun on pay-lvs2001 is CRITICAL: CRITICAL: puppet fail
[00:15:11] PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:19:23] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:19:23] PROBLEM - check_puppetrun on payments2001 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:19:23] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:19:23] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:20:11] PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:20:11] PROBLEM - check_puppetrun on pay-lvs2001 is CRITICAL: CRITICAL: puppet fail
[00:20:11] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail
[00:24:23] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:24:23] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:24:23] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:24:23] PROBLEM - check_puppetrun on payments2001 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:25:11] PROBLEM - check_puppetrun on pay-lvs2001 is CRITICAL: CRITICAL: puppet fail
[00:25:11] PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:25:11] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail
[00:29:22] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:29:23] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:29:23] PROBLEM - check_puppetrun on payments2001 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:29:23] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:30:11] PROBLEM - check_puppetrun on pay-lvs2001 is CRITICAL: CRITICAL: puppet fail
[00:30:11] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail
[00:30:11] PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:34:22] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:34:23] PROBLEM - check_puppetrun on payments2001 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:34:23] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:34:23] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:35:11] PROBLEM - check_puppetrun on pay-lvs2001 is CRITICAL: CRITICAL: puppet fail
[00:35:11] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail
[00:35:11] PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:39:22] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:39:22] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:39:23] PROBLEM - check_puppetrun on payments2001 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:39:23] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:40:11] PROBLEM - check_puppetrun on pay-lvs2001 is CRITICAL: CRITICAL: puppet fail
[00:40:11] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail
[00:40:12] PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:42:46] (PS1) Tim Landscheidt: WIP: ldap: Make ldaplist use paging for queries [puppet] - https://gerrit.wikimedia.org/r/295177 (https://phabricator.wikimedia.org/T122595)
[00:44:22] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:44:23] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:44:23] PROBLEM - check_puppetrun on payments2001 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:44:23] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:45:11] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail
[00:45:11] PROBLEM - check_puppetrun on pay-lvs2001 is CRITICAL: CRITICAL: puppet fail
[00:45:12] PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:49:23] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:49:24] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:49:24] PROBLEM - check_puppetrun on payments2001 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:49:24] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:50:11] PROBLEM - check_puppetrun on pay-lvs2001 is CRITICAL: CRITICAL: puppet fail
[00:50:11] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail
[00:50:11] PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:54:23] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:54:23] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:54:23] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:54:24] PROBLEM - check_puppetrun on payments2001 is CRITICAL: CRITICAL: Puppet has 20 failures
[00:55:11] PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[00:55:11] PROBLEM - check_puppetrun on pay-lvs2001 is CRITICAL: CRITICAL: puppet fail
[00:55:11] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail
[00:55:14] ACKNOWLEDGEMENT - check_puppetrun on pay-lvs2001 is CRITICAL: CRITICAL: puppet fail Jeff_Green pfw@codfw malfunction(?)
[00:55:14] ACKNOWLEDGEMENT - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail Jeff_Green pfw@codfw malfunction(?)
[00:57:34] ACKNOWLEDGEMENT - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) Jeff_Green pfw@codfw malfunction?
[00:57:34] ACKNOWLEDGEMENT - check_puppetrun on payments2001 is CRITICAL: CRITICAL: Puppet has 20 failures Jeff_Green pfw@codfw malfunction?
[00:57:35] ACKNOWLEDGEMENT - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 20 failures Jeff_Green pfw@codfw malfunction?
[00:57:36] ACKNOWLEDGEMENT - check_puppetrun on payments2003 is CRITICAL: CRITICAL: Puppet has 20 failures Jeff_Green pfw@codfw malfunction?
[00:58:53] ACKNOWLEDGEMENT - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=829 [critical =325] Jeff_Green pfw@codfw malfunction?
[00:58:53] ACKNOWLEDGEMENT - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) Jeff_Green pfw@codfw malfunction?
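The check_mysql output repeated above ("Slave IO: Connecting ... Seconds Behind Master: (null)") means the replication I/O thread cannot reach its master, so replication lag is NULL rather than a number. A minimal sketch of the state mapping such a check performs, assuming dict input and an illustrative threshold (the function name and threshold are mine, not the actual plugin's):

```python
# Hypothetical sketch of how a check_mysql-style Nagios probe maps
# SHOW SLAVE STATUS fields to a service state. "Slave IO: Connecting"
# means the I/O thread is not connected, so Seconds_Behind_Master is
# NULL (None here) and the check must go CRITICAL.

def classify_replication(status, warn_lag=600):
    """Return a Nagios state for a SHOW SLAVE STATUS row given as a dict."""
    io_running = status.get("Slave_IO_Running")
    sql_running = status.get("Slave_SQL_Running")
    lag = status.get("Seconds_Behind_Master")  # None while I/O thread is down
    if io_running != "Yes" or sql_running != "Yes" or lag is None:
        return "CRITICAL"
    if lag > warn_lag:  # illustrative threshold, not the real config
        return "WARNING"
    return "OK"

# The fdb2001/payments2001 alerts above correspond to this input:
print(classify_replication({
    "Slave_IO_Running": "Connecting",
    "Slave_SQL_Running": "Yes",
    "Seconds_Behind_Master": None,
}))  # CRITICAL
```

Note that a NULL lag is treated as CRITICAL rather than "zero lag": with the I/O thread disconnected, the replica cannot know how far behind it is.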
[01:06:23] (PS1) Dereckson: Enable NewUserMessage on pl.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/295178 (https://phabricator.wikimedia.org/T138169)
[02:23:18] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.6) (duration: 09m 54s)
[02:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:29:01] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Jun 20 02:29:01 UTC 2016 (duration 5m 44s)
[02:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:50:43] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:52:34] Hello! How can I contribute to the Ops Team?
[03:00:25] I'll check the logs later, if someone can answer my question please don't hesitate to :)
[03:12:39] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail
[03:17:18] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:20:20] elacheche_anis: Do you know Puppet?
[03:20:49] Most of the operations configuration is managed via Puppet in Git repositories. You could probably start there and poke around.
[03:23:00] I can learn Puppet, Debra :) I know some Ansible (actually I am a n00b in this :/ :( )
[03:24:07] OK, I'll take a look at the Git repo, thx. BTW, the Phabricator link in the wiki returns no results. That's a great thing for the team, as there are no open issues, but bad for me, as I couldn't find any task to start with x)
[03:24:37] elacheche_anis: Which wiki page?
[03:24:41] There are plenty of open tasks. :-)
[03:26:41] Debra: https://wikitech.wikimedia.org/wiki/Get_involved#How_to_Find_Tasks → the Phabricator link https://phabricator.wikimedia.org/maniphest/query/N8wwjZqJB3NB/#R
[03:28:25] elacheche_anis: There's https://phabricator.wikimedia.org/maniphest/query/HM820qFxglVZ/#R at least.
[03:29:15] That link on the wiki seems old.
[03:29:16] OK, the thing in the wiki uses the wrong tags?
[03:29:20] Probably, yeah.
[03:29:29] I didn't realize that wiki had a "Get involved" page.
[03:29:34] I can update it, or someone can?!
[03:29:40] Sure, be bold!
[03:29:41] s/can/should
[03:36:37] Great task list! I should explore it deeper. I'll keep that for tomorrow; I should get some sleep now (it's 4am here). Thanks for helping, Debra
[03:40:10] No problem.
[03:40:25] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:57:53] PROBLEM - puppet last run on db1063 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:12:51] PROBLEM - puppet last run on mw1103 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:23:42] RECOVERY - puppet last run on db1063 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:38:05] RECOVERY - puppet last run on mw1103 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:18:24] TimStarling: around?
[05:18:38] yes
[05:19:10] TimStarling: since you have admin on wikitech, can you block a few spammers
[05:19:23] i'm guessing their usernames should be suppressed as well
[05:19:44] ok
[05:29:04] Don't stewards have access to wikitech?
[05:29:10] Well they should, technically.
[05:30:29] I think it's a non-CA wiki
[05:30:55] you know it is tied to labs administration
[05:32:28] Bsadowski1: considering it's hosted on labs, and it's specifically requested that you don't use your normal password or keys on it, it's probably best that stewards don't
[05:38:10] <_joe_> yes, it's non-CA
[05:38:25] <_joe_> and it's not hosted on labs, FWIW
[05:44:07] (PS1) KartikMistry: foma: Initial Debian packaging [debs/contenttranslation/foma] - https://gerrit.wikimedia.org/r/295183 (https://phabricator.wikimedia.org/T120087)
[05:47:16] (PS1) Ori.livneh: Better cache headers for 'Powered by MediaWiki' badge [mediawiki-config] - https://gerrit.wikimedia.org/r/295184
[05:53:36] (CR) Ori.livneh: [C: 032] Better cache headers for 'Powered by MediaWiki' badge [mediawiki-config] - https://gerrit.wikimedia.org/r/295184 (owner: Ori.livneh)
[05:54:17] (Merged) jenkins-bot: Better cache headers for 'Powered by MediaWiki' badge [mediawiki-config] - https://gerrit.wikimedia.org/r/295184 (owner: Ori.livneh)
[05:56:02] !log ori@tin Synchronized static/images: Id5804a80: Better cache headers for 'Powered by MediaWiki' badge (1/2) (duration: 00m 33s)
[05:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:57:00] !log ori@tin Synchronized wmf-config/CommonSettings.php: Id5804a80: Better cache headers for 'Powered by MediaWiki' badge (2/2) (duration: 00m 35s)
[05:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:57:13] (PS1) KartikMistry: apertium-sme-nob: Initial Debian packaging [debs/contenttranslation/apertium-sme-nob] - https://gerrit.wikimedia.org/r/295185 (https://phabricator.wikimedia.org/T120087)
[05:59:49] Operations, ContentTranslation-Deployments, ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2392823 (KartikMistry)
[06:00:29] Operations, ContentTranslation-Deployments, ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#1506774 (KartikMistry)
[06:32:20] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:41] PROBLEM - puppet last run on db2062 is CRITICAL: CRITICAL: puppet fail
[06:32:50] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:03] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:40:03] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 624 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5245982 keys - replication_delay is 624
[06:42:12] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5204095 keys - replication_delay is 0
[06:47:01] <_joe_> !log activating the jessie jobrunner, mw1299
[06:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:49:21] RECOVERY - Apache HTTP on mw2241 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.680 second response time
[06:49:22] RECOVERY - puppet last run on mw2241 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:50:02] RECOVERY - puppet last run on mw1299 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:12] Operations, Commons, media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#2392903 (MoritzMuehlenhoff) @tgr We're in the process of moving the mw* systems to jessie, so we can take the migration to jessie as an opportunity to move to 2.40.16. I had already opened...
[06:56:33] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[06:57:32] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:57:51] RECOVERY - Apache HTTP on mw2242 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.284 second response time
[06:58:41] RECOVERY - puppet last run on db2062 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:58:42] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:07:35] RECOVERY - Apache HTTP on mw2245 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.389 second response time
[07:07:36] RECOVERY - puppet last run on mw2245 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[07:07:55] RECOVERY - puppet last run on mw2244 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[07:07:56] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[07:09:26] RECOVERY - Apache HTTP on mw2244 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.132 second response time
[07:12:05] Operations: Staging area for the next version of the transparency report - https://phabricator.wikimedia.org/T138197#2392917 (ori)
[07:12:15] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[07:17:26] PROBLEM - puppet last run on mw2109 is CRITICAL: CRITICAL: puppet fail
[07:17:26] PROBLEM - puppet last run on mw2229 is CRITICAL: CRITICAL: puppet fail
[07:17:26] PROBLEM - puppet last run on cp2023 is CRITICAL: CRITICAL: puppet fail
[07:17:27] PROBLEM - puppet last run on elastic1004 is CRITICAL: CRITICAL: puppet fail
[07:17:35] PROBLEM - puppet last run on db2009 is CRITICAL: CRITICAL: puppet fail
[07:17:36] PROBLEM - puppet last run on mw2117 is CRITICAL: CRITICAL: puppet fail
[07:17:46] PROBLEM - puppet last run on db1022 is CRITICAL: CRITICAL: puppet fail
[07:17:56] PROBLEM - puppet last run on planet1001 is CRITICAL: CRITICAL: puppet fail
[07:17:56] PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: puppet fail
[07:18:06] PROBLEM - puppet last run on db1066 is CRITICAL: CRITICAL: puppet fail
[07:18:06] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: puppet fail
[07:18:06] PROBLEM - puppet last run on ganeti2001 is CRITICAL: CRITICAL: puppet fail
[07:18:15] PROBLEM - puppet last run on mw2163 is CRITICAL: CRITICAL: puppet fail
[07:18:16] PROBLEM - puppet last run on maps2002 is CRITICAL: CRITICAL: puppet fail
[07:18:16] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: puppet fail
[07:18:16] PROBLEM - puppet last run on mw2114 is CRITICAL: CRITICAL: puppet fail
[07:18:17] PROBLEM - puppet last run on mw2105 is CRITICAL: CRITICAL: puppet fail
[07:18:17] PROBLEM - puppet last run on analytics1057 is CRITICAL: CRITICAL: puppet fail
[07:18:17] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: puppet fail
[07:18:25] PROBLEM - puppet last run on mw2240 is CRITICAL: CRITICAL: puppet fail
[07:18:25] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: puppet fail
[07:18:25] PROBLEM - puppet last run on cp2026 is CRITICAL: CRITICAL: puppet fail
[07:18:25] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: puppet fail
[07:18:25] PROBLEM - puppet last run on auth1001 is CRITICAL: CRITICAL: puppet fail
[07:18:26] PROBLEM - puppet last run on lvs1005 is CRITICAL: CRITICAL: puppet fail
[07:18:26] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: puppet fail
[07:18:27] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: puppet fail
[07:18:27] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: puppet fail
[07:18:28] PROBLEM - puppet last run on mw2244 is CRITICAL: CRITICAL: puppet fail
[07:18:36] PROBLEM - puppet last run on elastic1008 is CRITICAL: CRITICAL: puppet fail
[07:18:36] PROBLEM - puppet last run on mw2083 is CRITICAL: CRITICAL: puppet fail
[07:18:36] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: puppet fail
[07:18:36] PROBLEM - puppet last run on wtp2001 is CRITICAL: CRITICAL: puppet fail
[07:18:37] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: puppet fail
[07:18:37] PROBLEM - puppet last run on mw1241 is CRITICAL: CRITICAL: puppet fail
[07:18:45] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: puppet fail
[07:18:45] PROBLEM - puppet last run on mw2087 is CRITICAL: CRITICAL: puppet fail
[07:18:46] PROBLEM - puppet last run on db1031 is CRITICAL: CRITICAL: puppet fail
[07:18:46] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: puppet fail
[07:18:46] PROBLEM - puppet last run on mw2247 is CRITICAL: CRITICAL: puppet fail
[07:18:46] PROBLEM - puppet last run on mw2082 is CRITICAL: CRITICAL: puppet fail
[07:18:46] PROBLEM - puppet last run on ms-fe1001 is CRITICAL: CRITICAL: puppet fail
[07:18:47] PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: puppet fail
[07:18:56] PROBLEM - puppet last run on mw1143 is CRITICAL: CRITICAL: puppet fail
[07:18:56] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: puppet fail
[07:18:57] PROBLEM - puppet last run on db2045 is CRITICAL: CRITICAL: puppet fail
[07:18:57] PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: puppet fail
[07:19:05] PROBLEM - puppet last run on mw1189 is CRITICAL: CRITICAL: puppet fail
[07:19:05] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: puppet fail
[07:19:05] PROBLEM - puppet last run on ms-be1006 is CRITICAL: CRITICAL: puppet fail
[07:19:06] PROBLEM - puppet last run on mw1003 is CRITICAL: CRITICAL: puppet fail
[07:19:06] PROBLEM - puppet last run on mw1267 is CRITICAL: CRITICAL: puppet fail
[07:19:06] PROBLEM - puppet last run on ms-be2004 is CRITICAL: CRITICAL: puppet fail
[07:19:06] PROBLEM - puppet last run on mw2079 is CRITICAL: CRITICAL: puppet fail
[07:19:07] PROBLEM - puppet last run on mw2113 is CRITICAL: CRITICAL: puppet fail
[07:19:07] PROBLEM - puppet last run on mw2134 is CRITICAL: CRITICAL: puppet fail
[07:19:15] PROBLEM - puppet last run on wtp2020 is CRITICAL: CRITICAL: puppet fail
[07:19:15] PROBLEM - puppet last run on mw1173 is CRITICAL: CRITICAL: puppet fail
[07:19:15] PROBLEM - puppet last run on mw1153 is CRITICAL: CRITICAL: puppet fail
[07:19:16] PROBLEM - puppet last run on mw2184 is CRITICAL: CRITICAL: puppet fail
[07:19:16] PROBLEM - puppet last run on rdb2005 is CRITICAL: CRITICAL: puppet fail
[07:19:17] PROBLEM - puppet last run on db2059 is CRITICAL: CRITICAL: puppet fail
[07:19:25] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: puppet fail
[07:19:26] PROBLEM - puppet last run on mw1107 is CRITICAL: CRITICAL: puppet fail
[07:19:26] PROBLEM - puppet last run on mw2196 is CRITICAL: CRITICAL: puppet fail
[07:19:27] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: puppet fail
[07:19:27] PROBLEM - puppet last run on elastic1027 is CRITICAL: CRITICAL: puppet fail
[07:19:35] PROBLEM - puppet last run on db1050 is CRITICAL: CRITICAL: puppet fail
[07:19:36] PROBLEM - puppet last run on analytics1035 is CRITICAL: CRITICAL: puppet fail
[07:19:36] PROBLEM - puppet last run on mw1150 is CRITICAL: CRITICAL: puppet fail
[07:19:36] PROBLEM - puppet last run on db2008 is CRITICAL: CRITICAL: puppet fail
[07:19:37] PROBLEM - puppet last run on db1079 is CRITICAL: CRITICAL: puppet fail
[07:19:37] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: puppet fail
[07:19:37] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: puppet fail
[07:19:37] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: puppet fail
[07:23:27] <_joe_> wat?
[07:24:31] I made a typo
[07:24:36] in my private repo commit
[07:24:43] fixed a moment too late
[07:24:56] <_joe_> np :)
[07:25:24] killed icinga-wm to avoid 10,000 lines of shame, i'll bring it right back
[07:25:36] <_joe_> eheh np I just wanted to check
[07:25:47] <_joe_> the private repo comes with no safety net
[07:27:32] i had :set list in vim, which shows invisible characters, with '$' to represent a linebreak
[07:27:41] and i also had a literal '$' at the end of a line
[07:28:06] <_joe_> heh
[07:34:58] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:36:26] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:36:27] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[07:36:37] RECOVERY - puppet last run on mw2242 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:38:06] PROBLEM - Disk space on mw1280 is CRITICAL: Timeout while attempting connection
[07:38:27] PROBLEM - MD RAID on mw1280 is CRITICAL: Timeout while attempting connection
[07:39:18] PROBLEM - Apache HTTP on mw1280 is CRITICAL: Connection timed out
[07:39:26] PROBLEM - configured eth on mw1280 is CRITICAL: Timeout while attempting connection
[07:39:27] RECOVERY - puppet last run on restbase1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:39:46] PROBLEM - dhclient process on mw1280 is CRITICAL: Timeout while attempting connection
[07:39:56] PROBLEM - mediawiki-installation DSH group on mw1280 is CRITICAL: Host mw1280 is not in mediawiki-installation dsh group
[07:40:18] PROBLEM - nutcracker port on mw1280 is CRITICAL: Timeout while attempting connection
[07:40:37] PROBLEM - nutcracker process on mw1280 is CRITICAL: Timeout while attempting connection
[07:40:56] PROBLEM - puppet last run on mw1280 is CRITICAL: Timeout while attempting connection
[07:41:16] PROBLEM - salt-minion processes on mw1280 is CRITICAL: Timeout while attempting connection
[07:41:46] PROBLEM - Check size of conntrack table on mw1280 is CRITICAL: Timeout while attempting connection
[07:42:06] PROBLEM - DPKG on mw1280 is CRITICAL: Timeout while attempting connection
[07:42:47] new app server in icinga, now I can silence it (mw1279 will follow soon!)
[07:43:18] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[07:43:27] !log rebalancing shards on elasticsearch eqiad cluster
[07:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:43:37] RECOVERY - puppet last run on ganeti2001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[07:43:46] RECOVERY - puppet last run on analytics1057 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[07:43:47] RECOVERY - puppet last run on mw2240 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[07:43:48] RECOVERY - puppet last run on cp2023 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[07:43:57] RECOVERY - puppet last run on elastic1004 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[07:44:06] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[07:44:07] RECOVERY - puppet last run on db2059 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[07:44:07] RECOVERY - puppet last run on ganeti2004 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[07:44:08] (CR) Muehlenhoff: [C: 031] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/294939 (https://phabricator.wikimedia.org/T137188) (owner: Faidon Liambotis)
[07:44:17] RECOVERY - puppet last run on db2009 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[07:44:24] (PS1) Ori.livneh: Provision a "staging" microsite for the transparency report [puppet] - https://gerrit.wikimedia.org/r/295192 (https://phabricator.wikimedia.org/T138197)
[07:44:27] RECOVERY - puppet last run on mw2229 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:44:27] RECOVERY - puppet last run on ganeti1002 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[07:44:36] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[07:44:38] RECOVERY - puppet last run on wtp2020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:44:46] RECOVERY - puppet last run on analytics1035 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[07:44:46] RECOVERY - puppet last run on db1031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:44:47] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[07:44:47] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[07:44:47] RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[07:44:57] RECOVERY - puppet last run on mw1173 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[07:45:06] RECOVERY - puppet last run on planet1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:45:08] RECOVERY - puppet last run on db1050 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[07:45:16] RECOVERY - puppet last run on elastic1018 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[07:45:16] RECOVERY - puppet last run on ms-be2004 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[07:45:17] RECOVERY - puppet last run on db2045 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[07:45:17] RECOVERY - puppet last run on mw2109 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[07:45:18] RECOVERY - puppet last run on copper is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:45:27] RECOVERY - puppet last run on db1066 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:45:27] RECOVERY - puppet last run on ms-be1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:45:27] RECOVERY - puppet last run on db2008 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[07:45:27] RECOVERY - puppet last run on mc2016 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[07:45:36] RECOVERY - puppet last run on es1014 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[07:45:46] RECOVERY - puppet last run on maps2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:45:46] RECOVERY - puppet last run on db2054 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[07:45:46] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[07:45:47] RECOVERY - puppet last run on mw1107 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[07:45:47] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:45:47] RECOVERY - puppet last run on mw2114 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[07:45:47] RECOVERY - puppet last run on ms-fe1002 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[07:45:48] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[07:45:56] RECOVERY - puppet last run on mw2105 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[07:45:56] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[07:45:57] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:45:57] RECOVERY - puppet last run on cp2026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:45:57] RECOVERY - puppet last run on auth1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:45:57] RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:45:57] RECOVERY - puppet last run on sinistra is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[07:45:58] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:45:58] RECOVERY - puppet last run on lvs1004 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[07:45:59] RECOVERY - puppet last run on lvs1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:45:59] RECOVERY - puppet last run on elastic2004 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[07:46:06] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[07:46:06] RECOVERY - puppet last run on db1022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:46:07] RECOVERY - puppet last run on db2065 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[07:46:08] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:46:08] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:46:16] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[07:46:17] RECOVERY - puppet last run on mw2163 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[07:46:17] RECOVERY - puppet last run on mw2117 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[07:46:17] RECOVERY - puppet last run on mw1253 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[07:46:18] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:46:18] RECOVERY - puppet last run on rdb2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:46:26] RECOVERY - puppet last run on elastic1008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:46:26] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:46:27] RECOVERY - puppet last run on mw2083 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[07:46:27] RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[07:46:36] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[07:46:36] RECOVERY - puppet last run on db2047 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[07:46:37] RECOVERY - puppet last run on wtp2010 is OK: OK: Puppet is currently
enabled, last run 27 seconds ago with 0 failures [07:46:37] RECOVERY - puppet last run on mw1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:46:37] RECOVERY - puppet last run on mw1241 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:46:37] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:46:37] RECOVERY - puppet last run on wtp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:46:38] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:46:38] RECOVERY - puppet last run on elastic1027 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [07:46:46] RECOVERY - puppet last run on mw1153 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:46:46] RECOVERY - puppet last run on mw1150 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:46:46] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [07:46:47] RECOVERY - puppet last run on mw1267 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:46:47] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:46:47] RECOVERY - puppet last run on mw2127 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [07:46:48] RECOVERY - puppet last run on mw2087 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:46:56] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [07:46:56] RECOVERY - puppet last run on etherpad1001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [07:46:56] RECOVERY - puppet 
last run on cp4014 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [07:46:56] RECOVERY - puppet last run on mw2247 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:46:57] RECOVERY - puppet last run on restbase2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:46:57] RECOVERY - puppet last run on cp2003 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [07:46:57] RECOVERY - puppet last run on ms-fe1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:46:58] RECOVERY - puppet last run on mw2082 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:06] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:07] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [07:47:07] RECOVERY - puppet last run on mw1155 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [07:47:07] RECOVERY - puppet last run on wtp2016 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [07:47:07] RECOVERY - puppet last run on wtp1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:16] RECOVERY - puppet last run on db1079 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:16] RECOVERY - puppet last run on cp2014 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [07:47:17] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:47:17] RECOVERY - puppet last run on cp2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:26] RECOVERY - puppet last run on mw2184 is OK: OK: Puppet is currently enabled, last run 24 seconds ago 
with 0 failures [07:47:26] RECOVERY - puppet last run on labvirt1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:26] RECOVERY - puppet last run on labstore1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [07:47:27] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:27] RECOVERY - puppet last run on mw1189 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:27] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:27] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:36] RECOVERY - puppet last run on mendelevium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:36] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:36] RECOVERY - puppet last run on mw1131 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:37] RECOVERY - puppet last run on mw2196 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [07:47:37] RECOVERY - puppet last run on db1042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:37] RECOVERY - puppet last run on labsdb1005 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [07:47:38] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:38] RECOVERY - puppet last run on graphite2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:38] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [07:47:46] RECOVERY - puppet last run on mw2176 
is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:46] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:46] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:47] RECOVERY - puppet last run on dbproxy1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:47] RECOVERY - puppet last run on elastic2008 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [07:47:47] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [07:47:48] RECOVERY - puppet last run on db2038 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [07:47:48] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [07:47:53] (03PS1) 10Ori.livneh: Add passwords::misc::private_static_site [labs/private] - 10https://gerrit.wikimedia.org/r/295193 [07:47:56] RECOVERY - puppet last run on mw1154 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [07:47:56] RECOVERY - puppet last run on mw2070 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [07:47:56] RECOVERY - puppet last run on mw2096 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [07:47:56] RECOVERY - puppet last run on mw2143 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:56] RECOVERY - puppet last run on mw1137 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [07:47:57] RECOVERY - puppet last run on mw1104 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:57] RECOVERY - puppet last run on mw2084 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 
failures [07:47:58] RECOVERY - puppet last run on mw2113 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [07:47:58] RECOVERY - puppet last run on cp1071 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [07:48:07] RECOVERY - puppet last run on mw2233 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:08] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:08] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [07:48:08] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [07:48:16] RECOVERY - puppet last run on db1016 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [07:48:17] RECOVERY - puppet last run on mw2123 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:17] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [07:48:17] RECOVERY - puppet last run on elastic1041 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:48:18] RECOVERY - puppet last run on es1016 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [07:48:18] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [07:48:19] (03CR) 10Ori.livneh: [C: 032 V: 032] Add passwords::misc::private_static_site [labs/private] - 10https://gerrit.wikimedia.org/r/295193 (owner: 10Ori.livneh) [07:48:26] RECOVERY - puppet last run on labstore2003 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [07:48:26] RECOVERY - puppet last run on mw2093 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:26] RECOVERY - 
puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:27] RECOVERY - puppet last run on mw1194 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:27] RECOVERY - puppet last run on mw1230 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [07:48:27] RECOVERY - puppet last run on mw1021 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:48:27] RECOVERY - puppet last run on sca2002 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [07:48:36] RECOVERY - puppet last run on antimony is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:37] RECOVERY - puppet last run on mw2131 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [07:48:37] RECOVERY - puppet last run on mw2090 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:46] RECOVERY - puppet last run on lvs1011 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [07:48:47] RECOVERY - puppet last run on cp1073 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [07:48:47] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [07:48:47] RECOVERY - puppet last run on mw2212 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:47] RECOVERY - puppet last run on db1082 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:47] RECOVERY - puppet last run on ms-be3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:47] RECOVERY - puppet last run on elastic1006 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [07:48:48] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 28 
seconds ago with 0 failures [07:48:48] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [07:48:49] RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:49] RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:50] RECOVERY - puppet last run on db1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:50] RECOVERY - puppet last run on mw1213 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:57] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [07:48:57] RECOVERY - puppet last run on mw2067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:57] RECOVERY - puppet last run on mw2079 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:57] RECOVERY - puppet last run on mw2142 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:57] RECOVERY - puppet last run on mc1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:57] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [07:49:06] RECOVERY - puppet last run on conf2002 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [07:49:06] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [07:49:06] RECOVERY - puppet last run on db1068 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:06] RECOVERY - puppet last run on mw2110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:06] RECOVERY - puppet last run on mw2182 is OK: OK: Puppet 
is currently enabled, last run 48 seconds ago with 0 failures [07:49:07] RECOVERY - puppet last run on druid1001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [07:49:07] RECOVERY - puppet last run on db2029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:08] RECOVERY - puppet last run on maps2001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [07:49:08] RECOVERY - puppet last run on mc2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:16] RECOVERY - puppet last run on multatuli is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:16] RECOVERY - puppet last run on ms-be2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:17] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [07:49:17] RECOVERY - puppet last run on mw2134 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:17] RECOVERY - puppet last run on cp1062 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [07:49:18] RECOVERY - puppet last run on restbase1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:18] RECOVERY - puppet last run on mw1020 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [07:49:18] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [07:49:26] RECOVERY - puppet last run on mw2092 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:26] RECOVERY - puppet last run on cp2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:27] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures 
[07:49:27] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:27] RECOVERY - puppet last run on elastic1045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:27] RECOVERY - puppet last run on elastic1016 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [07:49:27] RECOVERY - puppet last run on db2057 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:28] RECOVERY - puppet last run on mc1008 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [07:49:28] RECOVERY - puppet last run on db1071 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [07:49:36] RECOVERY - puppet last run on mw2168 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:36] RECOVERY - puppet last run on mw2223 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:36] RECOVERY - puppet last run on elastic1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:37] RECOVERY - puppet last run on ms-be1021 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [07:49:37] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:37] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:49:37] RECOVERY - puppet last run on ms-be3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:38] RECOVERY - puppet last run on ocg1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:38] RECOVERY - puppet last run on cp2022 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [07:49:46] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet 
is currently enabled, last run 27 seconds ago with 0 failures [07:49:47] RECOVERY - puppet last run on mw1094 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [07:49:47] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:47] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:49:47] RECOVERY - puppet last run on elastic2013 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [07:49:56] RECOVERY - puppet last run on lvs4001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:56] RECOVERY - puppet last run on restbase1014 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [07:49:57] RECOVERY - puppet last run on db2061 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [07:49:57] RECOVERY - puppet last run on analytics1052 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [07:49:58] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:50:06] RECOVERY - puppet last run on mw1206 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:50:08] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [07:50:16] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [07:50:16] RECOVERY - puppet last run on mw2172 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [07:50:17] RECOVERY - puppet last run on db2023 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [07:50:17] RECOVERY - puppet last run on db2041 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 
failures [07:50:17] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [07:50:18] RECOVERY - puppet last run on ms-be2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:50:18] RECOVERY - puppet last run on snapshot1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:50:18] RECOVERY - puppet last run on mw2174 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:50:27] RECOVERY - puppet last run on mw2101 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:50:36] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:50:36] RECOVERY - puppet last run on pc2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:50:36] RECOVERY - puppet last run on mw2085 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:50:37] RECOVERY - puppet last run on db2037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:50:37] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [07:50:37] RECOVERY - puppet last run on db1024 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [07:50:37] RECOVERY - puppet last run on wasat is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [07:50:41] (03PS2) 10Ori.livneh: Provision a "staging" microsite for the transparency report [puppet] - 10https://gerrit.wikimedia.org/r/295192 (https://phabricator.wikimedia.org/T138197) [07:50:46] RECOVERY - puppet last run on cp1099 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:50:47] RECOVERY - puppet last run on db1026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:50:47] 
RECOVERY - puppet last run on mw2098 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [07:50:47] RECOVERY - puppet last run on mw2107 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [07:50:47] RECOVERY - puppet last run on mw2130 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:50:47] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:50:48] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [07:50:48] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [07:50:48] RECOVERY - puppet last run on kafka1013 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [07:50:49] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:50:56] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:50:57] RECOVERY - puppet last run on db1093 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [07:50:57] RECOVERY - puppet last run on maps-test2001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [07:50:57] RECOVERY - puppet last run on mw1180 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [07:50:57] RECOVERY - puppet last run on elastic1035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:50:57] RECOVERY - puppet last run on labvirt1005 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [07:50:57] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:50:58] RECOVERY - puppet last run on mw2062 is OK: OK: Puppet is 
currently enabled, last run 1 minute ago with 0 failures [07:50:58] RECOVERY - puppet last run on lvs2003 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [07:50:59] RECOVERY - puppet last run on mw2111 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:51:06] RECOVERY - puppet last run on ms-be2008 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [07:51:07] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [07:51:07] RECOVERY - puppet last run on mw2200 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:51:07] RECOVERY - puppet last run on kafka1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:51:08] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [07:51:08] RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:51:08] RECOVERY - puppet last run on conf2003 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [07:51:17] RECOVERY - puppet last run on db2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:51:17] (03CR) 10Ori.livneh: [C: 032 V: 032] Provision a "staging" microsite for the transparency report [puppet] - 10https://gerrit.wikimedia.org/r/295192 (https://phabricator.wikimedia.org/T138197) (owner: 10Ori.livneh) [07:51:17] RECOVERY - puppet last run on mw2150 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [07:51:26] RECOVERY - puppet last run on mw2133 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [07:51:26] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:51:27] RECOVERY - puppet last run on tegmen is OK: 
OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:51:27] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:51:27] RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [07:51:28] RECOVERY - puppet last run on bohrium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:51:28] RECOVERY - puppet last run on mw2203 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:51:28] RECOVERY - puppet last run on conf2001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [07:51:28] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [07:51:36] RECOVERY - puppet last run on mw2237 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [07:51:37] RECOVERY - puppet last run on db1062 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:51:37] RECOVERY - puppet last run on es2012 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [07:51:37] RECOVERY - puppet last run on cp2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:51:46] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [07:51:46] RECOVERY - puppet last run on mw1245 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [07:51:47] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:51:47] RECOVERY - puppet last run on db2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:51:47] RECOVERY - puppet last run on ms-be1017 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures 
[07:51:48] RECOVERY - puppet last run on hassium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:51:48] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:51:57] RECOVERY - puppet last run on wdqs1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:51:58] RECOVERY - puppet last run on wtp1021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:51:58] RECOVERY - puppet last run on uranium is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [07:51:59] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:52:06] RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:52:07] RECOVERY - puppet last run on ms-be2014 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:52:07] RECOVERY - puppet last run on ms-be1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:52:08] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [07:52:08] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:52:16] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:52:16] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:52:17] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:52:17] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [07:52:17] RECOVERY - puppet last run on analytics1044 is OK: OK: Puppet is 
currently enabled, last run 1 minute ago with 0 failures [07:52:18] RECOVERY - puppet last run on wtp2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:52:18] RECOVERY - puppet last run on mw2094 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:52:26] RECOVERY - puppet last run on db1076 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:52:26] RECOVERY - puppet last run on db1081 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:52:27] RECOVERY - puppet last run on mc1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:52:27] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [07:52:27] RECOVERY - puppet last run on aqs1001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [07:52:27] RECOVERY - puppet last run on cp2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:52:27] RECOVERY - puppet last run on mw1240 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [07:52:28] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [07:52:28] RECOVERY - puppet last run on cp2019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:52:29] RECOVERY - puppet last run on db1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:52:36] RECOVERY - puppet last run on db2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:52:37] RECOVERY - puppet last run on ms-be1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:52:38] RECOVERY - puppet last run on db1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:52:47] RECOVERY - puppet last run 
on ms-fe1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:52:47] RECOVERY - puppet last run on elastic1023 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [07:52:56] RECOVERY - puppet last run on ms-be2013 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [07:52:56] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [07:52:56] RECOVERY - puppet last run on mw1165 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:52:57] RECOVERY - puppet last run on mw2068 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [07:52:57] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:53:07] RECOVERY - puppet last run on ms-be2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:53:07] RECOVERY - puppet last run on mw2231 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:53:16] RECOVERY - puppet last run on wtp1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:53:17] RECOVERY - puppet last run on mw2106 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:53:27] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:53:27] RECOVERY - puppet last run on mw1198 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:53:36] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:53:47] RECOVERY - puppet last run on mw2193 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:54:07] RECOVERY - puppet last run on ms-be2017 is OK: OK: Puppet is currently enabled, last run 2 minutes ago 
with 0 failures [07:54:17] RECOVERY - puppet last run on mw1134 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:54:17] RECOVERY - puppet last run on mw2167 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:54:17] RECOVERY - puppet last run on mw1256 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:54:28] RECOVERY - puppet last run on mw2186 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:54:37] RECOVERY - puppet last run on mw2144 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:08:11] 06Operations, 10Wikimedia-Apache-configuration, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2392961 (10elukey) Adding some random thoughts about the following snippet of c... [08:09:17] RECOVERY - Apache HTTP on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.009 second response time [08:13:37] RECOVERY - configured eth on mw1280 is OK: OK - interfaces up [08:13:47] RECOVERY - Check size of conntrack table on mw1280 is OK: OK: nf_conntrack is 0 % full [08:13:57] RECOVERY - dhclient process on mw1280 is OK: PROCS OK: 0 processes with command name dhclient [08:14:18] RECOVERY - Disk space on mw1280 is OK: DISK OK [08:14:36] RECOVERY - nutcracker port on mw1280 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [08:14:46] RECOVERY - MD RAID on mw1280 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [08:14:47] RECOVERY - nutcracker process on mw1280 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [08:15:27] RECOVERY - salt-minion processes on mw1280 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:15:33] !log installing libxslt security updates [08:15:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:18:27] RECOVERY - DPKG on mw1280 is OK: All packages OK [08:32:51] Hi, is it possible to turn off changing position of params in templates for VE? A lot of users of cswiki want this. Thanks for your advice. [08:33:55] Urbanecm: file a ticket in phabricator [08:35:57] PROBLEM - puppet last run on mw2235 is CRITICAL: CRITICAL: puppet fail [08:43:42] Okay. Filed as T138200. Can somebody process this request soon? Thx [08:50:46] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [08:52:17] PROBLEM - puppet last run on mw2182 is CRITICAL: CRITICAL: Puppet has 1 failures [08:53:32] (03CR) 10Elukey: [C: 04-1] "Thanks a lot for working on this! I added a couple of comments to the code review, plus tried to run the puppet compiler but encountered s" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/295123 (owner: 10Nicko) [08:54:45] (03PS1) 10Filippo Giunchedi: prometheus: use group root [puppet] - 10https://gerrit.wikimedia.org/r/295197 [08:56:49] (03PS1) 10Muehlenhoff: Bump the size limit for labs openldap server to 4096 [puppet] - 10https://gerrit.wikimedia.org/r/295198 (https://phabricator.wikimedia.org/T122595) [08:57:00] !log deploying 5dfe738 in ores nodes [08:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:57:11] * Amir1 is crossing fingers, please don't go down [08:57:16] Amir1: o/ [08:57:23] hey [08:57:28] <_joe_> Amir1: that's *very* reassuring [08:57:28] elukey: o/ [08:57:57] it shouldn't go down, we fixed everything. 
I waited until monday [08:58:10] but I'm scared a little :D [08:58:18] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] prometheus: use group root [puppet] - 10https://gerrit.wikimedia.org/r/295197 (owner: 10Filippo Giunchedi) [09:00:15] Okay, the deployment errored but everything is up and I rolled it back [09:00:16] RECOVERY - puppet last run on mw2235 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [09:00:33] (never went down fwiw) [09:05:55] and only one node errored, I think a codfw node [09:09:21] scb2001.codfw.wmnet seems to be broken [09:09:30] let me dig deeper [09:09:45] (03Abandoned) 10Muehlenhoff: Use paged searches in ldaplist [puppet] - 10https://gerrit.wikimedia.org/r/262745 (owner: 10Muehlenhoff) [09:12:39] (03PS1) 10Jcrespo: Repool db1073 at 100% weight; depool db1071 for reimaging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295199 [09:14:57] (03PS1) 10Jforrester: Enable VisualEditor by default for all users of French Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295200 (https://phabricator.wikimedia.org/T133263) [09:18:57] RECOVERY - puppet last run on mw2182 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:19:58] (03PS7) 10Filippo Giunchedi: prometheus: add nginx reverse proxy [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) [09:20:37] (03CR) 10jenkins-bot: [V: 04-1] prometheus: add nginx reverse proxy [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [09:25:42] (03CR) 10Filippo Giunchedi: prometheus: add nginx reverse proxy (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [09:26:38] 06Operations, 10ops-eqiad: db1009 degraded RAID (failed disk) - https://phabricator.wikimedia.org/T138203#2393091 (10jcrespo) [09:31:43] 06Operations, 10Ops-Access-Requests, 
10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA request for @thiemowmde - https://phabricator.wikimedia.org/T135994#2393095 (10thiemowmde) 05Resolved>03Open Log in with //what//? I tried to login with my Wikitech login name "Thiemo Mättig (WMDE)" and the password I... [09:34:14] PROBLEM - Apache HTTP on mw1279 is CRITICAL: Connection timed out [09:35:38] ---^ new appserver [09:35:57] <_joe_> heh I didn't remember if ti was in rotation [09:36:00] <_joe_> is it? [09:36:38] <_joe_> it's not, ok [09:37:27] _joe_: buongiorno! This one is fixed now right? https://phabricator.wikimedia.org/T137689 [09:40:01] <_joe_> ema: not completely I guess [09:40:20] <_joe_> that's one of the things I have to do today in fact [09:41:03] (03PS1) 10Giuseppe Lavagetto: salt: add conftool module [puppet] - 10https://gerrit.wikimedia.org/r/295202 [09:42:17] (03CR) 10jenkins-bot: [V: 04-1] salt: add conftool module [puppet] - 10https://gerrit.wikimedia.org/r/295202 (owner: 10Giuseppe Lavagetto) [09:42:26] <_joe_> grrr pep8 I bet [09:42:27] _joe_: oh, alright. 
I thought the logrotate 'su' stuff would have been enough :) [09:42:38] <_joe_> ema: nope, see the cronspam [09:42:57] <_joe_> I'll fix it all today [09:43:13] cool, thanks [09:44:11] (03PS8) 10Filippo Giunchedi: prometheus: add nginx reverse proxy [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) [09:44:40] (03CR) 10jenkins-bot: [V: 04-1] prometheus: add nginx reverse proxy [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [09:44:47] yeah yeah [09:44:51] (03PS9) 10Filippo Giunchedi: prometheus: add nginx reverse proxy [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) [09:45:22] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA request for @thiemowmde - https://phabricator.wikimedia.org/T135994#2393124 (10jcrespo) a:05jcrespo>03None The on duty clinic ops will try to solve this. But yes, that is your correct `cn`, which you should be able to... [09:45:36] 06Operations, 10Wikimedia-Apache-configuration, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2393128 (10Joe) The "got bogus version 1" is a logging bug and has been fixed i... 
[09:46:24] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:50:36] !log rolling reboot of restbase2001/restbase2002 for upgrade to Linux 4.4 [09:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:57:55] 06Operations, 10Wikimedia-Apache-configuration, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2393152 (10elukey) With the following code snippet I obtain: ``` First test: b... [10:03:34] (03CR) 10Jcrespo: [C: 032] Repool db1073 at 100% weight; depool db1071 for reimaging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295199 (owner: 10Jcrespo) [10:04:00] (03PS2) 10Giuseppe Lavagetto: salt: add conftool module [puppet] - 10https://gerrit.wikimedia.org/r/295202 [10:05:11] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1073 at 100% weight; depool db1071 for reimaging (duration: 00m 27s) [10:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:05:15] (03CR) 10jenkins-bot: [V: 04-1] salt: add conftool module [puppet] - 10https://gerrit.wikimedia.org/r/295202 (owner: 10Giuseppe Lavagetto) [10:05:24] <_joe_> I know, I know [10:05:35] (03PS3) 10Giuseppe Lavagetto: salt: add conftool module [puppet] - 10https://gerrit.wikimedia.org/r/295202 [10:07:02] 06Operations, 13Patch-For-Review: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991#2318329 (10fgiunchedi) >>! In T135991#2390413, @faidon wrote: > - rsyslog: this closes the socket to all processes, some of which may not handle it gracefully and reconn... 
[10:07:48] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.002 second response time [10:12:17] (03CR) 10Ori.livneh: prometheus: add nginx reverse proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [10:19:49] (03CR) 10Ori.livneh: prometheus: add nginx reverse proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [10:24:47] (03PS1) 10KartikMistry: apertium-es-it: Rebuild for Jessie and other fixes [debs/contenttranslation/apertium-es-it] - 10https://gerrit.wikimedia.org/r/295206 (https://phabricator.wikimedia.org/T107306) [10:25:23] (03PS1) 10Ema: Port varnishrls to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/295207 (https://phabricator.wikimedia.org/T131353) [10:28:12] (03PS2) 10Ema: Port varnishrls to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/295207 (https://phabricator.wikimedia.org/T131353) [10:28:23] 06Operations, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2393240 (10elukey) We finally tracked down all the sources of null/missing end timestamps coming from Varnish: 1) Varnish Pipe logs,... 
[10:30:44] (03PS2) 10KartikMistry: apertium-es-it: Rebuild for Jessie and other fixes [debs/contenttranslation/apertium-es-it] - 10https://gerrit.wikimedia.org/r/295206 (https://phabricator.wikimedia.org/T107306) [10:31:40] !log restbase started mobile-sections dump for eswiki on restbase1009 for T136964 [10:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:33:02] akosiaris: you may want to start looking at Apertium package reviews :) [10:35:16] !log db1071 stop, backup and reimage [10:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:36:20] kart_: it's a public holiday in Greece today [10:38:27] (03PS2) 10Filippo Giunchedi: swift: enable statsd for all daemons [puppet] - 10https://gerrit.wikimedia.org/r/294691 [10:38:36] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: enable statsd for all daemons [puppet] - 10https://gerrit.wikimedia.org/r/294691 (owner: 10Filippo Giunchedi) [10:40:41] there are some servers that are trying to hit a depooled server [10:41:21] ah, I can see it now [10:41:42] there was an error on the comments [10:42:57] (03PS1) 10Jcrespo: Depool db1071 completelly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295208 [10:43:44] (03CR) 10Jcrespo: [C: 032] Depool db1071 completelly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295208 (owner: 10Jcrespo) [10:44:54] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1071 completelly (duration: 00m 25s) [10:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:45:12] (03PS1) 10Elukey: Add mw1279 and mw1280 to MediaWiki scap DSH list. 
[puppet] - 10https://gerrit.wikimedia.org/r/295209 [10:45:30] thanks to cleaning up noisy log errors we can detect now spikes of less than 100 errors and correct them [10:45:36] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp San Francisco page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp San Francisco page via mobile-sections-lead responds with malformed body: NoneType object has no attribute get [10:45:36] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp San Francisco page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp San Francisco page via mobile-sections-lead responds with malformed body: NoneType object has no attribute get [10:45:53] <_joe_> uhm [10:46:10] <_joe_> so it seems that page is returning an empty result [10:46:18] <_joe_> mobrovac: ^^ [10:47:24] <_joe_> I am going to take a look [10:47:46] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [10:47:47] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [10:48:53] <_joe_> it was transient it seems :/ [10:50:20] moritzm: thanks. 
I was about to poke alex more :) [10:51:37] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [10:51:47] _joe_, https://en.wikipedia.org/w/index.php?title=San_Francisco&diff=726151557&oldid=726112550 [10:53:29] !log roll-restart swift on ms-be2* to apply https://gerrit.wikimedia.org/r/294691 [10:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:53:57] so either the app should be more reliable (probably difficult) or the check should not assume SF will always have a nice header, which on a wiki is not true [10:55:31] (03PS4) 10Giuseppe Lavagetto: salt: add conftool module [puppet] - 10https://gerrit.wikimedia.org/r/295202 [10:55:40] <_joe_> jynus: ahahahahahaha [10:55:44] (03CR) 10Elukey: [C: 032] Add mw1279 and mw1280 to MediaWiki scap DSH list. [puppet] - 10https://gerrit.wikimedia.org/r/295209 (owner: 10Elukey) [10:56:00] <_joe_> jynus: NBA trolling [10:58:31] !log deploying bdc1e2b in ores nodes [10:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:58:36] (03PS1) 10Muehlenhoff: Update debdeploy config for maps caches [puppet] - 10https://gerrit.wikimedia.org/r/295211 [10:58:36] NBA trolling *with syntax error* [10:59:06] clearly a fail [11:00:22] _joe_: sigh, thnx for looking into it [11:00:46] seeing the history of that, maybe that should be semi-protected, latest edits are mostly lebron jokes [11:00:49] <_joe_> jynus: still, was a juicy troll attempt [11:00:57] <_joe_> ahah [11:02:59] 06Operations, 13Patch-For-Review: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991#2318329 (10ori) >>! In T135991#2393165, @fgiunchedi wrote: > It seems one of the core problems is ensuring no restarts have been forgot As long as you know and remember... 
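The mobileapps check above failed with "NoneType object has no attribute get": the health check chained .get() calls on a response section that came back as None after the San Francisco page lead was mangled. A minimal sketch of the defensive pattern the check could use, with hypothetical field names (the real mobile-sections-lead response schema differs):

```python
def lead_heading(body):
    """Return the lead section heading, or None when the page body is
    missing or mangled (e.g. blanked by vandalism)."""
    # body.get("lead") may itself be None; guard before chaining .get(),
    # otherwise this raises AttributeError on a NoneType.
    lead = body.get("lead") or {}
    return lead.get("heading")
```

A check written this way degrades to a clean "missing field" failure instead of a malformed-body traceback.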
[11:04:15] it got down again [11:04:18] wtf [11:04:55] the same old issue [11:06:58] PROBLEM - ores on scb1001 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 136 bytes in 0.009 second response time [11:07:08] yup [11:07:13] it's down, it shouldn't be [11:07:26] can ops help? [11:07:26] PROBLEM - ores on scb2002 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 136 bytes in 0.075 second response time [11:07:37] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - ores_8081 - Could not depool server scb2001.codfw.wmnet because of too many down! [11:07:56] Amir1: sure [11:07:57] PROBLEM - ores on scb2001 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 136 bytes in 0.076 second response time [11:08:03] thanks elukey [11:08:17] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - ores_8081 - Could not depool server scb1002.eqiad.wmnet because of too many down! [11:08:30] first, run puppet agent in scb100[12] nodes [11:08:32] PROBLEM - LVS HTTP IPv4 on ores.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 136 bytes in 0.002 second response time [11:08:38] Amir1: doing it [11:08:38] 06Operations, 13Patch-For-Review: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991#2318329 (10Joe) @ori `checkrestart` does exactly that. [11:09:06] PROBLEM - ores on scb1002 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 136 bytes in 0.002 second response time [11:09:07] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - ores_8081 - Could not depool server scb2001.codfw.wmnet because of too many down! [11:09:27] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - ores_8081 - Could not depool server scb1002.eqiad.wmnet because of too many down! [11:09:31] _joe_: TIL [11:09:35] <_joe_> why run puppet? 
[11:09:36] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [11:09:47] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - ores_8081 - Could not depool server scb1002.eqiad.wmnet because of too many down! [11:09:53] <_joe_> Amir1: so what is going on? [11:09:55] _joe_: there is a race condition issue [11:10:04] we probably fixed it [11:10:08] ops have to race to run puppet or else the service goes down? [11:10:16] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0] [11:10:17] but it happened again [11:10:27] <_joe_> Amir1: ok so the fix is? [11:10:32] <_joe_> temporary, I mean [11:10:36] <_joe_> rolling back? [11:10:46] <_joe_> restarting the service? [11:10:49] not rolling back [11:11:01] restart the service should fix it [11:11:09] (temp. of course) [11:11:13] <_joe_> ok, elukey I am doing that [11:11:26] _joe_ celery gets restarted with puppet [11:11:30] the last one is wsgi [11:11:33] then it will be ok [11:11:44] the former is done, the latter needs to be done [11:11:47] 06Operations, 13Patch-For-Review: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991#2393309 (10ori) >>! In T135991#2393305, @Joe wrote: > @ori `checkrestart` does exactly that. TIL. It doesn't look like we have alerting for that, though. [11:12:15] <_joe_> elukey: uwsgi needs a restart? 
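The PyBal alerts above ("Could not depool server … because of too many down!") come from PyBal's depool-threshold safeguard: it refuses to take another backend out of rotation once too large a fraction of the pool is already failing, so a pool can never empty itself. A rough sketch of that decision under an assumed fractional threshold (PyBal's actual configuration and internals differ):

```python
def may_depool(pool_size, already_down, depool_threshold=0.5):
    """Allow depooling one more backend only if at least
    depool_threshold of the pool would still be serving."""
    remaining = pool_size - already_down - 1
    return remaining >= pool_size * depool_threshold
```

With every ores backend failing its health check at once, a guard like this keeps the servers pooled and alerts instead, which is why the 500s stayed visible at the LVS endpoint.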
Amir1: all right next time we could just bring up celery-ores-worker again rather than running puppet :) [12:12:30] elukey: it should be fixed another way [12:12:32] _joe_: last time I had to do it because it was returning 500s [12:12:39] and we fixed it [12:12:45] at least we thought we fixed it [12:13:44] ok done :) [12:13:53] <_joe_> !log restarting uwsgi ores service [12:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:14:41] Amir1: https://ores.wmflabs.org/ seems up, can you double check? [12:14:55] <_joe_> that's labs elukey [12:14:58] <_joe_> this is prod [12:15:08] ah snap wrong link [12:16:00] <_joe_> still getting fetch failed in pybal [12:16:12] Jun 20 11:15:09 scb1001 firejail[19788]: TeX parse error: Undefined control sequence \emph [12:16:26] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [12:16:45] this is after the restart but not sure if related [12:16:47] <_joe_> elukey: that has nothing to do with ores [12:17:33] that has \emph{nothing} to do with ores [12:17:45] aahaha thanks [12:17:50] sorry for the spam [12:17:52] <_joe_> and the status for the uwsgi-ores service shows nothing wrong apparently [12:18:12] <_joe_> Amir1: where can I see the logs for uwsgi-ores? [12:18:19] elukey: it's still down [12:18:27] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: puppet fail [12:18:28] _joe_: in /var/log/ores/main.log [12:18:42] so it might not be related to this [12:20:02] <_joe_> Amir1: no, they are not [12:20:03] /srv/log/ores/main.log [12:20:10] <_joe_> there ^^ [12:20:11] <_joe_> :) [12:20:11] not var [12:20:30] ori: thanks, it was in var in beta [12:21:22] celery-ores-worker.service seems down on scb1001 [12:22:06] <_joe_> ok [12:22:18] there is a [12:22:18] Jun 20 11:10:11 scb1001 celery[23839]: ImportError: Could not load stopwords for revscoring.languages.english. 
[11:22:49] elukey: yeah [11:22:52] mmm puppet tried to bring it back when I ran it [11:23:09] It seems there is something wrong with reading configs [11:23:44] <_joe_> can we roolback please? [11:23:53] <_joe_> this outage lasted long enough [11:23:54] _joe_: yeah [11:23:57] sure [11:24:01] let me do it [11:24:22] how we do rollback via scap? [11:25:08] <_joe_> I guess you deploy the preceding version? [11:25:38] my changes are not related to anything with config, I think it just triggered something deeply wrong [11:25:40] <_joe_> I don't know, to be honest, how you'd rollback [11:26:52] alternatively you can revert your change and roll-forward of course [11:27:17] rlling [11:27:23] *rollbacking to ae71d842dfc0958e06922062dd09d49243332a6a [11:27:28] !log rollbacking ae71d842dfc0958e06922062dd09d49243332a6a [11:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:27:37] (bad logging) [11:27:49] !log for ores in scb nodes [11:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:28:00] thank you [11:28:04] it's back up now [11:28:12] can see celery working now [11:28:18] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [11:28:32] RECOVERY - LVS HTTP IPv4 on ores.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2644 bytes in 0.006 second response time [11:28:40] so correct prod endpoint is https://ores.wikimedia.org/ right? [11:28:47] elukey: yeah [11:29:07] RECOVERY - ores on scb1002 is OK: HTTP OK: HTTP/1.0 200 OK - 2833 bytes in 0.016 second response time [11:29:08] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [11:29:17] RECOVERY - ores on scb1001 is OK: HTTP OK: HTTP/1.0 200 OK - 2833 bytes in 0.012 second response time [11:29:26] <_joe_> Amir1: is the prod service actively used already? 
[11:29:27] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [11:29:27] my deployment wasn't related to the issue, it seems. My deployment actually brought a commit that wasn't deployed [11:29:45] _joe_: yeah, in extensions in wikidata and fawiki [11:29:46] RECOVERY - ores on scb2002 is OK: HTTP OK: HTTP/1.0 200 OK - 2833 bytes in 0.093 second response time [11:29:48] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy [11:30:04] <_joe_> Amir1: ok then you need to write an incident documentation, I guess [11:30:04] but it just causes failed job and they come back [11:30:06] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [11:30:12] I will [11:30:17] RECOVERY - ores on scb2001 is OK: HTTP OK: HTTP/1.0 200 OK - 2833 bytes in 0.090 second response time [11:30:30] <_joe_> https://wikitech.wikimedia.org/wiki/Incident_documentation [11:30:43] https://grafana-admin.wikimedia.org/dashboard/db/ores [11:30:49] I did for wikilabels before [11:33:52] _joe_: it can't stay like this, I need to fix it and we might have an "expected down time" [11:33:59] will it work with Ops [11:34:00] ? [11:34:24] all of our test environments work just fine [11:34:35] <_joe_> Amir1: what do you mean? Please be more explicit [11:34:55] the ores settings in prod need fixing [11:35:42] specially on config (the change I was deploying was not related to configs, it couldn't make any issues but it seems the old changes that were coming with it, made the issue) [11:35:59] <_joe_> ok so to have the right version what would we need? [11:36:01] !log roll-restart swift on ms-be1* to apply https://gerrit.wikimedia.org/r/294691 [11:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:36:08] <_joe_> a coordinated config + code release? 
[11:36:13] <_joe_> please advise [11:36:42] <_joe_> if that is the case, we can do it [11:36:48] _joe_: mostly, i need to have an hour of down time with proper rights [11:37:00] maybe we can do it in one node [11:37:02] <_joe_> an hour of downtime? why? [11:37:13] <_joe_> yes, we can do it all on one node [11:37:24] <_joe_> can I ask you to send a mail to ops@? [11:37:25] logs in ores shows it can't find configs [11:37:35] <_joe_> I am currently going to lunch [11:38:04] I need to see where it's going to read and why it's incorrect to make proper patches [11:38:25] <_joe_> oh so you need to use a node as a testbed [11:38:29] <_joe_> use codfw [11:38:31] yeah [11:38:42] <_joe_> they're not serving traffic I hope [11:38:48] does LVS sends requests to codfw? [11:39:03] <_joe_> only if you refer the codfw lvs in your config [11:39:20] alex set up our LVS [11:39:23] I need to check [11:39:23] <_joe_> so codfw lvs sends traffic to the codfw servers, eqiad => eqiad [11:39:35] <_joe_> yes, the point is how you refer to it in mediawiki-config [11:39:40] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:39:44] <_joe_> I can take a look after lunch [11:39:48] <_joe_> bbl [11:39:51] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:39:53] yeah, that would be amazing [11:39:54] thanks [11:44:40] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [11:44:49] RECOVERY - mediawiki-installation DSH group on mw1280 is OK: OK [11:47:41] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [11:51:00] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [12:04:00] <_joe_> Amir1: so for the record mediawiki at the moment points to ores.wikimedia.org, which is the public endpoint [12:04:15] <_joe_> 
not sure if that's a good idea, but that means all traffic goes to eqiad [12:04:38] _joe_: are you sure? I checked the LVS: https://gerrit.wikimedia.org/r/#/c/291945/3/hieradata/common/lvs/configuration.yaml [12:04:42] <_joe_> so, you can freely use codfw [12:04:49] <_joe_> Amir1: yes I am sure [12:05:04] <_joe_> lvs is set up in both codfw and eqiad, but since mediawiki calls [12:05:27] <_joe_> $wgOresBaseUrl = 'https://ores.wikimedia.org/'; [12:05:36] <_joe_> this goes through the caching layers [12:05:46] yeah okay [12:05:47] <_joe_> and those only forward requests to eqiad [12:05:50] thanks [12:05:55] <_joe_> to the eqiad lvs I mean [12:06:14] <_joe_> I also think that (using the public endpoint) is very wrong, but that's for later [12:06:30] <_joe_> the tl;dr is: you can experiment on scb2001 [12:07:00] <_joe_> but please don't leave a mess [12:13:57] !log started deploying ores in scb2001 bdc1e2bd [12:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:14:20] now it's down, let's see [12:16:49] <_joe_> Amir1: need me to disable puppet there? [12:16:59] nah [12:17:02] no need [12:17:12] I think I'm getting why it's incorrect [12:17:22] let me run the last test and we are done [12:17:33] 06Operations, 13Patch-For-Review: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991#2393453 (10MoritzMuehlenhoff) Service restarts triggered by library updates are already tracked, debdeploy does that similarly to what checkrestart or a manual "lsof -nX... 
[12:18:26] _joe_: okay https://github.com/wiki-ai/ores-wikimedia-config/pull/63/files [12:18:41] sorted() is the reason behind this issue [12:18:51] PROBLEM - ores on scb2001 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 136 bytes in 0.073 second response time [12:19:06] /etc/ores should come last [12:19:10] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 53 probes of 390 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [12:19:14] but sorted() brings it first [12:19:48] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2393462 (10KartikMistry) [12:19:58] fixing it is easy [12:20:02] let me make a patch [12:20:42] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 62 probes of 410 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [12:23:52] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 36 probes of 389 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map [12:25:40] PROBLEM - HHVM rendering on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:26:31] PROBLEM - Apache HTTP on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:26:42] (03CR) 10Gehel: T137422 Refactor the default cassandra monitoring into a separate class (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/295123 (owner: 10Nicko) [12:28:12] PROBLEM - configured eth on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:28:12] PROBLEM - DPKG on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:28:31] PROBLEM - SSH on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:28:42] PROBLEM - dhclient process on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
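The ordering bug Amir1 describes can be sketched as follows. This is a minimal stand-in with hypothetical paths and keys, not the actual ores-wikimedia-config loader: later config sources are supposed to override earlier ones, and /etc/ores must be merged last so its production values win, but sorting the source list puts "/etc/ores" first ("/" sorts before "c"), so the overrides get clobbered:

```python
# Intended precedence: /etc/ores last, so its settings win the merge.
config_sources = ["config", "/etc/ores"]

def merge(sources, contents):
    merged = {}
    for src in sources:
        merged.update(contents[src])  # each source overrides the previous
    return merged

contents = {
    "config": {"host": "default-host"},
    "/etc/ores": {"host": "prod-host"},
}

broken = merge(sorted(config_sources), contents)  # "/etc/ores" merged first
fixed = merge(config_sources, contents)           # declared order preserved

print(broken["host"])  # default-host -- production override lost
print(fixed["host"])   # prod-host
```

The fix is simply to stop sorting the source list and preserve the order the deployment config declares.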
[12:28:51] PROBLEM - nutcracker process on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:29:10] PROBLEM - puppet last run on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:29:11] PROBLEM - Check size of conntrack table on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:29:21] PROBLEM - salt-minion processes on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:29:21] PROBLEM - Disk space on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:29:31] PROBLEM - HHVM processes on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:29:51] PROBLEM - nutcracker port on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:30:03] (03PS2) 10Gehel: fix puppet unit test in elastic search [puppet] - 10https://gerrit.wikimedia.org/r/295127 (owner: 10Maturain) [12:31:33] (03CR) 10Gehel: [C: 032] fix puppet unit test in elastic search [puppet] - 10https://gerrit.wikimedia.org/r/295127 (owner: 10Maturain) [12:32:41] RECOVERY - ores on scb2001 is OK: HTTP OK: HTTP/1.0 200 OK - 2833 bytes in 0.091 second response time [12:33:30] _joe_: I confirm that was the issue [12:33:39] scb1001 works like a charm [12:33:45] sorry scb2001 [12:34:15] let me make the patch in gerrit and deploy again [12:34:54] if that's okay with Ops [12:36:18] Amir1: quick question - did the sorted() error not come up after the beta deployment? 
[12:36:51] PROBLEM - Disk space on mw1282 is CRITICAL: Timeout while attempting connection [12:36:51] PROBLEM - Disk space on mw1281 is CRITICAL: Timeout while attempting connection [12:37:11] PROBLEM - MD RAID on mw1281 is CRITICAL: Timeout while attempting connection [12:37:11] PROBLEM - MD RAID on mw1282 is CRITICAL: Timeout while attempting connection [12:37:55] these ones are the two new appservers, silencing them [12:38:01] PROBLEM - Apache HTTP on mw1282 is CRITICAL: Connection timed out [12:38:01] PROBLEM - Apache HTTP on mw1281 is CRITICAL: Connection timed out [12:38:01] <_joe_> elukey: kk [12:38:12] PROBLEM - configured eth on mw1281 is CRITICAL: Timeout while attempting connection [12:38:12] PROBLEM - configured eth on mw1282 is CRITICAL: Timeout while attempting connection [12:38:30] elukey: nope [12:38:31] PROBLEM - dhclient process on mw1282 is CRITICAL: Timeout while attempting connection [12:38:37] not in labs, etc. [12:38:55] <_joe_> Amir1: ok so if you found the fix, you can deploy it [12:39:02] thanks [12:39:25] <_joe_> Amir1: let's deploy to one node only first, maybe? [12:40:13] _joe_: I did it in scb2001 and tested internally and it worked [12:40:33] but the commit was a local hack, I need to make a patch in gerrit and +2 it [12:40:57] <_joe_> yes, once you've done it, deploy to one server first, verify it's working, then proceed to the next ones [12:41:17] _joe_: okay [12:41:45] !log rebooting ms1001 for update to Linux 4.4 [12:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:45:43] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [12:46:30] it seems I have trouble connecting to labs, maybe Europe connection issue again? 
[12:47:16] super slow [12:47:31] https://www.irccloud.com/pastebin/GhbncTTv/ [12:47:51] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [12:48:29] (03PS6) 10Gehel: Configuration for new maps cluster in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/294914 (https://phabricator.wikimedia.org/T138092) [12:50:02] RECOVERY - puppet last run on mw2249 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [12:50:08] (03CR) 10Gehel: [C: 032] Configuration for new maps cluster in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/294914 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [12:50:20] Everything else works fine for me but I can't connect to gerrit [12:50:52] and labs [12:51:29] you may be going through Telia which seems to be having problems [12:51:39] >70% packet loss to gerrit from the WMDE office [12:52:10] yes Telia or afterwards (from our pov) problem [12:52:32] RECOVERY - DPKG on mw1145 is OK: All packages OK [12:52:32] RECOVERY - nutcracker process on mw1145 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [12:52:41] RECOVERY - HHVM processes on mw1145 is OK: PROCS OK: 6 processes with command name hhvm [12:52:52] RECOVERY - puppet last run on mw1145 is OK: OK: Puppet is currently enabled, last run 58 minutes ago with 0 failures [12:53:03] RECOVERY - nutcracker port on mw1145 is OK: TCP OK - 0.000 second response time on port 11212 [12:53:41] RECOVERY - SSH on mw1145 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [12:53:53] RECOVERY - configured eth on mw1145 is OK: OK - interfaces up [12:54:12] RECOVERY - dhclient process on mw1145 is OK: PROCS OK: 0 processes with command name dhclient [12:54:12] RECOVERY - salt-minion processes on mw1145 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:54:13] RECOVERY - Disk space on mw1145 is OK: DISK OK [12:54:22] RECOVERY - Check 
size of conntrack table on mw1145 is OK: OK: nf_conntrack is 0 % full [12:54:59] full mtr https://phabricator.wikimedia.org/P3272 [12:55:12] RECOVERY - puppet last run on mw2250 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [12:55:55] that's quite annoying :( [12:56:11] !log deactivating peerings with Telia Carrier/AS1299 (eqiad/codfw/ulsfo) [12:56:12] !log installing maps1001.eqiad.wmnet (secondary cluster, no traffic there yet) - T138092 [12:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:57:42] (03CR) 10Tim Landscheidt: [C: 031] Bump the size limit for labs openldap server to 4096 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/295198 (https://phabricator.wikimedia.org/T122595) (owner: 10Muehlenhoff) [12:57:59] paravoid: thanks [12:58:51] (03Abandoned) 10Tim Landscheidt: WIP: ldap: Make ldaplist use paging for queries [puppet] - 10https://gerrit.wikimedia.org/r/295177 (https://phabricator.wikimedia.org/T122595) (owner: 10Tim Landscheidt) [13:01:25] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 11 probes of 389 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map [13:02:52] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 12 probes of 390 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [13:03:58] !log deploying 8e65182 to scb2001 [13:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:05:14] and the node is up! 
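The RIPE Atlas checks quoted above follow a simple thresholding pattern. Here is a sketch of the implied logic; the ">" comparison and message format are inferred from the alert text, not taken from the real check's source:

```python
def atlas_status(failed, total, alert_threshold=19):
    """Mimic the check output: CRITICAL once failed probes exceed the threshold."""
    state = "CRITICAL" if failed > alert_threshold else "OK"
    return f"{state} - failed {failed} probes of {total} (alerts on {alert_threshold})"

print(atlas_status(53, 390))  # CRITICAL, matches the earlier codfw alert
print(atlas_status(12, 390))  # OK, matches the codfw recovery
```

The pattern fits the timeline: the Telia packet loss pushed failures well past 19 probes, and after the peerings were deactivated the counts dropped back under the threshold.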
[13:05:38] Amir1, you may want to mention ores when you log [13:05:53] jynus: I keep forgetting that, sorry [13:06:28] paravoid: thx, that helped [13:06:30] !log full deployment for 8e65182 in ores nodes [13:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:07:02] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 4 probes of 410 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [13:09:41] RECOVERY - Apache HTTP on mw1281 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.008 second response time [13:10:44] _joe_ elukey: Deployment is done without any downtime [13:10:58] the race condition issue is fixed as well [13:13:52] RECOVERY - Disk space on mw1281 is OK: DISK OK [13:14:33] RECOVERY - MD RAID on mw1281 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [13:14:42] RECOVERY - Apache HTTP on mw1282 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.002 second response time [13:15:41] RECOVERY - configured eth on mw1281 is OK: OK - interfaces up [13:15:45] Amir1: good! a good follow-up for the incident report would be to follow a "safe" deployment from now on (beta, one codfw node, etc..) 
[13:17:42] RECOVERY - MD RAID on mw1282 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [13:18:22] RECOVERY - Disk space on mw1282 is OK: DISK OK [13:18:52] RECOVERY - configured eth on mw1282 is OK: OK - interfaces up [13:19:11] RECOVERY - dhclient process on mw1282 is OK: PROCS OK: 0 processes with command name dhclient [13:22:19] (03PS1) 10KartikMistry: apertium-fr-es: New upstream and rebuild for Jessie [debs/contenttranslation/apertium-fr-es] - 10https://gerrit.wikimedia.org/r/295220 (https://phabricator.wikimedia.org/T107306) [13:23:23] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2393602 (10KartikMistry) [13:24:41] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://citoid.svc.codfw.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.codfw.wmnet:1970/api [13:24:51] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:25:12] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:26:51] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [13:27:02] !log reactivating peerings with Telia Carrier/AS1299 (eqiad/codfw/ulsfo) [13:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:27:24] !log restarted hhvm on mw1145 after temp. freeze due to memory pressure (hhvm debug in /tmp/hhvm.17794.bt.) 
[13:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:28:22] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.207 second response time [13:28:42] RECOVERY - HHVM rendering on mw1145 is OK: HTTP OK: HTTP/1.1 200 OK - 70615 bytes in 0.616 second response time [13:28:53] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:53] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:29:02] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api [13:29:28] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2393612 (10KartikMistry) [13:31:03] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [13:31:03] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [13:33:32] (03CR) 10Hashar: [C: 031] fix puppet unit test for squid3 [puppet] - 10https://gerrit.wikimedia.org/r/295130 (owner: 10Maturain) [13:37:27] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [13:37:37] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [13:38:08] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [13:39:02] (03CR) 10Filippo Giunchedi: prometheus: add nginx reverse proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [13:42:55] cmjohnson1: Hi! 
I have some issues booting mw1274.eqiad.wmnet, if you have 5 mins can we chat about it today/tomorrow? [13:42:56] (03CR) 10Hashar: "I have tested it locally. The test pass / catch error just fine." [puppet] - 10https://gerrit.wikimedia.org/r/295130 (owner: 10Maturain) [13:43:21] elukey...what's the problem? [13:43:39] cmjohnson1: When I boot I see "The system detected an exception during the UEFI pre-boot environment." [13:43:57] and "A system restart is required." [13:44:11] tried to hard reset/powercycle but didn't get far [13:44:35] so I was wondering if you have any suggestion (meanwhile I kept going with the other new appservers) [13:44:54] lots of s5 errors since 13:37:50 [13:45:19] there could be a h/w failure in there or a setting is wrong. I will need to hook up the crash cart and see [13:45:43] cmjohnson1: do you want me to open a phab task? [13:45:53] (03CR) 10Andrew Bogott: [C: 04-1] "There are currently 8334 Ldap accounts, so we need a bigger limit." [puppet] - 10https://gerrit.wikimedia.org/r/295198 (https://phabricator.wikimedia.org/T122595) (owner: 10Muehlenhoff) [13:46:27] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 05Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2393637 (10mmodell) [13:47:19] (03PS2) 10Andrew Bogott: labstore: Fix LDAP query for project members [puppet] - 10https://gerrit.wikimedia.org/r/295099 (https://phabricator.wikimedia.org/T138102) (owner: 10Tim Landscheidt) [13:47:27] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 05Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2378301 (10mmodell) @fgiunchedi I landed D269, I believe this is ready to package and upload. Can you take it from here? 
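The LDAP review thread above (a result cap of 4096 against 8334 accounts) is the textbook case for paged queries, which is also what the abandoned ldaplist paging patch was after. Here is a generic paging sketch with an in-memory stand-in instead of a real LDAP client; `fetch_page` is hypothetical, and a real LDAP paged search would use an opaque server-side cookie rather than a plain offset:

```python
# fetch_page stands in for one paged-search round trip; the cookie here is
# just an offset, unlike the opaque cookie of a real LDAP paged search.
def fetch_page(records, cookie, page_size):
    page = records[cookie:cookie + page_size]
    end = cookie + len(page)
    return page, (end if end < len(records) else None)

def fetch_all(records, page_size=4096):
    results, cookie = [], 0
    while cookie is not None:
        page, cookie = fetch_page(records, cookie, page_size)
        results.extend(page)
    return results

accounts = [f"user{i}" for i in range(8334)]
print(len(fetch_all(accounts)))  # 8334 -- nothing truncated at the 4096 cap
```

A single capped query would silently return only the first 4096 entries, which is exactly why bumping the server size limit alone was rejected as insufficient.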
[13:47:39] elukey: yes please [13:47:52] seems happening at the same time than slow Wikibase\Lib\Store\Sql\SqlEntityInfoBuilder::collectTermsForEntities [13:49:14] (03CR) 10Andrew Bogott: [C: 032] "Thanks Tim!" [puppet] - 10https://gerrit.wikimedia.org/r/295099 (https://phabricator.wikimedia.org/T138102) (owner: 10Tim Landscheidt) [13:51:41] (03PS1) 10Elukey: Add new appservers to the MediaWiki DSH scap list. [puppet] - 10https://gerrit.wikimedia.org/r/295222 [13:52:20] (03CR) 10Elukey: [C: 032] Add new appservers to the MediaWiki DSH scap list. [puppet] - 10https://gerrit.wikimedia.org/r/295222 (owner: 10Elukey) [13:53:42] (03PS1) 10Andrew Bogott: Revert "labstore: Fix LDAP query for project members" [puppet] - 10https://gerrit.wikimedia.org/r/295223 [13:55:10] (03PS2) 10Andrew Bogott: Revert "labstore: Fix LDAP query for project members" [puppet] - 10https://gerrit.wikimedia.org/r/295223 [13:56:48] (03CR) 10Andrew Bogott: [C: 032] Revert "labstore: Fix LDAP query for project members" [puppet] - 10https://gerrit.wikimedia.org/r/295223 (owner: 10Andrew Bogott) [13:58:03] (03PS1) 10Andrew Bogott: labstore: Fix LDAP query for project members [puppet] - 10https://gerrit.wikimedia.org/r/295225 (https://phabricator.wikimedia.org/T138102) [13:58:55] (03CR) 10Alex Monk: Provision a "staging" microsite for the transparency report (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/295192 (https://phabricator.wikimedia.org/T138197) (owner: 10Ori.livneh) [14:00:34] 06Operations, 10ops-eqiad, 06DC-Ops: New appserver mw1274 shows boot errors - https://phabricator.wikimedia.org/T138221#2393657 (10elukey) [14:00:45] (03CR) 10Andrew Bogott: [C: 04-1] "Sorry about the commit/revert -- I don't think the search string is right. Comment inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/295225 (https://phabricator.wikimedia.org/T138102) (owner: 10Andrew Bogott) [14:10:46] (03CR) 10Tim Landscheidt: labstore: Fix LDAP query for project members (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/295225 (https://phabricator.wikimedia.org/T138102) (owner: 10Andrew Bogott) [14:18:21] 06Operations, 07LDAP: Add wmf LDAP group members into nda group, delete wmf group - https://phabricator.wikimedia.org/T129786#2393694 (10Krenair) This may have just been undermined by the use of the wmf group without the nda group in {T138197} I suggest we do it regardless. [14:19:27] (03CR) 10Andrew Bogott: [C: 032] labstore: Fix LDAP query for project members (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/295225 (https://phabricator.wikimedia.org/T138102) (owner: 10Andrew Bogott) [14:21:12] (03PS1) 10Muehlenhoff: package_builder: Add gobject-introspection to package list [puppet] - 10https://gerrit.wikimedia.org/r/295226 [14:28:27] (03PS1) 10Gehel: Manage Postgresql data dir with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) [14:29:34] (03PS1) 10Giuseppe Lavagetto: hhvm: create file logs with the web user credentials [puppet] - 10https://gerrit.wikimedia.org/r/295229 [14:29:45] (03CR) 10jenkins-bot: [V: 04-1] Manage Postgresql data dir with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [14:33:20] (03CR) 10Ema: [C: 031] hhvm: create file logs with the web user credentials [puppet] - 10https://gerrit.wikimedia.org/r/295229 (owner: 10Giuseppe Lavagetto) [14:33:33] (03CR) 10Giuseppe Lavagetto: [C: 032] hhvm: create file logs with the web user credentials [puppet] - 10https://gerrit.wikimedia.org/r/295229 (owner: 10Giuseppe Lavagetto) [14:38:24] (03PS1) 10Jcrespo: Pool db1071 with low weight after maintenance, depool db1068 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/295230 [14:39:51] greg-g, https://wikitech.wikimedia.org/wiki/Deployments doesn't list any deployment changes on account of wikimania .. so, deployments proceed as usual? [14:40:17] (03PS2) 10Gehel: Manage Postgresql data dir with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) [14:41:34] (03CR) 10jenkins-bot: [V: 04-1] Manage Postgresql data dir with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [14:41:42] (03CR) 10Gehel: Manage Postgresql data dir with Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [14:42:59] (03PS6) 10Giuseppe Lavagetto: systemd: add systemd::sidekick [puppet] - 10https://gerrit.wikimedia.org/r/291949 [14:43:45] (03PS3) 10Gehel: Manage Postgresql data dir with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) [14:44:47] RECOVERY - mediawiki-installation DSH group on mw2249 is OK: OK [14:44:47] RECOVERY - mediawiki-installation DSH group on mw2248 is OK: OK [14:44:48] (03CR) 10Giuseppe Lavagetto: [C: 032] systemd: add systemd::sidekick [puppet] - 10https://gerrit.wikimedia.org/r/291949 (owner: 10Giuseppe Lavagetto) [14:45:03] (03CR) 10jenkins-bot: [V: 04-1] Manage Postgresql data dir with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [14:46:19] subbu: well wikimania has previously resulted in the mile high deployers and mile high site repair awards, probably not [14:46:31] s/not/yes [14:46:48] (those two awards aren't connected, in all cases) [14:47:17] RECOVERY - mediawiki-installation DSH group on mw2250 is OK: OK [14:47:36] <_joe_> we're not going to be mile high this time [14:47:39] <_joe_> km high, yes [14:47:40] ok .. so, probably yes .. deployments continue as scheduled? 
[14:47:46] _joe_, :) [14:48:35] subbu: _joe_ yeah: train continues along with whatever else. only one person from releng will be at wikimania so we're mostly having a normal week [14:48:49] greg-g, ok. [15:00:04] anomie, ostriches, thcipriani, and marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160620T1500). [15:00:04] James_F, kart_, RoanKattouw, tgr, and Dereckson: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [15:00:14] * James_F waves. [15:00:17] I'm around. [15:00:18] * RoanKattouw waves [15:01:04] I can SWAT today [15:01:29] (03PS4) 10Gehel: Manage Postgresql data dir with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) [15:01:46] Hi. [15:02:11] (03PS2) 10Thcipriani: Enable VisualEditor by default for all users of French Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295200 (https://phabricator.wikimedia.org/T133263) (owner: 10Jforrester) [15:02:35] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295200 (https://phabricator.wikimedia.org/T133263) (owner: 10Jforrester) [15:02:43] (03CR) 10jenkins-bot: [V: 04-1] Manage Postgresql data dir with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [15:03:11] (03Merged) 10jenkins-bot: Enable VisualEditor by default for all users of French Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295200 (https://phabricator.wikimedia.org/T133263) (owner: 10Jforrester) [15:04:04] (03PS5) 10Gehel: Manage Postgresql data dir with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) [15:04:43] !log thcipriani@tin Synchronized dblists/visualeditor-default.dblist: SWAT: [[gerrit:295200|Enable VisualEditor by 
default for all users of French Wikinews]] (duration: 00m 29s) [15:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:04:48] ^ James_F check please [15:05:15] (03PS6) 10Thcipriani: Deploy Compact Language Links as default (Stage 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [15:07:51] thcipriani: Yup, LGTM. [15:07:58] James_F: thanks for checking [15:08:26] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [15:09:05] (03Merged) 10jenkins-bot: Deploy Compact Language Links as default (Stage 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [15:10:41] (03CR) 10Gehel: "At last RuboCop is happy!" [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [15:11:58] !log jmm@palladium conftool action : select; selector: name=mw1099.eqiad.wmnet [15:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:09] !log thcipriani@tin Synchronized dblists/cll-nondefault.dblist: SWAT: [[gerrit:294874|Deploy Compact Language Links as default (Stage 1)]] PART I (duration: 00m 29s) [15:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:47] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:294874|Deploy Compact Language Links as default (Stage 1)]] PART II (duration: 00m 29s) [15:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:45] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:294874|Deploy Compact Language Links as default (Stage 1)]] PART III (duration: 00m 30s) [15:13:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:50] ^ kart_ check please [15:14:14] OK! [15:14:24] (03PS2) 10Thcipriani: Enable Flow beta feature on frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294909 (https://phabricator.wikimedia.org/T138064) (owner: 10Catrope) [15:14:51] (03PS2) 10Jcrespo: Pool db1071 with low weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295230 [15:14:52] tgr: around for SWAT? [15:15:14] thcipriani: here [15:17:29] thcipriani: looks good! [15:17:33] thcipriani: thanks! [15:17:36] kart_: cool, thanks for checking :) [15:17:50] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294909 (https://phabricator.wikimedia.org/T138064) (owner: 10Catrope) [15:18:40] (03Merged) 10jenkins-bot: Enable Flow beta feature on frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294909 (https://phabricator.wikimedia.org/T138064) (owner: 10Catrope) [15:20:36] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:294909|Enable Flow beta feature on frwikiquote]] (duration: 00m 28s) [15:20:39] ^ RoanKattouw check please [15:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:21:53] thcipriani: Working, thanks! [15:22:03] RoanKattouw: cool, thanks for checking! [15:22:26] (03PS1) 10Elukey: Add rack awareness to the aqs100[456] Cassandra nodes. [puppet] - 10https://gerrit.wikimedia.org/r/295233 [15:23:29] hmm, so for wmf.6 when I fetched, I also pulled down: https://gerrit.wikimedia.org/r/#/c/295024/ anyone know anything about that? [15:24:40] fine with deploying it if there's someone around to test it afterwards, I don't want it to get lost in the patch shuffle [15:24:42] slap ori? ;) [15:25:04] He +2ed with "LGTM; cherry-pick is your call." 
[15:25:14] So he was probably mistaken about which patch he was +2ing :) [15:25:19] Yeah [15:25:22] He +2'd the other too [15:25:50] yeah, +2'd master one 8 hours later [15:26:11] (03CR) 10Eevans: [C: 031] "As discussed on IRC, this will alter placement and require that the cluster be reset (or each instance decomm'd / rebootstrapped). Otherw" [puppet] - 10https://gerrit.wikimedia.org/r/295233 (owner: 10Elukey) [15:28:02] (03CR) 10EBernhardson: "Does this also need an update to regex.pp to set information about which machines are in which racks?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294918 (owner: 10Gehel) [15:28:26] AaronSchulz: ori should https://gerrit.wikimedia.org/r/#/c/295024/ go out? I'll continue with SWAT for the time being and circle back to what to do with that patch at the end of SWAT. [15:28:36] (03CR) 10EBernhardson: "hieradata/regex.yaml actually." [puppet] - 10https://gerrit.wikimedia.org/r/294918 (owner: 10Gehel) [15:29:26] (03CR) 10EBernhardson: Configuration for new elasticsearch servers in eqiad. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294918 (owner: 10Gehel) [15:30:08] 06Operations, 10ops-codfw: ms-be2012.codfw.wmnet: slot=10 dev=sdk failed - https://phabricator.wikimedia.org/T135975#2393875 (10Papaul) a:05Papaul>03fgiunchedi Disk replacement complete [15:31:01] !log thcipriani@tin Synchronized php-1.28.0-wmf.6/extensions/CentralAuth: SWAT: [[gerrit:295231|Split CentralAuthUser::queryAttached into cheap and expensive part]] (duration: 00m 31s) [15:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:31:06] ^ tgr check please [15:32:09] (03PS2) 10Thcipriani: Enable NewUserMessage on pl.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295178 (https://phabricator.wikimedia.org/T138169) (owner: 10Dereckson) [15:32:20] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 05Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2393879 (10fgiunchedi) @mmodell for sure, I'm assuming you'd need it available for jessie and trusty? [15:33:22] thcipriani: works, thanks! [15:33:32] tgr: thanks for checking! [15:34:15] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295178 (https://phabricator.wikimedia.org/T138169) (owner: 10Dereckson) [15:34:51] (03Merged) 10jenkins-bot: Enable NewUserMessage on pl.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295178 (https://phabricator.wikimedia.org/T138169) (owner: 10Dereckson) [15:35:20] (03CR) 10Gehel: Configuration for new elasticsearch servers in eqiad. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294918 (owner: 10Gehel) [15:36:28] 06Operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=4 dev=sde failed - https://phabricator.wikimedia.org/T137785#2393887 (10Papaul) a:03fgiunchedi Disk replacement complete [15:36:40] (03CR) 10Gehel: "Changing the masters to new nodes is tracked in T112556" [puppet] - 10https://gerrit.wikimedia.org/r/294918 (owner: 10Gehel) [15:37:15] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:295178|Enable NewUserMessage on pl.wikipedia]] (duration: 00m 25s) [15:37:17] ^ Dereckson check please [15:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:37:23] Checking. [15:37:57] (03PS2) 10Thcipriani: Allow sysops to add to/remove from confirmed on ca.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294933 (https://phabricator.wikimedia.org/T138069) (owner: 10Dereckson) [15:39:14] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, 13Patch-For-Review: Only use newer (elastic10{16..31}) servers as master capable elasticsearch nodes - https://phabricator.wikimedia.org/T112556#2393900 (10EBernhardson) 05declined>03Open [15:39:38] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Puppet has 1 failures [15:40:30] thcipriani: looks good at Special:Version, I don't see on https://pl.wikipedia.org/w/index.php?namespace=3&tagfilter=&title=Specjalna%3AOstatnie+zmiany a welcome message, but it's only enabled for accounts newly created directly on pl. so expected [15:40:46] Dereckson: ack. Thanks for checking. 
[15:41:24] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294933 (https://phabricator.wikimedia.org/T138069) (owner: 10Dereckson) [15:41:26] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, 13Patch-For-Review: Only use newer (elastic10{16..47}) servers as master capable elasticsearch nodes - https://phabricator.wikimedia.org/T112556#2393906 (10EBernhardson) [15:42:07] (03Merged) 10jenkins-bot: Allow sysops to add to/remove from confirmed on ca.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294933 (https://phabricator.wikimedia.org/T138069) (owner: 10Dereckson) [15:44:00] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:294933|Allow sysops to add to/remove from confirmed on ca.wikinews]] (duration: 00m 25s) [15:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:44:06] ^ Dereckson check please [15:45:11] Works. [15:45:41] Dereckson: thanks for checking. [15:47:47] AaronSchulz: ori I think I will revert https://gerrit.wikimedia.org/r/#/c/295024/1 which I pulled down with another, unrelated php-1.28.0-wmf.6 patch since (a) it's not clear if it was supposed to merge, (b) no one is around to test it if I SWAT it, and (c) I don't want anyone to sync it on accident without anyone to test [15:49:32] (03PS1) 10Catrope: Enable Echo transition flags in beta labs for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295234 (https://phabricator.wikimedia.org/T132954) [15:49:32] Thanks for deploying. 
[15:50:10] (03CR) 10Catrope: [C: 032] Enable Echo transition flags in beta labs for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295234 (https://phabricator.wikimedia.org/T132954) (owner: 10Catrope) [15:50:45] (03Merged) 10jenkins-bot: Enable Echo transition flags in beta labs for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295234 (https://phabricator.wikimedia.org/T132954) (owner: 10Catrope) [15:51:49] thcipriani: While you're in there, could you pull down ---^^ ? (labs-only change) [15:51:55] (03PS2) 10Elukey: DHCP: Add mw2243 MAC address Bug:T135466 [puppet] - 10https://gerrit.wikimedia.org/r/294745 (https://phabricator.wikimedia.org/T135466) (owner: 10Papaul) [15:52:23] RoanKattouw: yarp, np [15:52:43] (03PS2) 10Gehel: Configuration for new elasticsearch servers in eqiad. [puppet] - 10https://gerrit.wikimedia.org/r/294918 [15:54:01] (03CR) 10Elukey: "Checked racadm getsysinfo, the mac address is the first one:" [puppet] - 10https://gerrit.wikimedia.org/r/294745 (https://phabricator.wikimedia.org/T135466) (owner: 10Papaul) [15:54:12] (03CR) 10Elukey: [C: 032] "Checked racadm getsysinfo, the mac address is the first one:" [puppet] - 10https://gerrit.wikimedia.org/r/294745 (https://phabricator.wikimedia.org/T135466) (owner: 10Papaul) [15:55:26] 06Operations, 10Wikimedia-Mailing-lists: mailman maint window 2016-06-21 16:00 - 18:00 UTC - https://phabricator.wikimedia.org/T138228#2393964 (10RobH) [15:58:33] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 05Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2393990 (10mmodell) @fgiunchedi Yes although jessie is my main target right now, trusty will be next. [16:03:35] (03CR) 10DCausse: Configuration for new elasticsearch servers in eqiad. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/294918 (owner: 10Gehel) [16:04:34] (03CR) 10Gehel: "/me is really bad at regexp. Thanks David!" 
[puppet] - 10https://gerrit.wikimedia.org/r/294918 (owner: 10Gehel) [16:05:04] pl.wikinews NewUserMessage is working: https://pl.wikipedia.org/wiki/Dyskusja_wikipedysty:Azxdw [16:05:12] pl.wikipedia pardon [16:11:27] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [16:11:47] (03CR) 10EBernhardson: "we don't have to, but i wonder if there would be less opportunity for mistakes using a more explicit regex syntax:" [puppet] - 10https://gerrit.wikimedia.org/r/294918 (owner: 10Gehel) [16:17:24] (03PS1) 10Dereckson: Fix pt.wikinews namespace issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295239 (https://phabricator.wikimedia.org/T138230) [16:25:56] (03CR) 10Dzahn: [C: 04-1] "ok, seems like we all agree it should be on iridum instead of the cluster. so let's do that" [dns] - 10https://gerrit.wikimedia.org/r/293747 (https://phabricator.wikimedia.org/T123718) (owner: 10Dzahn) [16:26:23] (03CR) 10Dzahn: [C: 04-2] switch git.wikimedia.org from misc to text cluster [dns] - 10https://gerrit.wikimedia.org/r/293747 (https://phabricator.wikimedia.org/T123718) (owner: 10Dzahn) [16:33:32] (03PS3) 10Jforrester: Enable VisualEditor by default for all users of the French Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292750 (https://phabricator.wikimedia.org/T136993) [16:33:34] (03PS3) 10Jforrester: Enable VisualEditor by default for all users of the English Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292751 (https://phabricator.wikimedia.org/T136992) [16:33:36] (03PS3) 10Jforrester: Enable VisualEditor by default for all users of the Russian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292748 (https://phabricator.wikimedia.org/T136995) [16:33:38] (03PS3) 10Jforrester: Enable VisualEditor by default for all users of the Italian Wikivoyage [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/292749 (https://phabricator.wikimedia.org/T136994) [16:33:40] (03PS3) 10Jforrester: Enable VisualEditor by default for all users of the German Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292752 (https://phabricator.wikimedia.org/T136991) [16:35:54] thcipriani: around? [16:36:19] kart_: aroundish. I'm in a meeting. What's up? [16:37:30] thcipriani: slight problem with my deployed patch, we have to revert, I guess. [16:37:56] thcipriani: did you sync dblist? [16:38:04] kart_: I did sync dblist. [16:38:24] 06Operations, 10Wikimedia-Mailing-lists: mailman maint window 2016-06-21 16:00 - 18:00 UTC - https://phabricator.wikimedia.org/T138228#2394131 (10RobH) I already got a few messages pointing out that taking down the wikimania mailing list the week of folks traveling there may not be ideal for communication! I'... [16:38:26] wondering then why it deployed everywhere? :/ [16:39:30] kart_: ugh. OK, I can revert. [16:39:33] thcipriani: can you check, https://gerrit.wikimedia.org/r/#/c/294874 again? [16:40:22] should default stay as true? [16:40:31] it should be overridden, but looks weird [16:41:29] What priority is given to other dblists over project type ones? [16:41:34] Reedy: true is for it is beta feature [16:42:10] Reedy: what's that? where to set it? [16:42:22] Reedy: that is a good question. It seems like (given that it's false on enwiki) it may just be the first dblist it hits :\ [16:42:45] it should be dbname has highest billing, then group (wikipedia, wikiversity etc) then default [16:42:53] But that enabling list is just getting complex [16:42:59] 06Operations, 10Traffic, 06Community-Liaisons (Jul-Sep-2016): Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2394147 (10BBlack) 3 day log (over the weekend, basically since the last update above on the 17th): New usernames over the past 3 days: ``...
[16:43:49] Just revert out the change from IS, and leave the rest [16:45:08] Reedy: thanks. Fixing. [16:45:31] Then probably test the resultant config on the test host a bit more :) [16:46:28] Reedy: did that for testwikis, but dblists were new for me :/ [16:46:49] Reedy: I should only make wikipedia => true, right? [16:48:12] (03PS1) 10Steinsplitter: Revert "Deploy Compact Language Links as default (Stage 1)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295243 (https://phabricator.wikimedia.org/T136677) [16:49:29] Steinsplitter: I'm fixing it. [16:49:54] (03PS1) 10KartikMistry: Fix deployment of Compact Language Links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295245 [16:50:13] thcipriani: Reedy can you look ^ [16:51:58] (03PS1) 10Papaul: DNC: Add prod DNS entried for ms-be202[2-7] Bug:T136630 [dns] - 10https://gerrit.wikimedia.org/r/295246 (https://phabricator.wikimedia.org/T136630) [16:53:27] (03Abandoned) 10Steinsplitter: Revert "Deploy Compact Language Links as default (Stage 1)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295243 (https://phabricator.wikimedia.org/T136677) (owner: 10Steinsplitter) [16:53:43] thcipriani: ping me when around. [16:54:31] kart_: would you remove wikipedia, wiktionary, wikiversity, wikiquote, wikibooks, wikinews, wikivoyage, and cll-nondefault in IS.php? It makes it basically a revert of the patch, but makes it easier to test moving to a dblist. I think that's what Reedy was saying. [16:56:58] thcipriani: OK [16:57:09] am I here now? [16:57:17] yuvipanda: yes [16:58:01] thcipriani: what we need is to not include cx-nondefault list. so, it should be true and stay. [16:58:11] thcipriani, kart_: could we please just revert all wikipedias and discuss later how to fix it [16:58:55] doing [16:59:00] Nikerabbit: Doable. [16:59:03] thcipriani: Please do.
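The dblist precedence Reedy describes above — an exact dbname entry beats a group/dblist entry (wikipedia, wikiversity, etc.), which beats the default — can be sketched roughly as below. This is a hypothetical illustration of the resolution order, not MediaWiki's actual SiteConfiguration code; the setting name `wmgUseCompactLinks` and the `groups` map are invented for the example.

```python
def resolve_setting(setting, dbname, groups):
    """Resolve a per-wiki setting: dbname > group/dblist > default.

    `setting` maps dbnames/group names/'default' to values; `groups`
    maps a dbname to the ordered dblists it belongs to (hypothetical).
    """
    if dbname in setting:
        return setting[dbname]        # exact dbname: highest priority
    for group in groups.get(dbname, []):
        if group in setting:
            return setting[group]     # first matching group/dblist
    return setting.get("default")     # fall back to the default

# Example mirroring the discussion: enabled for the 'wikipedia' group,
# explicitly disabled on one wiki (setting name is invented).
wmgUseCompactLinks = {
    "default": False,
    "wikipedia": True,
    "enwiki": False,
}
groups = {
    "enwiki": ["wikipedia"],
    "plwiki": ["wikipedia"],
    "cawikinews": ["wikinews"],
}
```

Under this ordering, the explicit `enwiki` entry wins over the `wikipedia` group entry — which is why the observed "false on enwiki, first dblist it hits" behaviour suggested the real resolution order differed from the intended one.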
[16:59:48] elukey and _joe_ : https://wikitech.wikimedia.org/wiki/Incident_documentation/20160620-ores [16:59:52] FYI :) [17:00:02] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: revert cll patch (duration: 00m 25s) [17:00:04] gehel: Respected human, time to deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160620T1700). Please do the needful. [17:00:04] SMalyshev: A patch you scheduled for Weekly Wikidata query service deployment window is about to be deployed. Please be available during the process. [17:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:00:39] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [17:01:00] kart_: Nikerabbit I reverted the patch but only sync'd the wmf-config/InitialiseSettings.php portion. Let me know what you want to do WRT a real revert. [17:01:22] thcipriani: thanks! [17:01:39] Amir1: thanks! Would you mind adding more info to the summary section? It is good but a bit generic :) [17:01:51] thcipriani: thanks. [17:01:52] discussing where to go now [17:02:06] SMalyshev: I'll start the deployment on beta first [17:02:17] gehel: cool, thanks [17:06:07] 06Operations, 06Community-Liaisons, 10Wikimedia-Mailing-lists: mailman maint window 2016-06-21 16:00 - 18:00 UTC - https://phabricator.wikimedia.org/T138228#2394226 (10Qgil) [17:08:19] 06Operations, 10DBA: m1-master switch from db1001 to db1016 - https://phabricator.wikimedia.org/T106312#2394232 (10jcrespo) After the clean up, these are the databases that are still there (aside from the mysql ones mysql, information_schema and performance schema): ``` bacula etherpadlite heartbeat librenms p...
[17:08:22] (03PS3) 10BBlack: tlsproxy: redirect-only service on 8080 [puppet] - 10https://gerrit.wikimedia.org/r/294706 (https://phabricator.wikimedia.org/T107236) [17:08:45] (03CR) 10BBlack: [C: 032 V: 032] "Works as expected in manual and compiler testing." [puppet] - 10https://gerrit.wikimedia.org/r/294706 (https://phabricator.wikimedia.org/T107236) (owner: 10BBlack) [17:09:00] thcipriani: can you fully revert the patch? have to rethink approach with dblist. [17:09:13] kart_: yup. doing. [17:09:14] (03Abandoned) 10KartikMistry: Fix deployment of Compact Language Links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295245 (owner: 10KartikMistry) [17:09:34] 06Operations, 10Traffic, 13Patch-For-Review: Switch port 80 to nginx on primary clusters - https://phabricator.wikimedia.org/T107236#2394235 (10BBlack) [17:09:48] SMalyshev: wdqs-updater is not restarting on wdqs-beta. Checking... [17:13:45] thcipriani: ping me if anything needed, I'm around. [17:15:00] 06Operations, 06Performance-Team, 10Traffic, 10Wikimedia-Stream, 13Patch-For-Review: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2394258 (10BBlack) @ori did notification go out? We're now 7 days from cert expiry. 
[17:15:42] (03PS1) 10Thcipriani: Revert "Deploy Compact Language Links as default (Stage 1)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295248 [17:16:23] (03CR) 10Thcipriani: [C: 032] Revert "Deploy Compact Language Links as default (Stage 1)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295248 (owner: 10Thcipriani) [17:17:01] (03Merged) 10jenkins-bot: Revert "Deploy Compact Language Links as default (Stage 1)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295248 (owner: 10Thcipriani) [17:18:52] !log Rebooting pfw-codfw [17:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:19:28] !log upload libphutil / arcanist 0~git20160616-0wmf1 to jessie-wikimedia T137770 [17:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:19:50] nooo stashbot where are you? :( [17:19:58] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 05Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2394266 (10mmodell) [17:22:38] PROBLEM - Host alnilam is DOWN: PING CRITICAL - Packet loss = 100% [17:22:47] PROBLEM - Host payments2003 is DOWN: PING CRITICAL - Packet loss = 100% [17:23:35] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2394275 (10jcrespo) I can confirm the error did not happen on both systems after BIOS update and restart. 
[17:23:37] PROBLEM - Host saiph is DOWN: PING CRITICAL - Packet loss = 100% [17:23:47] PROBLEM - Host rigel is DOWN: PING CRITICAL - Packet loss = 100% [17:24:17] PROBLEM - Host fdb2001 is DOWN: PING CRITICAL - Packet loss = 100% [17:24:28] PROBLEM - Host heka is DOWN: PING CRITICAL - Packet loss = 100% [17:25:47] PROBLEM - Host pay-lvs2001 is DOWN: PING CRITICAL - Packet loss = 100% [17:25:56] PROBLEM - Router interfaces on pfw-codfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.195 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [17:26:00] stashbot: <3 [17:26:33] !log deploying latest WDQS [17:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:29:07] RECOVERY - Host alnilam is UP: PING OK - Packet loss = 0%, RTA = 38.18 ms [17:29:18] RECOVERY - Host payments2003 is UP: PING OK - Packet loss = 0%, RTA = 39.07 ms [17:30:19] RECOVERY - Host rigel is UP: PING OK - Packet loss = 0%, RTA = 36.96 ms [17:30:19] SMalyshev: deployment done, looks good to me (test queries, UI, updater logs). Let me know if you want to check anything more... [17:30:28] RECOVERY - Host pay-lvs2001 is UP: PING OK - Packet loss = 0%, RTA = 36.99 ms [17:30:31] gehel: thanks! [17:30:34] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2394293 (10jcrespo) CORRECTION: it is not happening on es2019, but it happened again on es2017- I would double check by rebooting once more that host (es2017)- the error was seconds before det... [17:30:37] RECOVERY - Host saiph is UP: PING OK - Packet loss = 0%, RTA = 37.64 ms [17:30:48] RECOVERY - Host fdb2001 is UP: PING OK - Packet loss = 0%, RTA = 36.66 ms [17:31:28] PROBLEM - Host bellatrix is DOWN: PING CRITICAL - Packet loss = 100% [17:31:37] PROBLEM - Host betelgeuse is DOWN: PING CRITICAL - Packet loss = 100% [17:32:31] !log https://tools.wmflabs.org/sal missing events between 2016-06-19T12:29 and 2016-06-20T17:26. 
[17:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:33:19] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2394310 (10jcrespo) Logs: ``` SUP0516: Updating firmware for PowerEdge BIOS to version 2.1.6. 2016-06-15T16:52:33-0500 Log Sequence Number: 273 Detailed Description: Do not turn off the sys... [17:35:16] RECOVERY - check_mysql on payments2001 is OK: Uptime: 359584 Threads: 3 Questions: 11485 Slow queries: 5 Opens: 30 Flush tables: 1 Open tables: 64 Queries per second avg: 0.031 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [17:35:17] RECOVERY - Host heka is UP: PING OK - Packet loss = 0%, RTA = 37.36 ms [17:35:42] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2394312 (10Paladox) @greg Hi, I spoke with @dzahn about this as he will be in the same timezone as me I think he sai... [17:35:56] thcipriani: did not see log message, done? [17:35:58] PROBLEM - Host payments2002 is DOWN: PING CRITICAL - Packet loss = 100% [17:36:07] PROBLEM - Host pay-lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% [17:36:15] (03PS1) 10BBlack: ncredir hostname and service IP [dns] - 10https://gerrit.wikimedia.org/r/295249 (https://phabricator.wikimedia.org/T133548) [17:36:31] kart_: yes, it should be reverted [17:36:57] PROBLEM - Host mintaka is DOWN: PING CRITICAL - Packet loss = 100% [17:37:06] PROBLEM - Host alnitak is DOWN: PING CRITICAL - Packet loss = 100% [17:38:04] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2394318 (10Paladox) [17:40:07] thcipriani: thanks! 
[17:40:15] PROBLEM - check_puppetrun on saiph is CRITICAL: CRITICAL: Puppet has 1 failures [17:40:15] RECOVERY - check_puppetrun on payments2001 is OK: OK: Puppet is currently enabled, last run 66 seconds ago with 0 failures [17:41:15] RECOVERY - Host payments2002 is UP: PING OK - Packet loss = 0%, RTA = 37.48 ms [17:41:25] RECOVERY - Router interfaces on pfw-codfw is OK: OK: host 208.80.153.195, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 [17:41:36] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [17:42:37] RECOVERY - Host pay-lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms [17:43:07] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: Puppet has 1 failures [17:43:13] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2394335 (10Paladox) [17:43:37] RECOVERY - Host mintaka is UP: PING OK - Packet loss = 0%, RTA = 39.04 ms [17:43:57] RECOVERY - Host alnitak is UP: PING OK - Packet loss = 0%, RTA = 37.99 ms [17:45:07] PROBLEM - check_puppetrun on rigel is CRITICAL: CRITICAL: Puppet has 6 failures [17:45:07] RECOVERY - check_puppetrun on pay-lvs2001 is OK: OK: Puppet is currently enabled, last run 193 seconds ago with 0 failures [17:45:16] PROBLEM - check_puppetrun on saiph is CRITICAL: CRITICAL: Puppet has 1 failures [17:45:16] PROBLEM - check_puppetrun on alnilam is CRITICAL: CRITICAL: Puppet has 7 failures [17:45:16] RECOVERY - check_puppetrun on payments2002 is OK: OK: Puppet is currently enabled, last run 114 seconds ago with 0 failures [17:45:17] RECOVERY - Host betelgeuse is UP: PING OK - Packet loss = 0%, RTA = 37.44 ms [17:45:26] RECOVERY - Host bellatrix is UP: PING OK - Packet loss = 0%, RTA = 37.38 ms [17:46:46] RECOVERY - Unmerged changes on 
repository mediawiki_config on mira is OK: No changes to merge. [17:47:49] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2374127 (10Paladox) a:05Danny_B>03Dzahn Re assigned the dns bit to @Dzahn @Danny_B did the apache re write rules. [17:50:15] PROBLEM - check_puppetrun on rigel is CRITICAL: CRITICAL: Puppet has 6 failures [17:50:15] RECOVERY - check_puppetrun on pay-lvs2002 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [17:50:15] RECOVERY - check_puppetrun on saiph is OK: OK: Puppet is currently enabled, last run 133 seconds ago with 0 failures [17:50:15] PROBLEM - check_puppetrun on alnilam is CRITICAL: CRITICAL: Puppet has 7 failures [17:50:16] RECOVERY - check_puppetrun on payments2003 is OK: OK: Puppet is currently enabled, last run 177 seconds ago with 0 failures [17:51:25] 06Operations, 10Ops-Access-Requests, 06Parsing-Team, 06Services: Allow the Services team to administer the Parsoid cluster - https://phabricator.wikimedia.org/T137879#2394346 (10RobH) a:05RobH>03ssastry So we specifically need input from @ssastry on allowing services to administer the parsing team's pa... [17:51:36] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: Migrate Gitblit (git.wikimedia.org) -> Diffusion (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2394349 (10Danny_B) [17:53:09] 06Operations, 10Ops-Access-Requests: Allow *-admin groups to see systemd logs for their units - https://phabricator.wikimedia.org/T137878#2394352 (10RobH) a:05RobH>03Joe This request has been denied in the operations meeting, in its currently proposed use of sudo rights to view the syslog. Rather, its bee... 
[17:54:01] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: Migrate Gitblit (git.wikimedia.org) -> Diffusion (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2394357 (10Paladox) [17:55:15] PROBLEM - check_puppetrun on rigel is CRITICAL: CRITICAL: Puppet has 6 failures [17:55:15] RECOVERY - check_puppetrun on alnilam is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [17:55:29] 06Operations, 10Ops-Access-Requests, 06Parsing-Team, 06Services: Allow the Services team to administer the Parsoid cluster - https://phabricator.wikimedia.org/T137879#2394358 (10ssastry) I approve. [17:55:35] 06Operations, 07Blocked-on-RelEng, 05Gitblit-Deprecate, 13Patch-For-Review: Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit) - https://phabricator.wikimedia.org/T123718#2394360 (10Paladox) I'm going to merge this into T137224 since that task is also doing what this task here is describing. [17:55:49] 06Operations, 10Ops-Access-Requests, 06Parsing-Team, 06Services: Allow the Services team to administer the Parsoid cluster - https://phabricator.wikimedia.org/T137879#2394362 (10ssastry) a:05ssastry>03None [17:55:55] 06Operations, 07Blocked-on-RelEng, 05Gitblit-Deprecate, 13Patch-For-Review: Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit) - https://phabricator.wikimedia.org/T123718#2394363 (10Paladox) [17:55:57] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: Migrate Gitblit (git.wikimedia.org) -> Diffusion (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2374778 (10Paladox) [17:56:35] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: Migrate Gitblit (git.wikimedia.org) -> Diffusion (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2394366 (10Paladox) [17:56:38] 06Operations, 07Blocked-on-RelEng, 05Gitblit-Deprecate, 13Patch-For-Review: 
Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit) - https://phabricator.wikimedia.org/T123718#2394369 (10greg) [17:58:16] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2394373 (10Paladox) [17:58:18] 06Operations, 07Blocked-on-RelEng, 05Gitblit-Deprecate, 13Patch-For-Review: Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit) - https://phabricator.wikimedia.org/T123718#2394371 (10Paladox) 05duplicate>03Open Per @greg since this is a tracking task. [17:58:20] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: Migrate Gitblit (git.wikimedia.org) -> Diffusion (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2394374 (10greg) Please coordinate with releng first. [18:00:15] RECOVERY - check_puppetrun on rigel is OK: OK: Puppet is currently enabled, last run 126 seconds ago with 0 failures [18:00:25] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: Migrate Gitblit (git.wikimedia.org) -> Diffusion (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2394387 (10Paladox) @greg oh, how, what do we say. [18:00:39] 06Operations, 13Patch-For-Review: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991#2394390 (10fgiunchedi) I agree alerting would be tricky to get right in this case, most of the value would be in auditing what machines still need a service restart. If... [18:03:43] 06Operations, 10DBA: m1-master switch from db1001 to db1016 - https://phabricator.wikimedia.org/T106312#2394393 (10jcrespo) @Akosiaris @MoritzMuehlenhoff @fgiunchedi @demon @Krenair @Dzahn I intend to perform the failover on Wednesday 22, 16:00 UTC. I do not really need you to do anything, and this should be... 
[18:04:25] 06Operations, 10DBA: m1-master switch from db1001 to db1016 - https://phabricator.wikimedia.org/T106312#1464530 (10jcrespo) a:03jcrespo [18:06:15] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [18:06:42] (03PS3) 10Jcrespo: Pool db1071 with low weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295230 [18:08:32] (03CR) 10Jcrespo: [C: 032] Pool db1071 with low weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295230 (owner: 10Jcrespo) [18:09:43] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Pool db1071 with low weight after maintenance (duration: 00m 26s) [18:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:10:14] (03CR) 10Hashar: "Akosiaris: good news, Jenkins can put the builds history to a different path which would make it easier to exclude them entirely. Jenkin" [puppet] - 10https://gerrit.wikimedia.org/r/293690 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff) [18:16:28] (03PS1) 10Hashar: contint: do not backup Jenkins build history [puppet] - 10https://gerrit.wikimedia.org/r/295255 (https://phabricator.wikimedia.org/T80385) [18:16:39] (03CR) 10Hashar: "https://gerrit.wikimedia.org/r/295255 :D" [puppet] - 10https://gerrit.wikimedia.org/r/293690 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff) [18:20:15] RECOVERY - check_mysql on fdb2001 is OK: Uptime: 590676 Threads: 1 Questions: 6134454 Slow queries: 3626 Opens: 690 Flush tables: 2 Open tables: 573 Queries per second avg: 10.385 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1 [18:20:32] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: Migrate Gitblit (git.wikimedia.org) -> Diffusion (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2394448 (10greg) I'll write it this afternoon Pacific time. 
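The check_mysql lines throughout this log pair "Slave IO / Slave SQL" thread states with "Seconds Behind Master"; the CRITICAL alerts show "Slave IO: Connecting ... Seconds Behind Master: (null)", while the recoveries show both threads at "Yes" and a small numeric lag. A minimal sketch of that decision — a hypothetical parser, not the actual Nagios plugin, with an assumed lag threshold — could look like:

```python
import re

def slave_status(line, max_lag=300):
    """Classify check_mysql-style output (hypothetical logic).

    Healthy replication needs both slave threads running ("Yes") and a
    numeric Seconds Behind Master below `max_lag`. "(null)" lag means the
    SQL thread has no master position to compare against, e.g. while the
    IO thread is still "Connecting".
    """
    io = re.search(r"Slave IO: (\S+)", line)
    sql = re.search(r"Slave SQL: (\S+)", line)
    lag = re.search(r"Seconds Behind Master: (\S+)", line)
    if not (io and sql and lag):
        return "UNKNOWN"                  # output not in the expected shape
    if io.group(1) != "Yes" or sql.group(1) != "Yes":
        return "CRITICAL"                 # a replication thread is not running
    if lag.group(1) == "(null)" or int(lag.group(1)) > max_lag:
        return "CRITICAL"                 # lag unknown or too large
    return "OK"
```

Run against the two output shapes seen in this log, the "Connecting / (null)" line classifies as CRITICAL and the "Yes / Yes / 1" recovery line as OK.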
[18:24:26] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: Migrate Gitblit (git.wikimedia.org) -> Diffusion (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2394472 (10Paladox) @greg ok, thanks. [18:26:31] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 05Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2394504 (10fgiunchedi) @mmodell libphutil and arcanist are now in `jessie-wikimedia` I just noticed arcanist is missing a... [18:26:41] (03Abandoned) 10Papaul: DNC: Add prod DNS entried for ms-be202[2-7] Bug:T136630 [dns] - 10https://gerrit.wikimedia.org/r/295246 (https://phabricator.wikimedia.org/T136630) (owner: 10Papaul) [18:26:54] 06Operations: Frequent segfaults of rsvg-convert on image scalers - https://phabricator.wikimedia.org/T137876#2382253 (10Menner) I suspect this is not a recent issue and has always been present. [18:27:11] 06Operations, 07LDAP: Add wmf LDAP group members into nda group, delete wmf group - https://phabricator.wikimedia.org/T129786#2116331 (10ori) >>! In T129786#2393694, @Krenair wrote: > This may have just been undermined by the use of the wmf group without the nda group in {T138197} The request from Legal for... [18:28:33] 06Operations, 06Commons, 10media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#2394539 (10Menner) Recent Debain has librsvg 2.40.16 ready for backport too. https://packages.debian.org/source/stretch/librsvg [18:37:36] 06Operations, 07LDAP: Add wmf LDAP group members into nda group, delete wmf group - https://phabricator.wikimedia.org/T129786#2116331 (10hashar) Regarding Jenkins, historically we solely had the `wmf` group. Then volunteers needed access and we added `nda` and Wikimedia Deutschland got added to `wmde`. `wmf` a... 
[18:43:46] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 688 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5222819 keys - replication_delay is 688 [18:59:27] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5184186 keys - replication_delay is 0 [19:02:37] 06Operations, 07LDAP: Add wmf LDAP group members into nda group, delete wmf group - https://phabricator.wikimedia.org/T129786#2394840 (10Krenair) >>! In T129786#2394529, @ori wrote: >>>! In T129786#2393694, @Krenair wrote: >> This may have just been undermined by the use of the wmf group without the nda group... [19:04:48] 06Operations, 07LDAP: Add wmf LDAP group members into nda group, delete wmf group - https://phabricator.wikimedia.org/T129786#2116331 (10Krenair) (apparently phabricator re-adds subscribers when you edit posts mentioning them) [19:08:31] jynus, I'm not sure why you're pinging me at https://phabricator.wikimedia.org/T106312#2394393 but it sounds fine. any idea how long the swap might take? [19:14:21] 06Operations, 06Performance-Team, 10Traffic, 10Wikimedia-Stream, 13Patch-For-Review: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2394859 (10ori) >>! In T134871#2394258, @BBlack wrote: > @ori did notification go out? We're now 7 days from cert expiry... 
[19:17:16] !log aaron@tin Synchronized php-1.28.0-wmf.6/includes/api/ApiStashEdit.php: 82e14dc66f478fbdb9ca6eab1eeb4f9c68c99bd1 (duration: 00m 36s) [19:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:22:15] (03CR) 10Ori.livneh: [C: 04-1] "lgtm apart from missing require" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [19:26:34] 06Operations, 10ops-codfw, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2394867 (10Papaul) Servers are configured and ready for install but are not getting DHCP [19:29:04] 06Operations, 10ops-eqiad, 06DC-Ops: New appserver mw1274 shows boot errors - https://phabricator.wikimedia.org/T138221#2394871 (10Cmjohnson) 05Open>03Resolved The server was hung...performed a hard reset and drained flea power. Ready for install now. [19:36:35] (03CR) 10Nicko: "Thanks for the different comments, I'll try to refactor my commit message as, I agree, it's not fully "compliant", sorry. I will also try " (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/295123 (owner: 10Nicko) [19:41:26] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: Migrate Gitblit (git.wikimedia.org) -> Diffusion (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2394874 (10greg) Just to be clear, an email with one day notice is not enough :) we'll give a... 
[19:41:41] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 05Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2394876 (10mmodell) @fgiunchedi: See D276 [19:48:14] (03CR) 10Gehel: Manage Postgresql data dir with Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [19:50:46] !log cleaning up /scratch NFS share as it ran out of inodes [19:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:52:01] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 05Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2394898 (10hashar) [19:58:25] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [20:00:04] gwicke, cscott, arlolra, subbu, bearND, and mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160620T2000). Please do the needful. [20:00:12] no deploy today. 
[20:00:26] no deploy for mobileapps [20:01:29] (03CR) 10Ori.livneh: prometheus: add nginx reverse proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [20:09:26] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [20:09:40] :S [20:10:43] (03CR) 10Luke081515: [C: 031] Fix pt.wikinews namespace issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295239 (https://phabricator.wikimedia.org/T138230) (owner: 10Dereckson) [20:22:26] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [20:43:31] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: Smartctl disk defects on kafka1012 - https://phabricator.wikimedia.org/T136933#2395046 (10Nuria) 05Open>03Resolved [20:44:35] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [20:45:32] ebernhardson: ^ is that something you are watching then? :) [20:45:33] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 05Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2395061 (10hashar) [20:49:21] greg-g: around? [20:49:21] aude: You sent me a contentless ping. This is a contentless pong. Please provide a bit of information about what you want and I will respond when I am around. 
[20:49:48] greg-g: i would like to deploy a bug fix for wikidata
[20:50:06] and will be on an airplane at swat time, so would like to do it now
[20:50:46] https://gerrit.wikimedia.org/r/#/c/295309/
[20:52:33] 06Operations, 10Ops-Access-Requests, 06Parsing-Team, 06Services: Allow the Services team to administer the Parsoid cluster - https://phabricator.wikimedia.org/T137879#2395131 (10RobH) I've escalated this into discussion within the ops team. There is no operations team meeting next Monday (June 27th, 2016)...
[20:52:40] 06Operations, 10Ops-Access-Requests, 06Parsing-Team, 06Services: Allow the Services team to administer the Parsoid cluster - https://phabricator.wikimedia.org/T137879#2395133 (10RobH) a:03RobH
[20:52:53] * aude proceeds
[20:53:07] (can't wait until i get to esino lario)
[20:53:15] and is low risk
[20:53:54] Krenair, because you had something with rt
[20:54:16] rt? I don't think so...
[20:54:22] not rt
[20:54:25] racktables
[20:54:27] oh
[20:54:36] or servermon
[20:54:38] yeah, that's fine, thanks.
[20:54:51] failover will take 1 second
[20:55:14] the problem is what happens after it
[20:55:45] yeah I use the m1-master alias to access that
[20:55:59] I don't need up-to-the-second info from it
[20:56:50] no, that is not the issue
[20:56:58] the issue is if it stops working
[20:57:03] because new os
[20:57:07] because new ip
[20:57:15] because new mariadb version
[20:57:38] I need people to check after the failover things are still up
[20:58:17] well I can check if I'm available at that time
[21:02:03] * aude deploying
[21:05:33] (03PS1) 10Jhobs: Remove old mobile workaround for wikidata descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295311 (https://phabricator.wikimedia.org/T127250)
[21:05:55] !log aude@tin Synchronized php-1.28.0-wmf.6/extensions/Wikidata: Fix property suggester (duration: 01m 59s)
[21:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:11:43] * aude done
[21:22:14] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2395262 (10greg)
[21:24:19] (03PS6) 10Gehel: Manage Postgresql data dir with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092)
[21:24:36] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2395264 (10greg)
[21:30:06] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[21:32:17] (03CR) 10Gehel: Manage Postgresql data dir with Puppet (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel)
[21:34:04] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2395282 (10greg) (I hope I cleared up the task ordering confusion with the title/d...
[22:03:54] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[22:08:29] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service: ORES should advertise swagger specs under /?spec - https://phabricator.wikimedia.org/T137804#2395343 (10Halfak) a:03Halfak
[22:08:41] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service: ORES should advertise swagger specs under /?spec - https://phabricator.wikimedia.org/T137804#2379606 (10Halfak) p:05Unbreak!>03High
[22:09:14] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2395345 (10Paladox) @greg yep thanks.
[22:16:50] (03PS3) 10Nicko: Include a cassandra::instance::monitoring class [puppet] - 10https://gerrit.wikimedia.org/r/295123 (https://phabricator.wikimedia.org/T137422)
[22:18:24] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2395362 (10greg) @Dzahn Can you commit to a window next week to do this (merge all...
[22:19:31] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2395363 (10greg) email (pending date confirmation): https://etherpad.wikimedia.org...
[22:29:04] RECOVERY - puppet last run on labsdb1007 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[22:34:40] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[22:40:51] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[22:54:51] 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2395384 (10Danny_B)
[23:00:04] RoanKattouw, ostriches, Krenair, MaxSem, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160620T2300).
[23:00:04] jhobs and Dereckson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:06] Hi. I can SWAT this evening. Let's start with Jeff config change.
[23:00:43] I'm here
[23:02:20] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: puppet fail
[23:02:43] So the value has been removed from source code?
[23:04:09] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[23:04:44] Dereckson: the value being set there was an old alpha process. It's long since been removed and the variable is set to false by default in InitialiseSettings (which is the desired behavior)
[23:06:12] In your extension code, there is still the value at tests/browser/LocalSettings.php by the way
[23:06:35] Dereckson: sorry, the _process_ has been abolished
[23:06:39] the variable is still used
[23:06:42] ok
[23:06:49] but we're setting it in InitialiseSettings now
[23:07:04] and defaulting to false as we shift out of beta
[23:07:05] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295311 (https://phabricator.wikimedia.org/T127250) (owner: 10Jhobs)
[23:07:40] (03Merged) 10jenkins-bot: Remove old mobile workaround for wikidata descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295311 (https://phabricator.wikimedia.org/T127250) (owner: 10Jhobs)
[23:08:22] jhobs: live on mw1017, if you can test this with https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug
[23:11:23] (03PS2) 10Dereckson: Fix pt.wikinews namespace issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295239 (https://phabricator.wikimedia.org/T138230)
[23:12:36] Dereckson: yep, it's working, thanks
[23:13:23] !log dereckson@tin Synchronized wmf-config/mobile.php: Remove old mobile workaround for Wikidata descriptions (T127250, T138085) (duration: 00m 33s)
[23:13:25] T138085: Ensure Wikidata descriptions disabled by default on stable channel prod cluster mobile web Wikipedias - https://phabricator.wikimedia.org/T138085
[23:13:25] T127250: Prepare Wikidata descriptions to roll out to stable - https://phabricator.wikimedia.org/T127250
[23:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:13:31] Here you are in prod ^
[23:14:10] (03CR) 10Dereckson: [C: 032] Fix pt.wikinews namespace issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295239 (https://phabricator.wikimedia.org/T138230) (owner: 10Dereckson)
[23:14:43] Dereckson: thanks!
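The SWAT workflow above stages each change on mw1017 and verifies it via the X-Wikimedia-Debug header before syncing to the full production cluster. A minimal Python sketch of building such a request; the header name comes from the log, but the `backend=<host>` value format is an assumption for illustration (the authoritative format is on the linked Wikitech page):

```python
def debug_headers(backend):
    """Build request headers asking the edge caches to route this request
    to a specific staging appserver instead of the production pool.
    The 'backend=<host>' value format is an assumption, not verified."""
    return {"X-Wikimedia-Debug": "backend=%s" % backend}

headers = debug_headers("mw1017")
# e.g. requests.get("https://test.wikipedia.org/wiki/Main_Page", headers=headers)
```

Testing against the debug host first means a bad config change is caught on one machine rather than fleet-wide.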
[23:14:46] (03Merged) 10jenkins-bot: Fix pt.wikinews namespace issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295239 (https://phabricator.wikimedia.org/T138230) (owner: 10Dereckson)
[23:14:48] You're welcome.
[23:15:47] pt.wikinews change tested fine on mw1017 too, sending it to prod
[23:16:26] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Fix pt.wikinews namespace issue (T138230) (duration: 00m 24s)
[23:16:27] T138230: Restore former namespaces on pt.wikinews - https://phabricator.wikimedia.org/T138230
[23:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:16:59] Works. Script to run to update namespaces.
[23:19:31] PROBLEM - HHVM rendering on mw1252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:20:40] PROBLEM - Apache HTTP on mw1252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:22:09] !log `mwscript namespaceDupes.php ptwikinews --fix` (T138230). Some links and revisions are still to fix.
[23:22:10] T138230: Restore former namespaces on pt.wikinews - https://phabricator.wikimedia.org/T138230
[23:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:27:29] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[23:29:10] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[23:41:18] 06Operations, 13Patch-For-Review: Staging area for the next version of the transparency report - https://phabricator.wikimedia.org/T138197#2392917 (10MZMcBride) What's wrong with meta.wikimedia.org?
[23:48:10] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[23:49:48] ebernhardson: what's up with CirrusSearch?
[23:51:58] ori: not really sure, it's only codfw reporting that though, which only handles morelike queries (and index updates)
[23:52:23] morelike is only ~30 req/s, but they are quite expensive so can push the latency percentiles up high
[23:52:49] i'm not sure why it's doing it today and not before though :S
[23:57:47] p50 and p75 are pretty normal... but something is causing p95 to rise :(
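The closing exchange notes that p50 and p75 look normal while only p95 is elevated, which is exactly what a small share of expensive morelike queries would produce. A self-contained sketch of that effect; the 94/6 traffic split and the 100 ms / 1500 ms latencies are illustrative values, not measurements from the incident:

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the value at rank ceil(q/100 * n)."""
    s = sorted(samples)
    k = max(0, math.ceil(q / 100 * len(s)) - 1)
    return s[k]

# 94 cheap queries (~100 ms) plus 6 expensive morelike queries
# (~1500 ms): only ~6% of traffic, but past the 95th-percentile cut-off.
latencies = [100] * 94 + [1500] * 6

print(percentile(latencies, 50))  # 100  -> p50 looks normal
print(percentile(latencies, 75))  # 100  -> p75 looks normal
print(percentile(latencies, 95))  # 1500 -> p95 breaches the 1000 ms threshold
```

This is why an alert on p95 latency can fire while median latency, and most users' experience, is unchanged: any slow class of requests above ~5% of traffic dominates that percentile.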