[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170125T0000). Please do the needful. [00:00:23] Nothing to deploy. [00:05:03] (03PS2) 10Dereckson: Amend import sources for en.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333487 (https://phabricator.wikimedia.org/T155922) (owner: 10MarcoAurelio) [00:05:14] Actually, I'm going to deploy that one ^ [00:05:57] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333487 (https://phabricator.wikimedia.org/T155922) (owner: 10MarcoAurelio) [00:08:08] (03Merged) 10jenkins-bot: Amend import sources for en.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333487 (https://phabricator.wikimedia.org/T155922) (owner: 10MarcoAurelio) [00:08:10] 06Operations, 10ops-eqiad, 10hardware-requests: Decommission analytics1026 and analytics1015 - https://phabricator.wikimedia.org/T147313#2967363 (10Dzahn) [00:08:18] (03CR) 10jenkins-bot: Amend import sources for en.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333487 (https://phabricator.wikimedia.org/T155922) (owner: 10MarcoAurelio) [00:09:14] !log analytics1015,analytics1026 - revoked puppet cert, removing from puppet, shutting down (T147313) [00:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:21] T147313: Decommission analytics1026 and analytics1015 - https://phabricator.wikimedia.org/T147313 [00:09:30] 333487 live on mwdebug1002 [00:10:49] looks good, syncing [00:11:29] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Amend import sources for en.wikisource (T155922) (duration: 00m 47s) [00:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:33] T155922: English Wikisource — update to import sources - 
https://phabricator.wikimedia.org/T155922 [00:14:22] PROBLEM - Host analytics1015 is DOWN: PING CRITICAL - Packet loss = 100% [00:14:42] PROBLEM - Host analytics1026 is DOWN: PING CRITICAL - Packet loss = 100% [00:14:57] ACKNOWLEDGEMENT - Host analytics1015 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T147313 [00:14:57] ACKNOWLEDGEMENT - Host analytics1026 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T147313 [00:15:04] (03PS1) 10Dzahn: site.pp, DHCP: remove analytics1015, analytics1026 [puppet] - 10https://gerrit.wikimedia.org/r/334004 (https://phabricator.wikimedia.org/T147313) [00:15:32] (03PS2) 10Dzahn: site.pp, DHCP: remove analytics1015, analytics1026 [puppet] - 10https://gerrit.wikimedia.org/r/334004 (https://phabricator.wikimedia.org/T147313) [00:16:09] (03PS3) 10Dzahn: site.pp, DHCP: remove analytics1015, analytics1026 [puppet] - 10https://gerrit.wikimedia.org/r/334004 (https://phabricator.wikimedia.org/T147313) [00:19:10] 06Operations, 10DBA, 13Patch-For-Review: Reimage db1065 and db1066 - https://phabricator.wikimedia.org/T156005#2961400 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1066.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/201... [00:19:15] 06Operations, 10ops-esams, 10hardware-requests: decom cp3011-22 (12 machines) - https://phabricator.wikimedia.org/T130883#2149483 (10Dzahn) Can we remove these from puppet and shut them down? They are running idle since October. 
[00:19:51] (03CR) 10Dzahn: [C: 032] site.pp, DHCP: remove analytics1015, analytics1026 [puppet] - 10https://gerrit.wikimedia.org/r/334004 (https://phabricator.wikimedia.org/T147313) (owner: 10Dzahn) [00:22:42] 06Operations, 10ops-esams, 10hardware-requests: decom cp3011-22 (12 machines) - https://phabricator.wikimedia.org/T130883#2967392 (10Dzahn) cp3011, cp3014 for example show up in modules/torrus/tests/cdn.pp. Wondering what to replace them with in those torrus tests. [00:23:22] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [00:23:47] (03PS1) 10Dzahn: site.pp, DHCP: remove cp3011-cp3022 [puppet] - 10https://gerrit.wikimedia.org/r/334005 (https://phabricator.wikimedia.org/T130883) [00:27:44] (03PS1) 10Dzahn: site.pp, DHCP: remove mw1017,mw1099 [puppet] - 10https://gerrit.wikimedia.org/r/334006 (https://phabricator.wikimedia.org/T151303) [00:29:13] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review, 15User-Joe: Decommission mw1017, mw1099 - https://phabricator.wikimedia.org/T151295#2967404 (10Dzahn) [00:29:15] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review, 15User-Joe: Hardware decommission mw1017, mw1099 - https://phabricator.wikimedia.org/T151303#2967403 (10Dzahn) 05Resolved>03Open [00:30:23] (03PS1) 10Dzahn: apache-fast-test: replace mw1017 with mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/334007 [00:31:02] (03PS11) 10Dzahn: mediawiki module: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/332103 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [00:31:37] (03CR) 10Dzahn: [C: 031] "noop in compiler http://puppet-compiler.wmflabs.org/5216/" [puppet] - 10https://gerrit.wikimedia.org/r/332103 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [00:31:55] 06Operations, 10DBA, 10netops, 13Patch-For-Review: Switchover s1 master db1057 -> db1052 - https://phabricator.wikimedia.org/T156008#2967405 (10jcrespo) I have upgraded all 
packages except wmf-mariadb10 and restarted the server for kernel update. [00:36:05] (03CR) 10Dzahn: [C: 032] mediawiki module: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/332103 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [00:36:12] RECOVERY - salt-minion processes on planet2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:37:22] !log planet2001 - re-add new salt key, fix minion [00:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:10] (03PS1) 10Jcrespo: mariadb: Repool db1052 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334008 (https://phabricator.wikimedia.org/T156008) [00:40:39] 06Operations, 10Cassandra, 10RESTBase, 06Services (later): Evaluate ScyllaDB as a near-term replacement to Cassandra - https://phabricator.wikimedia.org/T150811#2967432 (10GWicke) [00:41:08] 06Operations, 10DBA, 13Patch-For-Review: Reimage db1065 and db1066 - https://phabricator.wikimedia.org/T156005#2967438 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1066.eqiad.wmnet'] ``` and were **ALL** successful. [00:45:43] (03PS1) 10Dzahn: site.pp, DHCP: remove db1019 [puppet] - 10https://gerrit.wikimedia.org/r/334010 (https://phabricator.wikimedia.org/T146265) [00:46:46] (03PS2) 10Dzahn: site.pp, DHCP: remove db1019, db1042 [puppet] - 10https://gerrit.wikimedia.org/r/334010 (https://phabricator.wikimedia.org/T146265) [00:47:23] (03PS3) 10Dzahn: site.pp, DHCP: remove db1019, db1042 [puppet] - 10https://gerrit.wikimedia.org/r/334010 (https://phabricator.wikimedia.org/T149793) [00:49:48] 06Operations, 10ops-eqiad: reclaim multatuli - https://phabricator.wikimedia.org/T156208#2967468 (10Dzahn) [00:51:10] (03PS1) 10Dzahn: site.pp, DHCP: remove multatuli [puppet] - 10https://gerrit.wikimedia.org/r/334011 (https://phabricator.wikimedia.org/T156208) [00:51:58] (03CR) 10Volans: [C: 031] "LGTM, I've also tested it on the host." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/333866 (owner: 10Faidon Liambotis) [00:53:11] (03PS1) 10Dzahn: remove multatuli.wikimedia.org, keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/334012 (https://phabricator.wikimedia.org/T156208) [00:54:27] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1052 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334008 (https://phabricator.wikimedia.org/T156008) (owner: 10Jcrespo) [00:55:56] (03Merged) 10jenkins-bot: mariadb: Repool db1052 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334008 (https://phabricator.wikimedia.org/T156008) (owner: 10Jcrespo) [00:57:00] (03PS1) 10Dzahn: remove analytics1015, analytics1026. incl. mgmt [dns] - 10https://gerrit.wikimedia.org/r/334013 (https://phabricator.wikimedia.org/T147313) [00:57:52] (03CR) 10jenkins-bot: mariadb: Repool db1052 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334008 (https://phabricator.wikimedia.org/T156008) (owner: 10Jcrespo) [00:58:39] (03PS2) 10Dzahn: remove analytics1015, analytics1026. incl. mgmt [dns] - 10https://gerrit.wikimedia.org/r/334013 (https://phabricator.wikimedia.org/T147313) [00:59:49] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1052 after maintenance (duration: 00m 40s) [00:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:00] (03PS1) 10Dzahn: remove db1019, db1042 incl. mgmt [dns] - 10https://gerrit.wikimedia.org/r/334014 (https://phabricator.wikimedia.org/T149793) [01:04:01] (03PS1) 10Dzahn: remove cp3011-cp3022 incl. 
mgmt [dns] - 10https://gerrit.wikimedia.org/r/334015 (https://phabricator.wikimedia.org/T130883) [01:05:03] 06Operations, 10ops-eqiad, 13Patch-For-Review: reclaim multatuli - https://phabricator.wikimedia.org/T156208#2967492 (10Dzahn) esams row OE12 @ 9 [01:05:06] 06Operations, 10DBA, 10MediaWiki-Change-tagging: db1072 change_tag schema and dataset is not consistent - https://phabricator.wikimedia.org/T156166#2967493 (10TTO) It looks like the author of `ChangeTags::updateTags` intended for the DB unique key to act as protection against duplicate entries. This pattern... [01:05:43] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 64 failures. Last run 2 minutes ago with 64 failures. Failed resources (up to 3 shown): Package[ack-grep],Package[screen],Package[nagios-plugins-basic],Package[python-yaml] [01:09:13] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:12:42] 06Operations, 10DBA, 10MediaWiki-Change-tagging: db1072 change_tag schema and dataset is not consistent - https://phabricator.wikimedia.org/T156166#2967519 (10jcrespo) > is widespread in MediaWiki core And we should kill them with fire! :-P I got the architecture committee to agree with me and document that... [01:13:28] Hey.... [01:13:43] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [01:13:51] Any operations guy around, willing to do a touch of DB tweaking? Not urgent at all. [01:16:35] a touch of DB tweaking? [01:16:52] (nods) [01:17:22] There are about 35 ‘completely bogus’ entries with broken information in the transcode table. [01:17:23] PROBLEM - Host labstore1004 is DOWN: PING CRITICAL - Packet loss = 100% [01:17:33] PROBLEM - Host analytics1031 is DOWN: PING CRITICAL - Packet loss = 100% [01:17:46] not the switch again... [01:17:47] s1-master down, too?
[01:17:50] They aren’t really ‘fixable’, because they are all transcodes of renamed files, under the old filename. [01:17:51] ouch [01:17:53] PROBLEM - Host analytics1029 is DOWN: PING CRITICAL - Packet loss = 100% [01:17:53] PROBLEM - Host analytics1030 is DOWN: PING CRITICAL - Packet loss = 100% [01:17:53] PROBLEM - Host analytics1028 is DOWN: PING CRITICAL - Packet loss = 100% [01:17:56] They date from 2013. [01:17:57] jynus: possible, likely [01:18:03] Is Wikipedia supposed to be in read-only mode right now? [01:18:03] PROBLEM - Host db1088 is DOWN: PING CRITICAL - Packet loss = 100% [01:18:03] PROBLEM - Host ms-be1007 is DOWN: PING CRITICAL - Packet loss = 100% [01:18:03] PROBLEM - Host db1056 is DOWN: PING CRITICAL - Packet loss = 100% [01:18:03] PROBLEM - Host db1054 is DOWN: PING CRITICAL - Packet loss = 100% [01:18:03] PROBLEM - Host db1057 is DOWN: PING CRITICAL - Packet loss = 100% [01:18:04] https://quarry.wmflabs.org/query/14916 <- those. [01:18:04] PROBLEM - Host db1060 is DOWN: PING CRITICAL - Packet loss = 100% [01:18:04] PROBLEM - Host es1016 is DOWN: PING CRITICAL - Packet loss = 100% [01:18:05] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100% [01:18:05] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100% [01:18:06] PROBLEM - Host db1055 is DOWN: PING CRITICAL - Packet loss = 100% [01:18:06] PROBLEM - Host db1087 is DOWN: PING CRITICAL - Packet loss = 100% [01:18:07] PROBLEM - Host db1059 is DOWN: PING CRITICAL - Packet loss = 100% [01:18:11] Ok wait, later, lol. [01:18:12] yeah, I would say so [01:18:13] PROBLEM - Host es1015 is DOWN: PING CRITICAL - Packet loss = 100% [01:18:14] nice [01:18:23] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 208, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/0/2: down - Core: asw-c-eqiad:xe-2/1/2 {#3464} [10Gbps DF]BR [01:18:23] PROBLEM - configured eth on lvs1003 is CRITICAL: eth2 reporting no carrier. 
[01:18:33] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle = None: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [01:18:33] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle = None: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [01:18:33] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:18:33] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:18:33] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:18:33] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:18:34] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:18:34] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:18:35] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:18:35] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:18:36] PROBLEM - configured eth on lvs1001 is CRITICAL: eth2 reporting no carrier. 
[01:18:43] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200) [01:18:43] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:18:44] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:18:44] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:18:44] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:18:53] RECOVERY - Host db1088 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [01:18:53] RECOVERY - Host db1056 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [01:18:53] RECOVERY - Host db1059 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [01:18:53] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [01:18:53] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [01:18:53] RECOVERY - Host analytics1028 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [01:18:54] RECOVERY - Host analytics1029 is UP: PING OK - Packet loss = 0%, RTA = 1.88 ms [01:18:54] RECOVERY - Host db1060 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [01:18:55] RECOVERY - Host db1054 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [01:18:55] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [01:18:56] RECOVERY - Host labstore1004 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [01:18:57] RECOVERY - Host es1016 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [01:18:57] RECOVERY - Host db1057 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [01:18:57] RECOVERY - Host db1087 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [01:18:59] The Wikipedia database is temporarily in read-only mode [01:19:20] should be up now? 
[01:19:23] yeah [01:19:23] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [01:19:23] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [01:19:23] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [01:19:23] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [01:19:23] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [01:19:24] yep [01:19:24] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [01:19:24] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [01:19:25] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [01:19:25] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [01:19:26] RECOVERY - configured eth on lvs1003 is OK: OK - interfaces up [01:19:26] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [01:19:27] back to normal [01:19:33] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [01:19:33] RECOVERY - configured eth on lvs1001 is OK: OK - interfaces up [01:19:34] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [01:19:34] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [01:19:34] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [01:19:37] well, that is less than 5 minutes [01:19:40] jynus: yes, usually just restarts [01:19:43] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [01:19:43] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [01:19:44] yeah - switch restart [01:20:24] let's see the actual impact on enwiki rcs [01:20:43] PROBLEM - Ulsfo HTTP 5xx 
reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [01:21:23] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:21:23] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [01:21:28] 02:16:06-02:18:56 [01:21:33] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 2 minutes ago with 6 failures. Failed resources (up to 3 shown): File[/home/mforns],File[/home/rush],File[/home/mark],File[/home/springle] [01:21:46] so under 3 minutes [01:21:55] labs is not happy :-) [01:22:20] labstore1004 should be passive, 1005 should be the active one [01:22:23] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 2 minutes ago with 10 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check-fresh-files-in-dir.py],File[/usr/local/bin/puppet-enabled],File[/usr/lib/nagios/plugins/check_sysctl],File[/etc/sysctl.d] [01:22:23] jynus: where are you seeing this? [01:22:40] enwiki recentchanges [01:22:59] s1 master loses connectivity, but it is up [01:23:11] rcs are a good measure of the actual user impact [01:23:23] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:23:26] jynus: but why should labs not be happy? [01:23:35] 06Operations, 10ops-codfw: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2967547 (10Papaul) [01:23:39] labstore1004 is there [01:23:43] NFS bla bla [01:23:44] is passive [01:23:47] or something [01:23:49] 1005 is the active one [01:23:57] since the first reboot [01:23:58] ok, if you say so :-) [01:24:00] and is doing fine [01:24:11] I was reacting to the alerts [01:24:27] jynus: can you point me to the alerts?
[01:24:32] oh, madhuvishy you meant why I said labs wasn't happy? [01:24:33] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [01:24:39] jynus: yes :) [01:25:01] ok, forget it [01:25:26] I mistook an analytics box for labs [01:25:30] sorry about that [01:25:33] aah [01:25:37] okay :) [01:25:53] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [01:26:24] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [01:26:33] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:29:33] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [01:31:43] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:31:53] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:32:33] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [01:33:23] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:34:33] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:36:41] Revent, add your request as a comment with expected and found results in a query here: https://phabricator.wikimedia.org/T138967 [01:36:58] 06Operations, 10ops-codfw, 10netops: codfw: mc2019-mc2036/switch port configuration - https://phabricator.wikimedia.org/T156212#2967567 (10Papaul) [01:38:13] 06Operations, 10ops-codfw: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2953444 (10Papaul) [01:38:13] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is
currently enabled, last run 57 seconds ago with 0 failures [01:39:17] jynus: Ok… just so ya know, this same search in quarry is populated with all of the ‘currently running’ transcodes if the scalers are reset… both times that was done recently, ended up with a bunch that were shown as ‘completed’, but were not, and had to be shoveled back through. [01:39:45] add all that info to the ticket [01:39:49] These 35 are just ‘unfixable’ from our end, as they don’t exist. [01:39:50] otherwise, it will be lost! [01:40:13] I promise to have a look at it [01:40:20] Ok. [01:45:33] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [01:47:23] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [01:47:53] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 1801.673495 Seconds [01:48:53] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 28.526611 Seconds [01:50:23] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [01:54:40] jynus: I left a longish comment, explaining as much as I can about it (and history) [01:54:51] thank you, Revent [01:56:04] 06Operations, 10ops-codfw, 06DC-Ops: ms-be2025 controller failure - https://phabricator.wikimedia.org/T151201#2967634 (10Papaul) 05Open>03Resolved It has been more than a month now this system is up and running with no problem. I am resolving this task. [01:57:56] jynus: I mentioned the entries that ‘were’ in that search, that I fixed by undeleting and redeleting the files…. the bug behind those seems clearly to have been long fixed, as ‘now’ deleting a file with running transcodes (or renaming one, for that matter) removes them from the table. [01:59:33] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:00:34] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [02:19:53] 06Operations, 10DBA, 13Patch-For-Review: Reimage db1065 and db1066 - https://phabricator.wikimedia.org/T156005#2967668 (10jcrespo) This is all done, only pending db1066 to catch up and repool. [02:20:18] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.8) (duration: 06m 25s) [02:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:23:30] 06Operations, 10MediaWiki-Vagrant, 06Release-Engineering-Team, 07Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#2967672 (10bd808) [02:24:09] 06Operations, 10MediaWiki-Vagrant, 06Release-Engineering-Team, 07Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#2334744 (10bd808) [02:24:36] 06Operations, 10MediaWiki-Vagrant, 06Release-Engineering-Team, 07Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#2334744 (10bd808) [02:30:04] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2967685 (10jcrespo) MySQLs with no SSL ``` $ sudo salt -C 'G@cluster:mysql' cmd.run 'mysql --skip-ssl -e "SELECT @@ssl_ca"' | grep -c 'NULL' 13 ``` MySQL with expired TLS cert: ``` $ sudo s... [02:35:39] 06Operations, 10MediaWiki-Vagrant, 06Release-Engineering-Team, 07Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#2967690 (10bd808) [02:39:33] (03PS1) 10Jcrespo: mariadb: Repool db1066 after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334022 (https://phabricator.wikimedia.org/T156005) [02:46:13] (03CR) 10Volans: "I did a first pass on the code (tests TBD), see my comments so far inline." 
(033 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/288609 (https://phabricator.wikimedia.org/T155823) (owner: 10Giuseppe Lavagetto) [02:50:20] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.9) (duration: 12m 53s) [02:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:55:58] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Jan 25 02:55:57 UTC 2017 (duration 5m 37s) [02:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:13:28] 06Operations, 06Commons, 06Multimedia, 10media-storage, 15User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101#2967818 (10Josve05a) First revision of https://commons.wikimedia.org/wiki/File:J.J._Burns_NSRW1-0009.jpg as... [03:24:03] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 689.07 seconds [03:28:53] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:45:03] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 165.07 seconds [03:56:53] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [05:06:13] PROBLEM - puppet last run on mw1242 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:14:36] (03PS1) 10AndyRussG: CentralNotice config: make mediawiki its own CN project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334025 [05:15:42] (03PS2) 10AndyRussG: CentralNotice config: make mediawiki its own CN project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334025 (https://phabricator.wikimedia.org/T156221) [05:25:51] (03PS4) 10Ema: Expand Content-Security-Policy on upload test to fr. 
[puppet] - 10https://gerrit.wikimedia.org/r/318490 (https://phabricator.wikimedia.org/T117618) (owner: 10Brian Wolff) [05:33:43] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:34:13] RECOVERY - puppet last run on mw1242 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [05:37:00] !log zotero restarting zotero, taking 95% of mem ... [05:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:29] (03CR) 10Ema: [C: 031] raid: also check for State: degraded in md arrays [puppet] - 10https://gerrit.wikimedia.org/r/333866 (owner: 10Faidon Liambotis) [05:54:13] RECOVERY - Restbase root url on restbase-dev1001 is OK: HTTP OK: HTTP/1.1 200 - 15500 bytes in 0.028 second response time [05:54:33] RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy [06:02:43] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:03:41] akosiaris: i'm sick of zotero too, let's prioritise its removal \o/ [06:07:54] <_joe_> mobrovac: so many things to remove :P [06:08:17] haha [06:13:09] _joe_: here's more :) https://gerrit.wikimedia.org/r/#/c/334006/1 [06:14:27] signs off, cya [06:15:54] (03Draft2) 10TTO: Enable expiring user groups on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333652 [06:16:00] (03CR) 10TTO: [C: 04-1] Enable expiring user groups on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333652 (owner: 10TTO) [06:34:03] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy [06:34:23] RECOVERY - Restbase root url on restbase-dev1002 is OK: HTTP OK: HTTP/1.1 200 - 15500 bytes in 0.014 second response time [06:35:03] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers 
[06:35:04] RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy [06:35:13] RECOVERY - Restbase root url on restbase-dev1003 is OK: HTTP OK: HTTP/1.1 200 - 15500 bytes in 0.010 second response time [06:39:03] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:50:55] (03CR) 10Marostegui: [C: 032] mariadb: Repool db1066 after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334022 (https://phabricator.wikimedia.org/T156005) (owner: 10Jcrespo) [06:52:40] (03Merged) 10jenkins-bot: mariadb: Repool db1066 after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334022 (https://phabricator.wikimedia.org/T156005) (owner: 10Jcrespo) [06:52:51] (03CR) 10jenkins-bot: mariadb: Repool db1066 after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334022 (https://phabricator.wikimedia.org/T156005) (owner: 10Jcrespo) [06:53:03] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:53:13] PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:54:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1066 - T156005 (duration: 00m 42s) [06:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:16] T156005: Reimage db1065 and db1066 - https://phabricator.wikimedia.org/T156005 [06:55:57] 06Operations, 10DBA, 06Labs, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2968016 (10Marostegui) [06:56:02] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2968017 (10Marostegui) [06:56:04] 06Operations, 10DBA, 13Patch-For-Review: 
Reimage db1065 and db1066 - https://phabricator.wikimedia.org/T156005#2968014 (10Marostegui) 05Open>03Resolved I have repooled the host. Awesome job @jcrespo [07:04:40] 06Operations, 10DBA, 06Labs, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2968020 (10Marostegui) For the record and tracking purposes: after lots of hours and hassle we were able to switch db1095's (new sanitarium) master from db1052... [07:07:39] 06Operations, 10ops-eqiad, 10DBA, 10netops: Move db1054 to C3 - https://phabricator.wikimedia.org/T156225#2968022 (10Marostegui) [07:08:02] 06Operations, 10DBA, 06Labs, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2961118 (10Marostegui) [07:09:34] 06Operations, 10DBA, 10MediaWiki-Change-tagging: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2968039 (10Marostegui) [07:18:16] (03CR) 10Marostegui: "if we push this they should be powered off soon in order to avoid them being up without having any security update, right?" [puppet] - 10https://gerrit.wikimedia.org/r/334010 (https://phabricator.wikimedia.org/T149793) (owner: 10Dzahn) [07:18:35] (03CR) 10Marostegui: [C: 031] remove db1019, db1042 incl. 
mgmt [dns] - 10https://gerrit.wikimedia.org/r/334014 (https://phabricator.wikimedia.org/T149793) (owner: 10Dzahn) [07:28:43] !log upgrading aqs100[56] to node6 [07:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:34] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0; xe-3/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave] [07:36:44] (03PS1) 10Ema: icinga: do not print stacktrace when check_ripe_atlas exits with 2 [puppet] - 10https://gerrit.wikimedia.org/r/334028 [07:38:52] 06Operations, 10ops-codfw, 10DBA: db2060 not accessible - https://phabricator.wikimedia.org/T156161#2968060 (10Marostegui) a:03Papaul I had rebooted the server as it wasn't responding. ILO logs aren't showing anything, nor are the system logs. However yesterday, as it can be seen on the original ticket message,... [07:42:42] (03PS1) 10Marostegui: Revert "db-codfw.php Depool db2054" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334029 [07:43:03] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [07:46:02] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php Depool db2054" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334029 (owner: 10Marostegui) [07:48:59] <_joe_> !log restarting pybal on lvs1003 [07:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:16] 06Operations, 10DBA, 10MediaWiki-Change-tagging: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2968079 (10Marostegui) We should take the opportunity, now that it is depooled, to move it to another rack as all the API hosts for s1 are on D1, including this host. We can move it to B2 for... 
[07:49:24] 06Operations, 10DBA: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2968080 (10Marostegui) [07:51:36] (03PS2) 10Marostegui: Revert "db-codfw.php Depool db2054" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334029 [07:52:35] (03PS2) 10Muehlenhoff: Remove myself from elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/333946 (owner: 10Chad) [07:57:18] moritzm,mobrovac - aqs migrated to node6 [08:00:04] elukey: nice [08:00:17] (03PS2) 10Elukey: Increase retry wait time for Hadoop Yarn Nodemanager checks [puppet] - 10https://gerrit.wikimedia.org/r/333912 [08:01:01] (03CR) 10jenkins-bot: Revert "db-codfw.php Depool db2054" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334029 (owner: 10Marostegui) [08:01:32] (03CR) 10Muehlenhoff: [C: 032] Remove myself from elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/333946 (owner: 10Chad) [08:02:09] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2054 - T153300 (duration: 00m 51s) [08:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:14] T153300: Remove partitions from metawiki.pagelinks in s7 - https://phabricator.wikimedia.org/T153300 [08:02:56] (03PS3) 10Elukey: Increase retry wait time for Hadoop Yarn Nodemanager checks [puppet] - 10https://gerrit.wikimedia.org/r/333912 [08:03:41] (03CR) 10Elukey: [C: 032] Increase retry wait time for Hadoop Yarn Nodemanager checks [puppet] - 10https://gerrit.wikimedia.org/r/333912 (owner: 10Elukey) [08:03:43] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 4.263 second response time [08:04:43] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 2.469 second response time [08:06:43] (03CR) 10Elukey: [C: 031] remove 
analytics1015, analytics1026. incl. mgmt [dns] - 10https://gerrit.wikimedia.org/r/334013 (https://phabricator.wikimedia.org/T147313) (owner: 10Dzahn) [08:07:34] (03CR) 10Faidon Liambotis: [C: 032] icinga: do not print stacktrace when check_ripe_atlas exits with 2 [puppet] - 10https://gerrit.wikimedia.org/r/334028 (owner: 10Ema) [08:10:03] PROBLEM - puppet last run on db1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:10:55] (03PS2) 10Ema: icinga: do not print stacktrace when check_ripe_atlas exits with 2 [puppet] - 10https://gerrit.wikimedia.org/r/334028 [08:11:00] (03CR) 10Ema: [V: 032] icinga: do not print stacktrace when check_ripe_atlas exits with 2 [puppet] - 10https://gerrit.wikimedia.org/r/334028 (owner: 10Ema) [08:15:08] !log upgrade cp3034 to jessie 8.7 and reboot into kernel 4.4.2-3+wmf8 [08:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:58] !log upgrade cp3034 to jessie 8.7 and reboot into kernel 4.4.2-3+wmf8 T155401 [08:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:01] T155401: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401 [08:19:13] RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK [08:20:17] (03PS1) 10Marostegui: site.pp: Change active master for enwiki [puppet] - 10https://gerrit.wikimedia.org/r/334030 (https://phabricator.wikimedia.org/T156008) [08:20:43] (03CR) 10Marostegui: [C: 04-2] "wait for the correct day and time" [puppet] - 10https://gerrit.wikimedia.org/r/334030 (https://phabricator.wikimedia.org/T156008) (owner: 10Marostegui) [08:23:43] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 3.556 second response time [08:24:43] RECOVERY - Start a job and verify on Trusty on 
checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 3.367 second response time [08:27:32] (03CR) 10Marostegui: [C: 04-2] "https://puppet-compiler.wmflabs.org/5218/ - changes the correct things on both hosts" [puppet] - 10https://gerrit.wikimedia.org/r/334030 (https://phabricator.wikimedia.org/T156008) (owner: 10Marostegui) [08:39:09] RECOVERY - puppet last run on db1036 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [08:50:45] (03PS1) 10Muehlenhoff: Add more email addresses [puppet] - 10https://gerrit.wikimedia.org/r/334034 [08:51:21] whats up with the certs from wmf ca in the puppet repo are those in case globalsign goes out the trash bin again? [08:53:42] !log upgrade cp3040 to jessie 8.7 and reboot into kernel 4.4.2-3+wmf8 T155401 [08:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:46] T155401: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401 [08:53:56] (03PS4) 10Zppix: site.pp, DHCP: remove db1019, db1042 [puppet] - 10https://gerrit.wikimedia.org/r/334010 (https://phabricator.wikimedia.org/T149793) (owner: 10Dzahn) [08:54:39] PROBLEM - puppet last run on relforge1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:57:26] (03PS2) 10Muehlenhoff: Add more email addresses [puppet] - 10https://gerrit.wikimedia.org/r/334034 [08:58:57] 06Operations, 10MediaWiki-Configuration: change -wgDisableUserGroupExpiry' to -wgEnableUserGroupExpiry - https://phabricator.wikimedia.org/T156230#2968138 (10Zppix) [09:00:31] (03CR) 10Muehlenhoff: [C: 032] Add more email addresses [puppet] - 10https://gerrit.wikimedia.org/r/334034 (owner: 10Muehlenhoff) [09:12:04] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and set up aqs100[7-9] - https://phabricator.wikimedia.org/T155654#2968172 (10elukey) [09:12:48] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 15User-Elukey: rack and set up aqs100[7-9] - https://phabricator.wikimedia.org/T155654#2949969 (10elukey) a:05Cmjohnson>03elukey [09:14:35] (03PS1) 10Elukey: Add aqs100[789] to the related role [puppet] - 10https://gerrit.wikimedia.org/r/334035 (https://phabricator.wikimedia.org/T155654) [09:16:16] 06Operations, 10MediaWiki-Configuration: change -wgDisableUserGroupExpiry' to -wgEnableUserGroupExpiry - https://phabricator.wikimedia.org/T156230#2968180 (10TTO) p:05Triage>03Lowest From https://gerrit.wikimedia.org/r/328377/: > Since this is a temporary feature flag I decided to use Disable. The intenti... [09:18:17] 06Operations, 10MediaWiki-Configuration: change -wgDisableUserGroupExpiry' to -wgEnableUserGroupExpiry - https://phabricator.wikimedia.org/T156230#2968183 (10Zppix) I don't see the reason for removing this, honestly I like the Idea behind it and I think it could be used in MW to easily have others have temp ac... 
[09:18:26] 06Operations, 10MediaWiki-Configuration: change -wgDisableUserGroupExpiry' to -wgEnableUserGroupExpiry - https://phabricator.wikimedia.org/T156230#2968184 (10Legoktm) 05Open>03declined [09:19:10] 06Operations, 07discovery-system: confctl SubjectAltNameWarning after python-urllib3 upgrade - https://phabricator.wikimedia.org/T156232#2968185 (10ema) [09:19:29] 06Operations, 07discovery-system: confctl SubjectAltNameWarning after python-urllib3 upgrade - https://phabricator.wikimedia.org/T156232#2968198 (10ema) p:05Triage>03Normal [09:22:42] RECOVERY - puppet last run on relforge1001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [09:25:07] !log updating puppet-compiler facts [09:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:03] (03PS1) 10DCausse: Revert "[cirrus] properly set wgCirrusSearchUseIcuFolding" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334038 (https://phabricator.wikimedia.org/T156234) [10:06:55] (03PS3) 10Faidon Liambotis: raid: also check for State: degraded in md arrays [puppet] - 10https://gerrit.wikimedia.org/r/333866 [10:07:41] (03CR) 10Faidon Liambotis: [C: 032] raid: also check for State: degraded in md arrays [puppet] - 10https://gerrit.wikimedia.org/r/333866 (owner: 10Faidon Liambotis) [10:08:55] (03PS2) 10Ema: Revert "Temporarily depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/333854 [10:09:09] (03CR) 10Ema: [V: 032 C: 032] Revert "Temporarily depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/333854 (owner: 10Ema) [10:11:18] !log repooled codfw [10:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:25] (03PS1) 10Elukey: Allocate aqs100[789]'s cassandra instances forward and reverse IPs. [dns] - 10https://gerrit.wikimedia.org/r/334040 (https://phabricator.wikimedia.org/T155654) [10:18:01] (03PS2) 10Elukey: Allocate aqs100[789]'s cassandra instances A and PTR records. 
[dns] - 10https://gerrit.wikimedia.org/r/334040 (https://phabricator.wikimedia.org/T155654) [10:19:14] (03PS3) 10Elukey: Allocate aqs100[789]'s cassandra instances A and PTR records. [dns] - 10https://gerrit.wikimedia.org/r/334040 (https://phabricator.wikimedia.org/T155654) [10:28:42] !log uploaded ca-certificates-java 20161107~bpo8+1 to apt.wikimedia.org [10:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:12] RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK [10:36:38] 06Operations, 10MediaWiki-JobRunner, 13Patch-For-Review, 15User-Addshore: jobrunner should send statsd in batches - https://phabricator.wikimedia.org/T132327#2968409 (10Addshore) *poke* @aaron ? :) [10:39:12] PROBLEM - MD RAID on mw2256 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 [10:39:13] ACKNOWLEDGEMENT - MD RAID on mw2256 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T156242 [10:39:18] 06Operations, 10ops-codfw: Degraded RAID on mw2256 - https://phabricator.wikimedia.org/T156242#2968411 (10ops-monitoring-bot) [10:45:02] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:49:04] !log uploaded openjdk-8 u121 to apt.wikimedia.org [10:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:12] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK [11:01:19] !log upgrading restbase staging cluster to new openjdk (also piggyback reboot to latest 4.4 kernel) [11:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:04] 06Operations, 10ops-codfw, 10netops: asw-a7-codfw is down - https://phabricator.wikimedia.org/T154758#2968449 (10faidon) 05Open>03Resolved a:03faidon Nothing more to do here. 
[11:07:21] 06Operations, 06Labs, 10netops: asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues - https://phabricator.wikimedia.org/T155875#2968453 (10faidon) p:05High>03Unbreak! [11:10:23] 06Operations, 06Labs, 10netops: asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues - https://phabricator.wikimedia.org/T155875#2968455 (10faidon) The switch rebooted again overnight (Jan 25 01:16 UTC). We are going to proceed with a replacement as soon as the DBA work (T155999) is done. Setting priority to... [11:12:44] 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#2968457 (10faidon) 05Open>03stalled [11:13:02] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [11:13:27] (03PS1) 10Ema: cache: remove varnish_version4 from hiera and salt [puppet] - 10https://gerrit.wikimedia.org/r/334043 [11:14:05] 06Operations, 06Labs, 10netops: asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues - https://phabricator.wikimedia.org/T155875#2968471 (10Marostegui) >>! In T155875#2968455, @faidon wrote: > The switch rebooted again overnight (Jan 25 01:16 UTC). We are going to proceed with a replacement as soon as the DB... [11:22:22] PROBLEM - puppet last run on wtp1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:25:12] PROBLEM - Check whether ferm is active by checking the default input chain on db1051 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [11:29:00] (03PS5) 10Ema: Expand Content-Security-Policy on upload test to fr. [puppet] - 10https://gerrit.wikimedia.org/r/318490 (https://phabricator.wikimedia.org/T117618) (owner: 10Brian Wolff) [11:29:07] (03CR) 10Ema: [V: 032 C: 032] Expand Content-Security-Policy on upload test to fr. 
[puppet] - 10https://gerrit.wikimedia.org/r/318490 (https://phabricator.wikimedia.org/T117618) (owner: 10Brian Wolff) [11:30:13] RECOVERY - Check whether ferm is active by checking the default input chain on db1051 is OK: OK ferm input default policy is set [11:35:12] RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK [11:50:22] RECOVERY - puppet last run on wtp1021 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [11:55:32] 06Operations, 10DBA, 10netops, 13Patch-For-Review: Switchover s1 master db1057 -> db1052 - https://phabricator.wikimedia.org/T156008#2968514 (10Marostegui) This will be happening Thursday 25th at 07:00 UTC [12:04:36] !log Refresh site statistics on simple. (T156247) [12:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:41] T156247: Update statistics count on simplewiki - https://phabricator.wikimedia.org/T156247 [12:07:12] (03CR) 10Jcrespo: [C: 04-1] site.pp: Change active master for enwiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/334030 (https://phabricator.wikimedia.org/T156008) (owner: 10Marostegui) [12:10:48] 06Operations, 10ops-eqiad, 13Patch-For-Review: reclaim multatuli - https://phabricator.wikimedia.org/T156208#2967457 (10faidon) Please don't, I've been using it for general experimentation/various unpuppetized stuff. [12:10:51] (03CR) 10Marostegui: [C: 04-1] "Ah, I thought we couldn't not have both set to true, as we only one host with role master on puppet. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/334030 (https://phabricator.wikimedia.org/T156008) (owner: 10Marostegui) [12:22:48] (03CR) 10Muehlenhoff: [C: 04-1] "@chad: We should strive to use a native systemd unit for gerrit now that gerrit is running on a jessie host. 
While systemd has a generator" (032 comments) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/333475 (owner: 10Paladox) [12:23:52] 06Operations, 10ops-esams, 13Patch-For-Review: reclaim multatuli - https://phabricator.wikimedia.org/T156208#2968591 (10Southparkfan) (correct datacenter) [12:24:22] (03CR) 10Jcrespo: [C: 04-1] "> as we only one host with role master on puppet" [puppet] - 10https://gerrit.wikimedia.org/r/334030 (https://phabricator.wikimedia.org/T156008) (owner: 10Marostegui) [12:27:37] 06Operations, 10ops-esams, 13Patch-For-Review: reclaim multatuli - https://phabricator.wikimedia.org/T156208#2967457 (10MoritzMuehlenhoff) /me too, that's the host where I usually test kernels or other updates which cannot be tested in labs. I'd say let's just switch the puppet role to role::test::system [12:30:08] there is high api throughput since 11 am today [12:30:12] on enwiki [12:32:25] https://grafana.wikimedia.org/dashboard/db/mysql?var-dc=eqiad%20prometheus%2Fops&var-server=db1066&from=now-3h&to=now&panelId=5&fullscreen [12:34:34] api servers seem to handle it ok: https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&cluster=api_appserver&panelId=85&fullscreen [12:38:42] !log installing libxml security updates [12:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:29] don't see anything weird from https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=All&from=now-24h&to=now too [13:06:46] (03PS1) 10Urbanecm: Allow bureaucrats to remove sysop rights on French Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334054 (https://phabricator.wikimedia.org/T156227) [13:11:04] pooling new elasticsearch nodes on codfw - T154251 [13:11:04] T154251: rack/setup/install elastic2025-2036 - 
https://phabricator.wikimedia.org/T154251 [13:11:13] !log pooling new elasticsearch nodes on codfw - T154251 [13:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:11] 06Operations, 10Wikimedia-Stream: Upstream prematurely closed connection - https://phabricator.wikimedia.org/T153772#2890554 (10akosiaris) I 've research this a bit more. What seems to happen is that some requests to URLs like /socket.io/1/websocket/695138296539 will get back an empty reply from the python ap... [13:15:30] 06Operations, 10ops-codfw, 06Discovery, 10Elasticsearch, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2968694 (10Gehel) 05Open>03Resolved All new elasticsearch nodes on codfw installed, configured and pooled. [13:17:16] (03PS1) 10Gehel: wdqs - configure wdqs2003 (new node) [puppet] - 10https://gerrit.wikimedia.org/r/334056 (https://phabricator.wikimedia.org/T152644) [13:18:08] (03CR) 10Marostegui: [C: 04-1] "well, yes, one per DC - I meant in the same DC. But it is all clear now. Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/334030 (https://phabricator.wikimedia.org/T156008) (owner: 10Marostegui) [13:18:17] (03PS2) 10Gehel: wdqs - configure wdqs2003 (new node) [puppet] - 10https://gerrit.wikimedia.org/r/334056 (https://phabricator.wikimedia.org/T152644) [13:25:14] (03CR) 10Gehel: [C: 032] wdqs - configure wdqs2003 (new node) [puppet] - 10https://gerrit.wikimedia.org/r/334056 (https://phabricator.wikimedia.org/T152644) (owner: 10Gehel) [13:27:09] 06Operations, 10ops-codfw, 06Discovery, 10Wikidata, and 2 others: rack/setup/install wdqs2003 - https://phabricator.wikimedia.org/T152644#2855908 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs2003.codfw.wmnet'] ``` The log can be found in... [13:28:52] dcausse: are you going to deploy your change during eu swat today? 
[13:29:28] zeljkof: yes, why not [13:29:44] dcausse: great, that is the only commit so far [13:30:07] ok makes sense then, I can do it and not bother you [13:30:36] zeljkof: it'd be nice if you stay around tho :) [13:30:53] dcausse: sure, both hashar and me will be around [13:31:00] cool thanks [13:31:22] o/ [13:31:31] o/ [13:46:33] (03PS1) 10Marostegui: db-eqiad.php: Depool db1054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334106 (https://phabricator.wikimedia.org/T156225) [13:51:19] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334106 (https://phabricator.wikimedia.org/T156225) (owner: 10Marostegui) [13:52:25] !log upgrading openjdk-8 on maps-test* [13:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:37] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#2968768 (10Gehel) @MoritzMuehlenhoff pointed to me that we don't support OpenJDK 8 on Trusty, only on Jessie. This is an additional reason to migrat... [13:53:39] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334106 (https://phabricator.wikimedia.org/T156225) (owner: 10Marostegui) [13:53:50] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334106 (https://phabricator.wikimedia.org/T156225) (owner: 10Marostegui) [13:54:59] 06Operations, 10ops-codfw, 06Discovery, 10Wikidata, and 2 others: rack/setup/install wdqs2003 - https://phabricator.wikimedia.org/T152644#2968772 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['wdqs2003.codfw.wmnet'] ``` and were **ALL** successful. 
[13:55:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1054 - T156225 (duration: 00m 50s) [13:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:20] T156225: Move db1054 to C3 - https://phabricator.wikimedia.org/T156225 [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170125T1400). [14:00:04] dcausse: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:21] o/ [14:00:38] I will swat my change [14:01:04] (03PS2) 10DCausse: Revert "[cirrus] properly set wgCirrusSearchUseIcuFolding" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334038 (https://phabricator.wikimedia.org/T156234) [14:01:06] (03PS1) 10Alexandros Kosiaris: Remove the per host docker networks [puppet] - 10https://gerrit.wikimedia.org/r/334121 [14:04:00] O/ [14:04:36] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334038 (https://phabricator.wikimedia.org/T156234) (owner: 10DCausse) [14:05:51] (03Merged) 10jenkins-bot: Revert "[cirrus] properly set wgCirrusSearchUseIcuFolding" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334038 (https://phabricator.wikimedia.org/T156234) (owner: 10DCausse) [14:07:35] (03CR) 10jenkins-bot: Revert "[cirrus] properly set wgCirrusSearchUseIcuFolding" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334038 (https://phabricator.wikimedia.org/T156234) (owner: 10DCausse) [14:09:13] !log removed totally outdated openjdk-8 packages from trusty-wikimedia (from 2014) on carbon [14:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:10] (03PS2) 10Marostegui: site.pp: Change active master for enwiki [puppet] - 
10https://gerrit.wikimedia.org/r/334030 (https://phabricator.wikimedia.org/T156008) [14:11:08] !log dcausse@tin Synchronized wmf-config/InitialiseSettings.php: T156234 Revert [cirrus] properly set wgCirrusSearchUseIcuFolding (duration: 00m 41s) [14:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:13] T156234: ICU folding seems to cause issues with completion - https://phabricator.wikimedia.org/T156234 [14:13:25] !log EU SWAT done [14:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:02] (03CR) 10Alexandros Kosiaris: [C: 032] Remove the per host docker networks [puppet] - 10https://gerrit.wikimedia.org/r/334121 (owner: 10Alexandros Kosiaris) [14:23:15] bblack: would you be able to review https://gerrit.wikimedia.org/r/#/c/333158/ [14:23:51] !log installing wget updates from jessie point release [14:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:32] (03CR) 10Gehel: [C: 04-1] "shipping log to logstash in log4j2 configuration is commented out pending tests. 
We should already deploy this change to enable testing of" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/333969 (https://phabricator.wikimedia.org/T155578) (owner: 10EBernhardson) [14:26:53] (03PS1) 10Giuseppe Lavagetto: etcd: add ability to use a TLS/auth proxy [puppet] - 10https://gerrit.wikimedia.org/r/334123 (https://phabricator.wikimedia.org/T156009) [14:26:55] (03PS1) 10Giuseppe Lavagetto: role::etcd::common: move to profile, refactor [puppet] - 10https://gerrit.wikimedia.org/r/334124 (https://phabricator.wikimedia.org/T156009) [14:26:57] (03PS1) 10Giuseppe Lavagetto: wmflib: add function to calculate htpasswd entries [puppet] - 10https://gerrit.wikimedia.org/r/334125 [14:26:59] (03PS1) 10Giuseppe Lavagetto: profile::etcd::tlsproxy: nginx auth proxy for etcd [puppet] - 10https://gerrit.wikimedia.org/r/334126 (https://phabricator.wikimedia.org/T156009) [14:27:01] (03PS1) 10Giuseppe Lavagetto: conf2xx: install etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/334127 (https://phabricator.wikimedia.org/T156009) [14:27:18] <_joe_> gehel: FWIW, I am generally against letting the app handle speaking to logstash directly [14:27:23] (03PS2) 10Gehel: Update elasticsearch module for es5 compatability [puppet] - 10https://gerrit.wikimedia.org/r/333969 (https://phabricator.wikimedia.org/T155578) (owner: 10EBernhardson) [14:27:56] _joe_: you prefer to have the app logging to file and those files parsed by lumberjack (or similar) ? 
[14:28:30] (03CR) 10jerkins-bot: [V: 04-1] wmflib: add function to calculate htpasswd entries [puppet] - 10https://gerrit.wikimedia.org/r/334125 (owner: 10Giuseppe Lavagetto) [14:28:56] <_joe_> if possible, I prefer to have the app send its logs to syslog, and then manage the logs from there [14:29:04] <_joe_> but YMMV of course [14:29:10] <_joe_> we don't do that with mediawiki [14:29:21] (03CR) 10jerkins-bot: [V: 04-1] profile::etcd::tlsproxy: nginx auth proxy for etcd [puppet] - 10https://gerrit.wikimedia.org/r/334126 (https://phabricator.wikimedia.org/T156009) (owner: 10Giuseppe Lavagetto) [14:29:38] <_joe_> but if my memory doesn't fail me, log4j works well with syslog [14:29:50] I know that for at least a few JVM based apps, we use a log4j appender [14:30:05] <_joe_> heh of course, everyone does that [14:30:12] log4j can send to syslog, but you lose most structure [14:30:12] (03CR) 10jerkins-bot: [V: 04-1] conf2xx: install etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/334127 (https://phabricator.wikimedia.org/T156009) (owner: 10Giuseppe Lavagetto) [14:30:44] unless you start sending json over syslog... [14:30:45] <_joe_> oh well, the idea that logs need to be structured, rather than readable, right :P [14:31:11] :) yeah, Java is a tool based stack :) [14:31:25] <_joe_> but yeah, I'm looking at how the log4j appender for logstash works, and it's mostly fine I guess [14:31:47] <_joe_> I kinda remember java apps stuck because log4j had the queue full in the past [14:32:10] <_joe_> we have caused a pretty large mediawiki outage here with a similar problem [14:33:06] Ceki has tried to deprecate log4j for probably 10 years... no one should still be using it. But that does not seem to be the case. [14:33:22] <_joe_> eheh [14:38:37] PROBLEM - puppet last run on db1023 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:42:51] !log installing ruby2.1 updates from jessie point release [14:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:47] 06Operations, 10Traffic: Select or Acquire Address Space for Asia Cache DC - https://phabricator.wikimedia.org/T156256#2968867 (10BBlack) [14:58:12] 06Operations, 10Traffic: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#2968883 (10BBlack) [14:58:15] 06Operations, 10Traffic: Select or Acquire Address Space for Asia Cache DC - https://phabricator.wikimedia.org/T156256#2968882 (10BBlack) [14:58:41] 06Operations, 10Traffic: Configuration for Asia Cache DC hosts - https://phabricator.wikimedia.org/T156027#2968889 (10BBlack) [14:58:44] 06Operations, 10Traffic: Select or Acquire Address Space for Asia Cache DC - https://phabricator.wikimedia.org/T156256#2968867 (10BBlack) [15:00:25] (03PS2) 10Giuseppe Lavagetto: etcd: add ability to use a TLS/auth proxy [puppet] - 10https://gerrit.wikimedia.org/r/334123 (https://phabricator.wikimedia.org/T156009) [15:00:27] (03PS2) 10Giuseppe Lavagetto: role::etcd::common: move to profile, refactor [puppet] - 10https://gerrit.wikimedia.org/r/334124 (https://phabricator.wikimedia.org/T156009) [15:00:29] (03PS2) 10Giuseppe Lavagetto: wmflib: add function to calculate htpasswd entries [puppet] - 10https://gerrit.wikimedia.org/r/334125 [15:00:31] (03PS2) 10Giuseppe Lavagetto: profile::etcd::tlsproxy: nginx auth proxy for etcd [puppet] - 10https://gerrit.wikimedia.org/r/334126 (https://phabricator.wikimedia.org/T156009) [15:00:33] (03PS2) 10Giuseppe Lavagetto: conf2xx: install etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/334127 (https://phabricator.wikimedia.org/T156009) [15:06:35] (03CR) 10jerkins-bot: [V: 04-1] profile::etcd::tlsproxy: nginx auth proxy for etcd [puppet] - 10https://gerrit.wikimedia.org/r/334126 (https://phabricator.wikimedia.org/T156009) (owner: 10Giuseppe 
Lavagetto) [15:06:37] RECOVERY - puppet last run on db1023 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [15:11:34] !log graphite1003 / graphite2002 at 94% utilization, increase lv size by 300G [15:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:46] mobrovac urandom Pchelolo gwicke ^ [15:14:37] (03PS3) 10Giuseppe Lavagetto: etcd: add ability to use a TLS/auth proxy [puppet] - 10https://gerrit.wikimedia.org/r/334123 (https://phabricator.wikimedia.org/T156009) [15:14:47] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [15:19:08] !log (slightly late) of 'maintain-views --all-databases --table watchlist_count --replace-all' across labsdbs [15:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:47] away [15:27:17] !log deleting indices using jieba plugin from relforge - T156150 [15:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:21] T156150: Install ES 5.x to relforge100[12] - https://phabricator.wikimedia.org/T156150 [15:28:38] !log removing jieba / ltr / swift plugins from elasticsearch relforge - T156150 [15:28:38] (03PS4) 10Giuseppe Lavagetto: etcd: add ability to use a TLS/auth proxy [puppet] - 10https://gerrit.wikimedia.org/r/334123 (https://phabricator.wikimedia.org/T156009) [15:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:17] PROBLEM - ElasticSearch health check for shards on relforge1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 268 threshold =0.1% breach: status: red, number_of_nodes: 2, unassigned_shards: 36, number_of_pending_tasks: 141, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 0, task_max_waiting_in_queue_millis: 21277, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: [15:31:17] PROBLEM - 
ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 268 threshold =0.1% breach: status: red, number_of_nodes: 2, unassigned_shards: 36, number_of_pending_tasks: 153, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 0, task_max_waiting_in_queue_millis: 25585, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: [15:32:07] ^ that's me, sorry for the noise [15:32:32] ACKNOWLEDGEMENT - ElasticSearch health check for shards on relforge1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 113 threshold =0.1% breach: status: red, number_of_nodes: 2, unassigned_shards: 36, number_of_pending_tasks: 180, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 155, task_max_waiting_in_queue_millis: 81402, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_ [15:32:33] ACKNOWLEDGEMENT - ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 113 threshold =0.1% breach: status: red, number_of_nodes: 2, unassigned_shards: 36, number_of_pending_tasks: 190, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 155, task_max_waiting_in_queue_millis: 85250, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_ [15:37:17] RECOVERY - ElasticSearch health check for shards on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 232, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 268, initial [15:37:17] RECOVERY - ElasticSearch health check for shards on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, 
number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 232, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 268, initial [15:44:21] (03PS1) 10Urbanecm: IP Cap Lift for Edit-a-Thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334134 (https://phabricator.wikimedia.org/T156258) [15:45:57] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:00:37] RECOVERY - DPKG on labtestweb2001 is OK: All packages OK [16:01:17] PROBLEM - puppet last run on etcd1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:01:47] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [16:06:16] (03CR) 10Alexandros Kosiaris: [C: 031] postgresql: Only set user password if different [puppet] - 10https://gerrit.wikimedia.org/r/329328 (owner: 10Tim Landscheidt) [16:07:57] !log Stop mysql and power off db1054 for maintenance - T156225 [16:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:00] T156225: Move db1054 to C3 - https://phabricator.wikimedia.org/T156225 [16:08:34] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/5223/ confirms this is ok" [puppet] - 10https://gerrit.wikimedia.org/r/334123 (https://phabricator.wikimedia.org/T156009) (owner: 10Giuseppe Lavagetto) [16:10:43] 06Operations, 10ops-eqiad, 10DBA, 10netops, 13Patch-For-Review: Move db1054 to A2 - https://phabricator.wikimedia.org/T156225#2969189 (10Marostegui) [16:14:57] RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:16:59] (03PS1) 10Cmjohnson: Updating dns entries for db1054 to match rack relocation T156225 [dns] - 
10https://gerrit.wikimedia.org/r/334138 [16:17:36] (03CR) 10Cmjohnson: [C: 032] Updating dns entries for db1054 to match rack relocation T156225 [dns] - 10https://gerrit.wikimedia.org/r/334138 (owner: 10Cmjohnson) [16:18:36] ah snap cmjohnson1 I was trying to allocate .206 too https://gerrit.wikimedia.org/r/#/c/334040/3/templates/wmnet :P [16:19:19] (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Change db1054 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334139 (https://phabricator.wikimedia.org/T156225) [16:19:28] (03CR) 10Elukey: [C: 04-1] "Need to replace on IP, already taken in the meantime by https://gerrit.wikimedia.org/r/334138" [dns] - 10https://gerrit.wikimedia.org/r/334040 (https://phabricator.wikimedia.org/T155654) (owner: 10Elukey) [16:19:47] elukey: will you get another one? I am pushing the mediawiki config files with the .206 that cmjohnson1 got [16:19:49] ah [16:19:52] thanks :) [16:19:58] sure sure :D [16:20:06] <3 [16:21:15] (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Change db1054 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334139 (https://phabricator.wikimedia.org/T156225) (owner: 10Marostegui) [16:22:52] (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Change db1054 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334139 (https://phabricator.wikimedia.org/T156225) (owner: 10Marostegui) [16:22:57] (03PS3) 10Giuseppe Lavagetto: role::etcd::common: move to profile, refactor [puppet] - 10https://gerrit.wikimedia.org/r/334124 (https://phabricator.wikimedia.org/T156009) [16:23:01] (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Change db1054 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334139 (https://phabricator.wikimedia.org/T156225) (owner: 10Marostegui) [16:24:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Change db1054 IP - T156225 (duration: 00m 41s) [16:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:16] T156225: Move db1054 to A2 - 
https://phabricator.wikimedia.org/T156225 [16:25:05] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Change db1054 IP - T156225 (duration: 00m 40s) [16:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:05] (03PS4) 10Giuseppe Lavagetto: role::etcd::common: move to profile, refactor [puppet] - 10https://gerrit.wikimedia.org/r/334124 (https://phabricator.wikimedia.org/T156009) [16:26:12] 06Operations, 10ops-eqiad, 10DBA, 10netops, 13Patch-For-Review: Move db1054 to A3 - https://phabricator.wikimedia.org/T156225#2969260 (10Marostegui) [16:26:39] 06Operations, 10ops-eqiad, 10DBA, 10netops, 13Patch-For-Review: Move db1054 to A3 - https://phabricator.wikimedia.org/T156225#2968022 (10Marostegui) in the end it will go to A3 as Chris found some issues on the racks we previously selected. [16:28:16] (03PS1) 10Jgreen: rename fdb2001 -> frdb2001 to fit consistent naming scheme [dns] - 10https://gerrit.wikimedia.org/r/334140 [16:28:58] (03PS1) 10Andrew Bogott: Horizon: update our custom auth hacks for mitaka [puppet] - 10https://gerrit.wikimedia.org/r/334141 [16:29:11] (03PS2) 10Andrew Bogott: Horizon: update our custom auth hacks for mitaka [puppet] - 10https://gerrit.wikimedia.org/r/334141 [16:29:17] RECOVERY - puppet last run on etcd1002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [16:30:42] (03CR) 10Andrew Bogott: [C: 032] Horizon: update our custom auth hacks for mitaka [puppet] - 10https://gerrit.wikimedia.org/r/334141 (owner: 10Andrew Bogott) [16:31:00] (03CR) 10Jgreen: [C: 032] rename fdb2001 -> frdb2001 to fit consistent naming scheme [dns] - 10https://gerrit.wikimedia.org/r/334140 (owner: 10Jgreen) [16:31:59] !log renamed fdb2001 to frdb2001 [16:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:36] (03PS5) 10Giuseppe Lavagetto: role::etcd::common: move to profile, refactor [puppet] - 10https://gerrit.wikimedia.org/r/334124 
(https://phabricator.wikimedia.org/T156009) [16:41:44] (03PS4) 10Elukey: Allocate aqs100[789]'s cassandra instances A and PTR records. [dns] - 10https://gerrit.wikimedia.org/r/334040 (https://phabricator.wikimedia.org/T155654) [16:44:22] (03PS2) 10Elukey: Add aqs100[789] to the related role [puppet] - 10https://gerrit.wikimedia.org/r/334035 (https://phabricator.wikimedia.org/T155654) [16:45:22] (03PS3) 10Elukey: Add aqs1007 to the related role [puppet] - 10https://gerrit.wikimedia.org/r/334035 (https://phabricator.wikimedia.org/T155654) [16:47:27] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1054" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334144 [16:47:38] (03CR) 10Marostegui: [C: 04-2] "Server still catching up" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334144 (owner: 10Marostegui) [16:55:26] 06Operations, 10ops-eqiad, 10DBA, 10netops, 13Patch-For-Review: Move db1054 to A3 - https://phabricator.wikimedia.org/T156225#2969337 (10Marostegui) 05Open>03Resolved a:03Cmjohnson db1054 has been moved. DNS updated db-eqiad,codfw files updated mysql and replication started finely. tendril updated... [16:55:31] 06Operations, 10DBA, 06Labs, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2969340 (10Marostegui) [16:56:42] !log gerrit: quick service reboot to pick up new java version [16:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:58] ah I was wondering about the 503s :P [16:57:07] 06Operations, 10DBA: Move db1073 to B3 - https://phabricator.wikimedia.org/T156126#2969346 (10Marostegui) This is probably not required anymore as we will do this: T156226 Will leave it open for now as it is not a bad a idea to have the 3 api host in 3 different racks anyways. We will evaluate tomorrow. [16:57:40] !log gerrit: everything back up! 
[16:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:32] I just wanted to say gerrit is down and I saw ostriches SAL [16:58:43] 06Operations, 06Labs, 10netops: asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues - https://phabricator.wikimedia.org/T155875#2969364 (10Cmjohnson) @faidon new switch has been installed. Also added an uplink module. The switch is accessible via mgmt [16:59:05] Amir1: Yeah, just had to restart the service real quick, was upgrading java [16:59:25] cool, thanks! [16:59:53] ssh works for me but the ui is still sending 503 [16:59:56] maybe overloaded [17:00:10] 503 still? Shouldn't be from load [17:00:17] That's a "I can't access the backend" [17:00:23] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_analytics/reportupdater],Exec[git_pull_geowiki-scripts],Exec[git_pull_statistics_mediawiki] [17:00:36] When I click on https://gerrit.wikimedia.org/r/#/c/333595/ [17:00:41] it says 503 [17:00:50] I did it right now [17:00:59] Amir1: WFM. Gerrit's API responses tend to remain very heavily cached in the browser.... [17:01:10] It'll sort itself out shortly [17:01:19] okay [17:01:21] 06Operations, 06Analytics-Kanban: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#2915651 (10Milimetric) Yep, this is definitely still happening. I took a look and what's happening is every time someone visits a page instrumented with piwik, piwik makes a new **connection**... 
[17:01:23] thanks [17:01:23] (03CR) 10Giuseppe Lavagetto: "seems ok, should be merged with relative care" [puppet] - 10https://gerrit.wikimedia.org/r/334124 (https://phabricator.wikimedia.org/T156009) (owner: 10Giuseppe Lavagetto) [17:01:35] (03PS6) 10Giuseppe Lavagetto: role::etcd::common: move to profile, refactor [puppet] - 10https://gerrit.wikimedia.org/r/334124 (https://phabricator.wikimedia.org/T156009) [17:03:03] (03CR) 10Chad: "Actually, this will be better." [debs/gerrit] - 10https://gerrit.wikimedia.org/r/333475 (owner: 10Paladox) [17:03:21] <_joe_> akosiaris: /win 18 [17:03:24] <_joe_> argh [17:03:37] <_joe_> akosiaris: I am going to merge that change, it should be harmless [17:03:47] <_joe_> worst case scenario, we lose kubernetes in prod [17:03:49] 06Operations, 10ops-eqiad, 10Datasets-General-or-Unknown: Degraded RAID on ms1001 - https://phabricator.wikimedia.org/T152367#2969375 (10Cmjohnson) 05Open>03Resolved this is resolved Enclosure Device ID: 25 Slot Number: 4 Drive's position: DiskGroup: 1, Span: 0, Arm: 4 Enclosure position: 1 Device Id:... [17:03:52] <_joe_> :P [17:04:01] which change ? aaa the etcd one ? [17:06:56] <_joe_> yes [17:07:04] <_joe_> it will probably restart etcd on the nodes [17:07:19] <_joe_> but that should be ok as long as it doesn't happen at the same time on every node [17:08:11] 06Operations, 10ops-eqiad: mw1239: memory scrubbing error - https://phabricator.wikimedia.org/T148421#2969391 (10Cmjohnson) 05Open>03Resolved The error has not returned...resolving this task. 
[17:09:00] 06Operations, 06Analytics-Kanban: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#2969394 (10Milimetric) a:03Milimetric [17:09:06] (03CR) 10Giuseppe Lavagetto: [C: 032] role::etcd::common: move to profile, refactor [puppet] - 10https://gerrit.wikimedia.org/r/334124 (https://phabricator.wikimedia.org/T156009) (owner: 10Giuseppe Lavagetto) [17:10:33] 06Operations, 10ops-eqiad: Rack and setup wdqs1003 - https://phabricator.wikimedia.org/T153349#2969399 (10Cmjohnson) p:05Triage>03High [17:11:06] 06Operations, 06Analytics-Kanban: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#2915651 (10jcrespo) Assuming this is mysql, this is something I can indeed help relatively easily without touching code. As a reminder, I do not only handle mysql for mediawiki and labs, I also... [17:11:24] (03PS9) 10Paladox: Gerrit: Add a systemd init script fro gerrit [debs/gerrit] - 10https://gerrit.wikimedia.org/r/333475 [17:11:43] RECOVERY - puppet last run on etcd1005 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:12:20] (03PS1) 10Alexandros Kosiaris: nagios: Specify a parents host relationship [puppet] - 10https://gerrit.wikimedia.org/r/334149 [17:12:34] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, 10Elasticsearch: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#2969404 (10Cmjohnson) p:05Normal>03High [17:13:26] (03CR) 10jerkins-bot: [V: 04-1] nagios: Specify a parents host relationship [puppet] - 10https://gerrit.wikimedia.org/r/334149 (owner: 10Alexandros Kosiaris) [17:14:43] RECOVERY - puppet last run on etcd1004 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:15:13] (03PS1) 10Ottomata: Cleanup existing symlinks for datasets.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/334150 (https://phabricator.wikimedia.org/T125854) [17:19:22] (03PS2) 10Ottomata: Cleanup existing symlinks 
for datasets.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/334150 (https://phabricator.wikimedia.org/T125854) [17:19:43] RECOVERY - puppet last run on etcd1006 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [17:20:05] (03PS3) 10Ottomata: Cleanup existing symlinks for datasets.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/334150 (https://phabricator.wikimedia.org/T125854) [17:22:37] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Check the size of every cluster in codfw to see if it matches eqiad's capacity - https://phabricator.wikimedia.org/T156023#2961839 (10fgiunchedi) Yes, in fact we can already answer these questions with Prometheu... [17:22:49] (03PS1) 10Dzahn: multatuli: spare::system -> test::system [puppet] - 10https://gerrit.wikimedia.org/r/334153 (https://phabricator.wikimedia.org/T156208) [17:23:07] (03CR) 10Ottomata: [C: 032] Cleanup existing symlinks for datasets.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/334150 (https://phabricator.wikimedia.org/T125854) (owner: 10Ottomata) [17:23:13] (03PS4) 10Ottomata: Cleanup existing symlinks for datasets.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/334150 (https://phabricator.wikimedia.org/T125854) [17:23:28] 06Operations, 10ops-eqiad, 10hardware-requests: Reclaim nobelium - https://phabricator.wikimedia.org/T142581#2969452 (10Cmjohnson) 05Open>03Resolved disks wiped, added to tracking sheet. [17:23:32] (03CR) 10Ottomata: [V: 032 C: 032] Cleanup existing symlinks for datasets.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/334150 (https://phabricator.wikimedia.org/T125854) (owner: 10Ottomata) [17:24:07] 06Operations, 10ops-esams, 13Patch-For-Review: reclaim multatuli - https://phabricator.wikimedia.org/T156208#2969454 (10Dzahn) Oops, ok. Sure, then just adding a comment and making it a "test::system" rather than "spare::system". 
[17:24:53] (03CR) 10Muehlenhoff: [C: 031] multatuli: spare::system -> test::system [puppet] - 10https://gerrit.wikimedia.org/r/334153 (https://phabricator.wikimedia.org/T156208) (owner: 10Dzahn) [17:26:09] (03PS1) 10Ottomata: Fix case typo in datasets.wikimedia.org apache vhost [puppet] - 10https://gerrit.wikimedia.org/r/334154 [17:26:19] (03PS1) 10Alexandros Kosiaris: realm.pp: Remove the pre 3.5 puppet handling code [puppet] - 10https://gerrit.wikimedia.org/r/334155 [17:26:23] (03PS2) 10Ottomata: Fix case typo in datasets.wikimedia.org apache vhost [puppet] - 10https://gerrit.wikimedia.org/r/334154 [17:26:34] (03CR) 10Ottomata: [V: 032 C: 032] Fix case typo in datasets.wikimedia.org apache vhost [puppet] - 10https://gerrit.wikimedia.org/r/334154 (owner: 10Ottomata) [17:27:03] PROBLEM - Blazegraph Port on wdqs2003 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused [17:27:23] PROBLEM - Blazegraph process on wdqs2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (blazegraph), regex args ^java .* blazegraph-service-.*-dist.war [17:27:30] (03CR) 10Alexandros Kosiaris: "Adding chasemp, andrewbogott in case we have some self-hosted puppetmasters that would break with this change and we don't want to break t" [puppet] - 10https://gerrit.wikimedia.org/r/334155 (owner: 10Alexandros Kosiaris) [17:27:33] PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:28:23] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:29:03] wdqs2003 is being reimaged... 
[17:33:43] 06Operations, 10ops-eqiad, 10hardware-requests: decommission stat1001 - https://phabricator.wikimedia.org/T154164#2969483 (10Cmjohnson) [17:33:48] 06Operations, 10ops-eqiad, 10hardware-requests: decommission stat1001 - https://phabricator.wikimedia.org/T154164#2902803 (10Cmjohnson) Disks wiped [17:33:54] (03CR) 10Dzahn: [C: 031] Allocate aqs100[789]'s cassandra instances A and PTR records. [dns] - 10https://gerrit.wikimedia.org/r/334040 (https://phabricator.wikimedia.org/T155654) (owner: 10Elukey) [17:37:00] (03CR) 10Cmjohnson: [C: 032] site.pp, DHCP: remove mw1017,mw1099 [puppet] - 10https://gerrit.wikimedia.org/r/334006 (https://phabricator.wikimedia.org/T151303) (owner: 10Dzahn) [17:37:49] (03PS2) 10Zppix: realm.pp: Remove the pre 3.5 puppet handling code [puppet] - 10https://gerrit.wikimedia.org/r/334155 (owner: 10Alexandros Kosiaris) [17:37:58] (03PS1) 10Urbanecm: [throttle] Her Girl Friday + Lenny Unconference / Editathon in NYC, Sat Jan 28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334156 [17:38:09] !log restarting and upgrading db2060 [17:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:36] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: decommission the old pay-lvs1001/pay-lvs1002 boxes - https://phabricator.wikimedia.org/T156284#2969512 (10Jgreen) [17:38:43] jynus when you get a minute, can you remind me what db2060 is used by? [17:40:51] (03PS3) 10Giuseppe Lavagetto: wmflib: add function to calculate htpasswd entries [puppet] - 10https://gerrit.wikimedia.org/r/334125 [17:41:13] (03PS2) 10Urbanecm: [throttle] Her Girl Friday + Lenny Unconference / Editathon in NYC, Sat Jan 28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334156 (https://phabricator.wikimedia.org/T156278) [17:42:03] (03PS5) 10Elukey: Allocate aqs100[789]'s cassandra instances A and PTR records. 
[dns] - 10https://gerrit.wikimedia.org/r/334040 (https://phabricator.wikimedia.org/T155654) [17:43:01] Zppix: is there a reason you're curious? It's defined here; https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/manifests/site.pp;1169c458820a1831a4d2b234ba5571926034eccc$632 [17:44:29] (03PS1) 10Urbanecm: Enable SandboxLink on gdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334157 (https://phabricator.wikimedia.org/T156281) [17:45:52] (03PS3) 10Urbanecm: [throttle] Her Girl Friday + Lenny Unconference / Editathon in NYC, 2017-01-28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334156 (https://phabricator.wikimedia.org/T156278) [17:46:40] (03Abandoned) 10Dzahn: remove multatuli.wikimedia.org, keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/334012 (https://phabricator.wikimedia.org/T156208) (owner: 10Dzahn) [17:49:53] (03CR) 10Elukey: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/334040 (https://phabricator.wikimedia.org/T155654) (owner: 10Elukey) [17:51:04] (03PS2) 10Cmjohnson: site.pp, DHCP: remove mw1017,mw1099 [puppet] - 10https://gerrit.wikimedia.org/r/334006 (https://phabricator.wikimedia.org/T151303) (owner: 10Dzahn) [17:51:10] (03CR) 10Cmjohnson: [V: 032 C: 032] site.pp, DHCP: remove mw1017,mw1099 [puppet] - 10https://gerrit.wikimedia.org/r/334006 (https://phabricator.wikimedia.org/T151303) (owner: 10Dzahn) [17:52:47] (03PS1) 10Chad: Remove another unused/ancient logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334160 [17:53:37] (03PS2) 10Chad: Remove another unused/ancient logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334160 [17:54:48] 06Operations, 10Cassandra, 10RESTBase, 06Services (doing): RESTBase k-r-v as Cassandra anti-pattern - https://phabricator.wikimedia.org/T144431#2969601 (10Eevans) [17:55:15] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review, 15User-Joe: Decommission mw1017, mw1099 - https://phabricator.wikimedia.org/T151295#2969619 (10Cmjohnson) [17:55:17] 06Operations, 
10ops-eqiad, 06DC-Ops, 13Patch-For-Review, 15User-Joe: Hardware decommission mw1017, mw1099 - https://phabricator.wikimedia.org/T151303#2969617 (10Cmjohnson) 05Open>03Resolved removed dns entries, removed site.pp reference and dhcpd entries. [17:55:53] (03Abandoned) 10Dzahn: site.pp, DHCP: remove multatuli [puppet] - 10https://gerrit.wikimedia.org/r/334011 (https://phabricator.wikimedia.org/T156208) (owner: 10Dzahn) [17:58:38] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review: decomission the netapps in EQIAD: nas1001-a, nas1001-b - https://phabricator.wikimedia.org/T124156#2969643 (10Cmjohnson) [17:58:45] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review: decomission the netapps in EQIAD: nas1001-a, nas1001-b - https://phabricator.wikimedia.org/T124156#1947460 (10Cmjohnson) 05Open>03Resolved [17:58:52] yay [18:00:04] bd808: Dear anthropoid, the time has come. Please deploy Striker (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170125T1800). [18:02:01] (03CR) 10Elukey: [C: 032] Allocate aqs100[789]'s cassandra instances A and PTR records. [dns] - 10https://gerrit.wikimedia.org/r/334040 (https://phabricator.wikimedia.org/T155654) (owner: 10Elukey) [18:02:37] !log running authdns-update on ns0.w.o to pick up changes made in https://gerrit.wikimedia.org/r/334040 [18:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:16] o/ I'll be starting soon [18:03:48] bd808 is there a changelog i could look at for striker for this upcoming deploy of striker? [18:05:19] (03CR) 10Chad: [C: 032] Remove another unused/ancient logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334160 (owner: 10Chad) [18:06:54] (03Merged) 10jenkins-bot: Remove another unused/ancient logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334160 (owner: 10Chad) [18:07:29] (03CR) 10Anomie: "This seems like a really weird special case situation." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/333653 (https://phabricator.wikimedia.org/T154064) (owner: 10Niharika29) [18:08:04] Zppix: https://wikitech.wikimedia.org/wiki/Toolsadmin.wikimedia.org/Deployments#2017-01-25 [18:08:05] (03CR) 10jenkins-bot: Remove another unused/ancient logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334160 (owner: 10Chad) [18:09:04] (03PS1) 10Ema: Pass config file name as a CLI argument [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/334163 [18:09:07] !log demon@tin Synchronized docroot/foundation/logos: rm a junk logo (duration: 00m 50s) [18:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:40] (03PS4) 10Elukey: Add aqs1007 to the related role [puppet] - 10https://gerrit.wikimedia.org/r/334035 (https://phabricator.wikimedia.org/T155654) [18:10:26] (03PS5) 10Elukey: Add aqs1007 to the related role [puppet] - 10https://gerrit.wikimedia.org/r/334035 (https://phabricator.wikimedia.org/T155654) [18:12:20] 06Operations, 10Cassandra, 10RESTBase, 06Services (doing): RESTBase k-r-v as Cassandra anti-pattern - https://phabricator.wikimedia.org/T144431#2969695 (10Eevans) [18:13:23] PROBLEM - puppet last run on mw1290 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:14:41] (03PS6) 10Elukey: Add aqs1007 to site.pp and bootstrap aqs1007-a [puppet] - 10https://gerrit.wikimedia.org/r/334035 (https://phabricator.wikimedia.org/T155654) [18:21:59] !log bd808@tin Starting deploy [striker/deploy@5aa3aa8]: Update Striker to 5aa3aa8 (T144710, T147024, T144712, T144711, T153935) [18:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:08] T153935: Allow changing LDAP password from Striker - https://phabricator.wikimedia.org/T153935 [18:22:09] T144710: Create Wikitech/LDAP accounts via a new user friendly guided workflow - https://phabricator.wikimedia.org/T144710 [18:22:09] T144711: Allow management of LDAP SSH keys - https://phabricator.wikimedia.org/T144711 [18:22:09] T144712: Check for 2FA protection and enforce validation of 2FA tokens - https://phabricator.wikimedia.org/T144712 [18:22:10] T147024: Striker should respect TitleBlacklist bans on new account names - https://phabricator.wikimedia.org/T147024 [18:22:23] !log bd808@tin Finished deploy [striker/deploy@5aa3aa8]: Update Striker to 5aa3aa8 (T144710, T147024, T144712, T144711, T153935) (duration: 00m 24s) [18:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:05] 06Operations, 10ops-codfw, 10DBA: db2060 not accessible - https://phabricator.wikimedia.org/T156161#2969747 (10jcrespo) I have restarted db2060 because there is no reason to have mysql down there- if it was corrupted, which I believe it shouldn't, we would find out, and if it isn't we lose nothing. We can al... 
[18:23:27] 06Operations, 10ops-codfw, 10DBA: db2060 not accessible - https://phabricator.wikimedia.org/T156161#2969748 (10jcrespo) p:05Triage>03Normal [18:26:13] 06Operations, 10ops-eqiad, 10hardware-requests: decommission stat1001 - https://phabricator.wikimedia.org/T154164#2969753 (10Cmjohnson) p:05Normal>03Low [18:31:23] (03PS2) 10Dzahn: multatuli: spare::system -> test::system [puppet] - 10https://gerrit.wikimedia.org/r/334153 (https://phabricator.wikimedia.org/T156208) [18:36:12] (03PS1) 10Ottomata: /srv/datasets.wikimedia.org -> /srv/datasets [puppet] - 10https://gerrit.wikimedia.org/r/334167 (https://phabricator.wikimedia.org/T132594) [18:36:49] (03CR) 10RobH: "I would NOT merge any change to remove mgmt dns until the systems are unracked. Otherwise they will, even when powered down, not release " [dns] - 10https://gerrit.wikimedia.org/r/334015 (https://phabricator.wikimedia.org/T130883) (owner: 10Dzahn) [18:39:20] (03CR) 10Ottomata: [C: 032] /srv/datasets.wikimedia.org -> /srv/datasets [puppet] - 10https://gerrit.wikimedia.org/r/334167 (https://phabricator.wikimedia.org/T132594) (owner: 10Ottomata) [18:41:23] RECOVERY - puppet last run on mw1290 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [18:42:16] (03Abandoned) 10Urbanecm: Allow bureaucrats to remove sysop rights on French Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334054 (https://phabricator.wikimedia.org/T156227) (owner: 10Urbanecm) [18:43:43] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. 
Failed resources (up to 3 shown): File[/srv/limn-public-data],File[/srv/aggregate-datasets],File[/srv/public-datasets] [18:48:59] (03PS3) 10Dzahn: multatuli: spare::system -> test::system [puppet] - 10https://gerrit.wikimedia.org/r/334153 (https://phabricator.wikimedia.org/T156208) [18:50:49] (03PS1) 10Ottomata: Serve analytics.wikimedia.org/datasets from /srv/datasets [puppet] - 10https://gerrit.wikimedia.org/r/334168 (https://phabricator.wikimedia.org/T132594) [18:52:43] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [18:53:03] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:53:03] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:53:03] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:53:36] mmm [18:53:53] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [18:54:03] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [18:54:03] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [18:54:35] (03CR) 10Dzahn: [C: 032] multatuli: spare::system -> test::system [puppet] - 10https://gerrit.wikimedia.org/r/334153 (https://phabricator.wikimedia.org/T156208) (owner: 10Dzahn) [18:55:24] someone is trying to dump from dbstore1001 [18:55:38] which may be a really bad idea [18:56:09] why is this happening now? [18:56:17] or is this the backups [18:56:18] ? [18:56:43] could it be related to analytics work for datasets? 
[18:56:50] no [18:57:01] I think the backups may be running now [18:57:15] (03CR) 10Ottomata: [C: 032] Serve analytics.wikimedia.org/datasets from /srv/datasets [puppet] - 10https://gerrit.wikimedia.org/r/334168 (https://phabricator.wikimedia.org/T132594) (owner: 10Ottomata) [18:57:20] (03PS2) 10Ottomata: Serve analytics.wikimedia.org/datasets from /srv/datasets [puppet] - 10https://gerrit.wikimedia.org/r/334168 (https://phabricator.wikimedia.org/T132594) [18:57:25] (03CR) 10Ottomata: [V: 032 C: 032] Serve analytics.wikimedia.org/datasets from /srv/datasets [puppet] - 10https://gerrit.wikimedia.org/r/334168 (https://phabricator.wikimedia.org/T132594) (owner: 10Ottomata) [18:57:38] and if alex is moving helium [18:57:59] dbstore may be overloaded/too much bandwidth [18:58:10] 06Operations, 10ops-esams, 13Patch-For-Review: reclaim multatuli - https://phabricator.wikimedia.org/T156208#2969883 (10Dzahn) 05Open>03declined [18:59:15] it would work much better if we had new disks there... [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170125T1900). [19:00:05] def not me! [19:00:17] i'm just moving urls around, and also 'datasets' here is not what you think it is! :) [19:00:21] (03CR) 10Dzahn: "yes, agree. i would actually shutdown -h now them. that's ok right? 
The thing is that until they are out of puppet they are still in Icing" [puppet] - 10https://gerrit.wikimedia.org/r/334010 (https://phabricator.wikimedia.org/T149793) (owner: 10Dzahn) [19:00:21] No changes [19:03:25] (03PS1) 10Ottomata: Allow access /srv/datasets from analytics.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/334172 [19:05:19] (03CR) 10Ottomata: [V: 032 C: 032] Allow access /srv/datasets from analytics.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/334172 (owner: 10Ottomata) [19:07:51] (03PS1) 10Ottomata: Remove extraneous trailing > [puppet] - 10https://gerrit.wikimedia.org/r/334173 [19:08:05] (03CR) 10Ottomata: [V: 032 C: 032] Remove extraneous trailing > [puppet] - 10https://gerrit.wikimedia.org/r/334173 (owner: 10Ottomata) [19:11:37] Something broke [19:11:42] Caches aren't clearing [19:11:50] https://en.wikisource.org/w/index.php?title=Special%3AWhatLinksHere&target=Template%3AAe&namespace=0 [19:12:07] Claims it's still used by https://en.wikisource.org/w/index.php?title=Special%3AWhatLinksHere&target=Template%3AAe&namespace=0 [19:12:14] I've checked the source pages [19:12:14] (03CR) 10Dzahn: [C: 04-1] "as Rob points out, keep mgmt until after hw is removed from rack" [dns] - 10https://gerrit.wikimedia.org/r/334013 (https://phabricator.wikimedia.org/T147313) (owner: 10Dzahn) [19:12:21] There's NO use of the template [19:12:33] on the pages that get transcluded [19:12:39] Fix your cache people! [19:12:42] * ShakespeareFan00 out [19:16:08] (03PS3) 10Dzahn: remove analytics1015, analytics1026. keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/334013 (https://phabricator.wikimedia.org/T147313) [19:16:56] (03PS4) 10Dzahn: remove analytics1015, analytics1026. keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/334013 (https://phabricator.wikimedia.org/T147313) [19:17:36] 06Operations, 10Annual-Report: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#2969952 (10Aklapper) >>!
In T151798#2894164, @ZMcCune wrote: > @Dzahn: Thank you! We will let you know. Hope to have the static pages ready in early January. @ZMcCune: Any news to share, now that i... [19:21:02] (03CR) 10Dzahn: [C: 032] remove analytics1015, analytics1026. keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/334013 (https://phabricator.wikimedia.org/T147313) (owner: 10Dzahn) [19:26:38] !log analytics1015,analytics1026 - decom: remove DNS names, delete salt keys, revoke puppet certs, puppet node clean (to remove from icinga) (T147313) [19:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:43] T147313: Decommission analytics1026 and analytics1015 - https://phabricator.wikimedia.org/T147313 [19:29:07] (03CR) 10DatGuy: [C: 031] "Very simple change. Should be a quick +2 if requested." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334157 (https://phabricator.wikimedia.org/T156281) (owner: 10Urbanecm) [19:30:43] (03CR) 10Dzahn: "This would benefit from a link to the wiki page where it was discussed in the gd community." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334157 (https://phabricator.wikimedia.org/T156281) (owner: 10Urbanecm) [19:33:36] PROBLEM - puppet last run on elastic1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:36:36] (03PS2) 10Dzahn: apache-fast-test: replace mw1017 with mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/334007 [19:37:32] (03CR) 10Dzahn: [C: 032] "mw1017 is no more. but was still the default for this test script. 
fixing that" [puppet] - 10https://gerrit.wikimedia.org/r/334007 (owner: 10Dzahn) [19:39:53] (03CR) 10Dzahn: [C: 04-1] "amending to actually keep mgmt" [dns] - 10https://gerrit.wikimedia.org/r/334014 (https://phabricator.wikimedia.org/T149793) (owner: 10Dzahn) [19:40:12] (03CR) 10Dzahn: [C: 04-1] "amending to actually keep mgmt" [dns] - 10https://gerrit.wikimedia.org/r/334015 (https://phabricator.wikimedia.org/T130883) (owner: 10Dzahn) [19:43:00] (03CR) 10Dzahn: "the intended order would be: schedule (eternal) downtime in icinga for hosts and services, shutdown -h now on the servers, merge this, rev" [puppet] - 10https://gerrit.wikimedia.org/r/334005 (https://phabricator.wikimedia.org/T130883) (owner: 10Dzahn) [19:45:17] 06Operations, 10Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2970074 (10RobH) [19:46:10] mutante: https://github.com/search?utf8=%E2%9C%93&q=org%3Awikimedia+mw1017&type=Code&ref=searchresults [19:46:14] Looks like we got them all now [19:46:25] apache-fast-test and ChromeWikimediaDebug were the last one [19:46:29] (aside from three comments) [19:46:55] 06Operations, 10Traffic: convert wikitech-static.wikimedia.org to use LE rather than GS certificate - https://phabricator.wikimedia.org/T156294#2970080 (10RobH) [19:49:33] 06Operations, 10Traffic: convert wikitech-static.wikimedia.org to use LE rather than GS certificate - https://phabricator.wikimedia.org/T156294#2970098 (10Dzahn) Yep, i would suggest to install https://certbot.eff.org/ and run that there. It would get the cert and also create the Apache config snippet. [19:54:09] 06Operations, 10Traffic: convert wikitech-static.wikimedia.org to use LE rather than GS certificate - https://phabricator.wikimedia.org/T156294#2970105 (10RobH) Also this doesn't appear to be in our monitoring, and it should be in icinga. I'm adding now. 
[19:54:54] (03PS1) 10RobH: adding ssl monitoring for wikitech-static [puppet] - 10https://gerrit.wikimedia.org/r/334177 [19:55:13] (03PS1) 10Gergő Tisza: Set up a Swift backend for prod Commons thumbnails on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334178 (https://phabricator.wikimedia.org/T145496) [19:55:43] Krinkle: :) yeas, i did a quick grep and also just saw the comments, in puppet repo only though [19:55:50] 06Operations, 10Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2970113 (10Krenair) [19:55:54] 06Operations, 10Traffic: convert wikitech-static.wikimedia.org to use LE rather than GS certificate - https://phabricator.wikimedia.org/T156294#2970111 (10Krenair) 05Open>03Invalid It already runs LE. You can add me to the monitoring if you like. [19:56:07] (03Abandoned) 10Gergő Tisza: Set up a Swift backend for prod Commons thumbnails on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334178 (https://phabricator.wikimedia.org/T145496) (owner: 10Gergő Tisza) [19:56:54] Krenair: why close my task? [19:57:00] oh, nm [19:57:04] i see i missed the comment [19:57:12] mutante: so uhhh, yeah... done? [19:57:15] =P [19:57:27] so ill modify my patchset now [19:58:36] (03PS2) 10RobH: adding ssl monitoring for wikitech-static [puppet] - 10https://gerrit.wikimedia.org/r/334177 [19:58:52] 06Operations, 10Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2970121 (10RobH) [19:59:13] 06Operations, 10Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2240497 (10RobH) Seems wikitech-static was converted previously, so it was already done. [19:59:39] (03CR) 10RobH: [C: 032] adding ssl monitoring for wikitech-static [puppet] - 10https://gerrit.wikimedia.org/r/334177 (owner: 10RobH) [20:00:04] twentyafterfour: Dear anthropoid, the time has come. 
Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170125T2000). [20:00:33] robh: i actually commented on that, there was something to amend [20:00:45] ? [20:00:52] oops, i hit submit and it says "change is closed" [20:00:55] awesome timing [20:00:59] oh, what was wrong? [20:01:05] i changed to lets encrypt [20:01:10] robh: the word "policy" in the resource name and description [20:01:27] (03CR) 10Alex Monk: adding ssl monitoring for wikitech-static (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/334177 (owner: 10RobH) [20:01:49] bleh [20:01:52] fixing [20:01:57] I think this patch will cause puppet errors [20:02:04] oh but you already merged it [20:02:10] (03PS1) 10RobH: Revert "adding ssl monitoring for wikitech-static" [puppet] - 10https://gerrit.wikimedia.org/r/334179 [20:02:16] you don't need to revert [20:02:20] just follow-up a bit [20:02:31] +1 [20:02:51] meh, then others complain my patch should be a single all inclusive but sure [20:02:57] i disabled puppet agent on einsteinium [20:03:00] so icinga wont suffer [20:03:19] cool, yep [20:05:25] i was too quick to merge, heh [20:05:26] PROBLEM - Host heka is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2205.30 ms [20:05:27] (03PS1) 10RobH: fixing my last patch, wikitech-static ssl monitoring [puppet] - 10https://gerrit.wikimedia.org/r/334180 [20:05:31] PROBLEM - Host pay-lvs2001 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2208.30 ms [20:05:39] ahhhh, did we just lose a pfw in codfw? [20:05:45] Jeff_Green: ^ [20:05:46] RECOVERY - Host heka is UP: PING WARNING - Packet loss = 14%, RTA = 1164.35 ms [20:05:51] RECOVERY - Host pay-lvs2001 is UP: PING WARNING - Packet loss = 14%, RTA = 1164.49 ms [20:05:55] ok.... [20:05:57] wtf [20:06:01] another reboot? [20:06:05] incoming pages [20:06:06] only 2 though [20:06:06] woooo netflap :-) [20:06:15] if pfw fails its half the rack [20:06:22] robh what's going on?
[20:06:26] 2 downs and 3 ups [20:06:34] Zppix: with what specifically? [20:06:45] strange [20:06:46] We're now discussing trying to determine what is up with the codfw frack systems. [20:06:47] s/3/2/ [20:08:06] (03PS2) 10Dzahn: remove db1019, db1042, keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/334014 (https://phabricator.wikimedia.org/T149793) [20:08:44] (03CR) 10Dzahn: [C: 031] fixing my last patch, wikitech-static ssl monitoring [puppet] - 10https://gerrit.wikimedia.org/r/334180 (owner: 10RobH) [20:08:53] (03CR) 10RobH: [C: 032] fixing my last patch, wikitech-static ssl monitoring [puppet] - 10https://gerrit.wikimedia.org/r/334180 (owner: 10RobH) [20:08:57] mutante: thx! [20:09:10] i was waiting, could you tell since my +2 was a second after your +1? ;] [20:09:41] yep, np [20:09:54] laptop battery critically low. have to move inside. be back soon [20:10:22] robh: you can also see the result on tegmen before einsteinium if you want [20:10:46] yeah good plan will do [20:10:53] 2 servers with icinga role, one being the "live" , brb [20:11:16] yeah, i thought about halting puppet on tegmen when i did for eqiad host [20:11:21] but figured meh it can fail its the standby. [20:11:56] PROBLEM - Host heka is DOWN: PING CRITICAL - Packet loss = 100% [20:12:12] hrmmm [20:12:19] again? [20:12:30] im fixing my icinga break, then i'll go back to poking the non-active frack issue [20:12:34] unsuperduper [20:12:36] PROBLEM - Host pay-lvs2001 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 4485.14 ms [20:12:36] anything i can do to help sort this out? [20:12:59] same two hosts, i think that pfw is being shitty [20:13:26] unfortunately, those particular devices are stumping our actual network admins on how to solve (due to firmware bugs) so im not sure how i can help ;P [20:13:35] Zppix: don't think so but thank you for offering. [20:14:09] maybe downtime the host for a bit if it starts happening frequently?
[20:14:14] *hosts [20:14:42] ok, wikitech-static is now properly monitored by icinga. well, its in monitoring and is pending the first check in a few minutes. [20:15:16] Jeff_Green: Am I correct in my assumption that we still have nothing 'active' on codfw for frack? [20:15:19] (03CR) 10Gergő Tisza: "I0bd4b539 is probably a better approach." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334178 (https://phabricator.wikimedia.org/T145496) (owner: 10Gergő Tisza) [20:15:46] RECOVERY - Host heka is UP: PING WARNING - Packet loss = 0%, RTA = 650.10 ms [20:15:47] just wondering if we should be calling network admins due to severity. [20:15:57] 06Operations, 10ops-eqiad, 10hardware-requests, 13Patch-For-Review: Decommission labsdb1002 - https://phabricator.wikimedia.org/T146455#2970182 (10Cmjohnson) a:03Cmjohnson [20:16:06] I assumed that [20:16:23] robh: yeah probably [20:17:20] the fact that only a couple hosts are flapping makes me think it's in the process of falling over but has not completely yet [20:17:46] RECOVERY - Host pay-lvs2001 is UP: PING WARNING - Packet loss = 0%, RTA = 1432.15 ms [20:19:19] 06Operations, 06Commons, 06Multimedia, 10media-storage, 15User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101#2970235 (10Srittau) To be honest, I find priority "normal" to be worrying. If potentially unrecoverable dat...
[20:22:56] PROBLEM - Host pay-lvs2001 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 4466.25 ms [20:23:56] 07Puppet, 06Labs, 10Labs-Infrastructure: Puppet failure on instance creation - https://phabricator.wikimedia.org/T156297#2970242 (10DeltaQuad) [20:36:56] !log upgrading nodejs-legacy (it is just the symlink) to v6 on parsoid hosts T149331 [20:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:00] T149331: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331 [20:37:06] RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational [20:37:16] RECOVERY - Blazegraph Port on wdqs2003 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 [20:37:26] RECOVERY - Blazegraph process on wdqs2003 is OK: PROCS OK: 1 process with UID = 997 (blazegraph), regex args ^java .* blazegraph-service-.*-dist.war [20:41:11] 07Puppet, 06Labs, 10Labs-Infrastructure: Puppet failure on instance creation - https://phabricator.wikimedia.org/T156297#2970184 (10bd808) The key bit here seems to be `Could not request certificate: Connection refused` when trying to talk to the Puppetmaster. The `+ sed -i s/_MASTER_//g /etc/puppet/puppet.c... [20:44:03] !log deploying mediawiki 1.29.0-wmf.9 to group1 wikis [20:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:52] 06Operations, 10Traffic, 10Wikimedia-Blog, 07HTTPS: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2970304 (10EdErhart-WMF) @BBlack Automattic has done this. Can someone check and make sure it's been set correctly before we close the ticket? 
[20:47:45] (03PS1) 1020after4: group1 wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334185 [20:47:47] (03CR) 1020after4: [C: 032] group1 wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334185 (owner: 1020after4) [20:49:16] (03Merged) 10jenkins-bot: group1 wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334185 (owner: 1020after4) [20:49:27] (03CR) 10jenkins-bot: group1 wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334185 (owner: 1020after4) [20:50:06] 07Puppet, 06Labs, 10Labs-Infrastructure: Puppet failure on instance creation - https://phabricator.wikimedia.org/T156297#2970313 (10bd808) I tried to reproduce by creating a new instance and see a big difference in that initial setup: ``` + project=mediawiki-vagrant ++ curl http://169.254.169.254/1.0/meta-da... [20:50:10] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.29.0-wmf.9 [20:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:59] hmm [20:52:00] Warning: Empty regular expression in /srv/mediawiki/php-1.29.0-wmf.9/includes/parser/DateFormatter.php on line 200 [20:52:21] pl proc line: 2959: warning: points must have either 4 or 2 values per line [20:53:02] pl proc line, yay [20:53:05] all day errday [20:53:10] nobody ever gonna fix [20:58:05] MaxSem: ebernhardson: as the only two people who have touched DateFormatter recently... either of you care to help me figure out ^ [20:58:43] looking [20:59:02] I don't see anything in your change that would cause this but it might be related [20:59:42] it removes the declaration of all these vars? $rxDM, $rxMD, $rxDMY, $rxYDM, $rxMDY, $rxYMD; [21:00:05] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. 
Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170125T2100). [21:00:10] 06Operations, 10scap, 06Release-Engineering-Team (Long-Lived-Branches): Make git 2.2.0+ (preferably 2.8.x) available - https://phabricator.wikimedia.org/T140927#2970362 (10demon) p:05Low>03Normal Actually, stretch/testing has 2.11 available. That would solve all of the above and is easily backported to j... [21:00:21] https://phabricator.wikimedia.org/rMW4ca09bd76f19614464414b692b8f57ac8b7f5495 is the change [21:00:35] no mobileapps deployment today [21:01:04] twentyafterfour, it removes unused stuff. visibility changes would've resulted in fatals [21:01:15] so not mine but still need to figure this out [21:02:28] well it's a warning not a fatal, so I'm not getting a stack [21:02:31] It's probably something higher up that's passing emptiness down into the date formatter [21:02:40] cf: AbuseFilter regex spam [21:03:03] ostriches: yeah, I just thought I'd start at the bottom and this file was recently touched for the first time in a year so it was worth looking at [21:03:28] 07Puppet, 06Labs, 10Labs-Infrastructure: Puppet failure on instance creation - https://phabricator.wikimedia.org/T156297#2970184 (10chasemp) Does project use its own puppetmaster? [21:03:31] So now just make a list of everything that calls the date formatter :p [21:03:37] lol [21:03:39] twentyafterfour: did you notice the blocking task i opened ? [21:04:03] matanya: I removed it from blocking - it appears to only happen on that one specific revision [21:04:08] and it's a valid error in that case [21:04:15] ok, thanks [21:04:40] matanya: thanks for reporting it though! [21:04:45] sure [21:05:30] The FlaggedRevs fatal is more concerning to me than DateFormatter. 
[21:05:36] That's only going to get worse on the 'pedias [21:05:36] ostriches, it's a parser date formatter, so it's called from like 2 places [21:06:08] extract2 is broken, heh [21:06:27] everything's on fire, heh? [21:07:26] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:07:26] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:07:26] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:07:36] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:07:36] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:07:36] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:07:36] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:08:03] uhm [21:08:08] hey dbstore1001, when I said everything I didn't specifically mean you [21:08:16] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [21:08:16] lol [21:08:16] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:08:16] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:08:26] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [21:08:26] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:08:26] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:08:26] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [21:13:43] (03PS1) 10Ema: Add PyOpenSSL to requirements.txt [debs/pybal] - 10https://gerrit.wikimedia.org/r/334193 [21:17:22] 06Operations, 06Commons, 06Multimedia, 10media-storage, 15User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101#2970441 (10Aklapper) @Srittau: Feel free to elaborate why this task is more urgent than other tasks on the... [21:19:46] PROBLEM - Router interfaces on pfw-codfw is CRITICAL: CRITICAL: host 208.80.153.195, interfaces up: 45, down: 12, dormant: 0, excluded: 0, unused: 0BRge-11/0/12: down - BRswfab1: down - BRvlan.2140: down - Subnet frack-management-codfwBRswfab1.0: down - BRfab1: down - BRswfab0.0: down - BRswfab0: down - BRxe-15/0/1: down - BRvlan.2137: down - Subnet frack-listenerdmz-codfwBRvlan.2133: down - Subnet frack-bastion-codfwBRge-11/0/ [21:20:16] PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [21:20:42] lol [21:20:46] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/3: down - Core: pfw-codfw:xe-6/0/0 {#10900} [10Gbps DF]BR [21:21:06] PROBLEM - Host saiph is DOWN: PING CRITICAL - Packet loss = 100% [21:21:31] Jeff_Green: didn't you schedule the downtime? [21:21:42] paged [21:21:49] saiph [21:22:19] (03PS1) 10Chad: Rewrite extract2 to handle Article::getContent() disappearing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334196 [21:22:30] i did, wasn't expecting saiph to be affected [21:22:46] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [21:25:16] RECOVERY - Host saiph is UP: PING OK - Packet loss = 0%, RTA = 297.45 ms [21:25:46] RECOVERY - Router interfaces on pfw-codfw is OK: OK: host 208.80.153.195, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 [21:25:48] (03PS2) 10Chad: Rewrite extract2 to handle Article::getContent() disappearing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334196 [21:25:50] why thank you saiph for letting us know [21:26:03] (not really snarking, just amused) [21:28:06] (03PS3) 10Chad: Rewrite extract2 to handle Article::getContent() disappearing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334196 [21:30:06] PROBLEM - check_puppetrun on betelgeuse is CRITICAL: CRITICAL: Puppet has 17 failures [21:31:15] (03PS2) 10Ema: Add PyOpenSSL to requirements.txt, explain how to run tests [debs/pybal] - 10https://gerrit.wikimedia.org/r/334193 [21:32:27] File not found: /srv/mediawiki/docroot/wikimedia.org/w/../multiversion/MWMultiVersion.php in /srv/mediawiki/docroot/wikimedia.org/w/extract2.php on line 5 [21:32:31] How is that even possible? [21:32:36] (and nothing changed there...) 
[21:33:07] did the docroot move from /wikimedia/ to .org similar to wikidata or something? [21:33:22] Not for this one [21:33:28] I'm on meta w/ mwdebug1002 [21:34:19] same on debug1001? [21:34:33] or somebody else's change debugging an unrelated thing causes it to behave differently? [21:34:53] I'm baffled...this wasn't failing like this in my initial testing. [21:35:04] It was failing further down... [21:35:06] RECOVERY - check_puppetrun on betelgeuse is OK: OK: Puppet is currently enabled, last run 235 seconds ago with 0 failures [21:36:16] Other files in w/ are working fine... [21:36:21] And have same inclusion line [21:39:49] !log reload pfw1-codfw node 0 in an effort to debug high RTTs [21:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:56] ostriches: confirmed, exact same require_once in other files there [21:47:06] and extract2.php is identical on debug1001 and 1002 [21:47:49] (03PS1) 10Rush: tool: convert HBA source host mechanism to static [puppet] - 10https://gerrit.wikimedia.org/r/334203 [21:48:10] (03CR) 10Dzahn: [C: 04-1] "not needed anymore. 
it already works after your follow-up" [puppet] - 10https://gerrit.wikimedia.org/r/334179 (owner: 10RobH) [21:48:12] (03PS2) 10Rush: tool: convert HBA source host mechanism to static [puppet] - 10https://gerrit.wikimedia.org/r/334203 [21:49:16] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [21:50:24] (03CR) 10Dzahn: "merge after the puppet removal and shutdown" [dns] - 10https://gerrit.wikimedia.org/r/334014 (https://phabricator.wikimedia.org/T149793) (owner: 10Dzahn) [21:51:23] (03PS3) 10BryanDavis: tool: convert HBA source host mechanism to static [puppet] - 10https://gerrit.wikimedia.org/r/334203 (https://phabricator.wikimedia.org/T156168) (owner: 10Rush) [21:52:38] ugh I can't figure out T156310 either [21:52:38] T156310: Fatal error: Call to undefined method Revision::getText() in extensions/FlaggedRevs/backend/FlaggedRevision.php on line 480 - https://phabricator.wikimedia.org/T156310 [21:52:49] none of the recent changes there look relevant [21:55:44] (03CR) 10BryanDavis: "If you wanted to make it slightly more consolidated to add or remove a host, there could be a hiera hash of hostname:ip and the relevant p" [puppet] - 10https://gerrit.wikimedia.org/r/334203 (https://phabricator.wikimedia.org/T156168) (owner: 10Rush) [21:55:53] twentyafterfour: perhaps just because i haven't looked at it, but https://github.com/wikimedia/mediawiki-extensions-FlaggedRevs/blob/wmf/1.29.0-wmf.9/backend/FlaggedRevision.php#L480 calls getText, and https://gerrit.wikimedia.org/r/#/c/332933/ removed that function [21:56:04] (03PS2) 10Dzahn: remove cp3011-cp3022, keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/334015 (https://phabricator.wikimedia.org/T130883) [21:56:22] (03CR) 10jerkins-bot: [V: 04-1] tool: convert HBA source host mechanism to static [puppet] - 10https://gerrit.wikimedia.org/r/334203 (https://phabricator.wikimedia.org/T156168) (owner: 10Rush) [21:56:30] ebernhardson: indeed that looks
like the culprit [21:57:40] that call to wfDeprecated(...) that was in the function should have been logging and telling us to fix this for some time though, i wonder if those actually get logged [21:58:25] (03PS4) 10BryanDavis: tool: convert HBA source host mechanism to static [puppet] - 10https://gerrit.wikimedia.org/r/334203 (https://phabricator.wikimedia.org/T156168) (owner: 10Rush) [21:58:28] lol [21:58:43] twentyafterfour https://github.com/wikimedia/mediawiki/commit/0958f53373671e212f01bf3987e405c6121a2d70 [21:58:44] I've never seen "deprecated" logs in kibana but perhaps they are just filtered out somewhere [21:58:57] (03CR) 10Dzahn: "amended to keep mgmt. the order would be: remove from puppet, shutdown, merge this, physically take out of rack, remove mgmt DNS" [dns] - 10https://gerrit.wikimedia.org/r/334015 (https://phabricator.wikimedia.org/T130883) (owner: 10Dzahn) [21:59:19] twentyafterfour it should be replaced with getContent [21:59:19] i took a quick stroll through how wfDeprecated works but .... it's actually quite complicated [21:59:27] :-/ [21:59:40] it's not something obvious like getting a logger from the factory and sending a message [21:59:54] of course, that would be too straightforward [22:00:57] hmm, it looks like it should eventually end up in a log group called 'deprecated' [22:00:59] (03PS5) 10Dzahn: varnish misc: add phab2001 as a backend for phab-new [puppet] - 10https://gerrit.wikimedia.org/r/324797 (https://phabricator.wikimedia.org/T137928) [22:01:44] which isn't turned on in our prod logging config, I'll ship a patch up [22:02:56] hahah that's sweet, we only care to warn 3rd parties about deprecations ;) [22:07:19] ebernhardson: oh my [22:09:17] hmm, it's not as easy as i had thought ... we also need to call MWDebug::init() to enable deprecation logging, but that also does some other stuff ...
needs some more evaluation [22:10:45] (03PS2) 10Madhuvishy: nfs: Make labstore1004 source for backups to secondary DC [puppet] - 10https://gerrit.wikimedia.org/r/333327 [22:10:56] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:11:34] 06Operations, 06Labs, 13Patch-For-Review: Set up monitoring for secondary labstore HA cluster - https://phabricator.wikimedia.org/T144633#2970617 (10chasemp) @Madhuvishy satisfied we can close? [22:11:53] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: Migrate misc to secondary labstore HA cluster - https://phabricator.wikimedia.org/T154336#2970618 (10chasemp) @Madhuvishy do you think we can close this now? [22:12:03] (03PS1) 10EBernhardson: Enable deprecation logging in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334206 (https://phabricator.wikimedia.org/T156310) [22:13:56] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2970629 (10madhuvishy) [22:13:59] 06Operations, 06Labs, 13Patch-For-Review: Set up monitoring for secondary labstore HA cluster - https://phabricator.wikimedia.org/T144633#2970627 (10madhuvishy) 05Open>03Resolved @chasemp Yup, closing. [22:14:56] (03PS1) 10Dzahn: delete stream.wikimedia.org SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/334207 [22:17:25] (03CR) 10Tim Landscheidt: [C: 04-1] "After moving this data to Hiera, I would add a fail() if a bastion host has its hostname and ip_eth0 not listed in that Hiera hash." 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/334203 (https://phabricator.wikimedia.org/T156168) (owner: 10Rush) [22:18:08] (03PS3) 10Madhuvishy: nfs: Make labstore1004 source for backups to secondary DC [puppet] - 10https://gerrit.wikimedia.org/r/333327 [22:18:32] (03PS1) 10Dzahn: ssl: delete ecc-uni.wikimedia.org.crt [puppet] - 10https://gerrit.wikimedia.org/r/334209 [22:20:55] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2970654 (10chasemp) [22:21:58] (03PS1) 10Dzahn: delete uni.wikimedia.org.crt [puppet] - 10https://gerrit.wikimedia.org/r/334210 [22:24:14] (03PS1) 10Dzahn: ssl: delete ldap-eqiad/ldap-codfw.wikimedia.org SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/334211 [22:24:42] (03PS2) 10Dzahn: ssl: delete stream.wikimedia.org cert [puppet] - 10https://gerrit.wikimedia.org/r/334207 [22:25:19] (03PS2) 10Dzahn: ssl: delete ldap-eqiad/ldap-codfw.wikimedia.org certs [puppet] - 10https://gerrit.wikimedia.org/r/334211 [22:27:50] 06Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Current state and next steps for RESTBase storage - https://phabricator.wikimedia.org/T152724#2970665 (10Eevans) [22:28:33] (03PS2) 10Dzahn: ssl: delete uni.wikimedia.org.crt [puppet] - 10https://gerrit.wikimedia.org/r/334210 [22:28:47] 07Puppet, 06Labs, 10Labs-Infrastructure: Puppet failure on instance creation - https://phabricator.wikimedia.org/T156297#2970675 (10DeltaQuad) As mentioned on IRC, I have no clue. The option was not presented in the setup options as far as I know. [22:30:11] twentyafterfour, I got distracted and now don't see regexp errors - has it been resolved? 
[22:30:46] MaxSem: not by anything I did [22:31:47] (03PS3) 10Volans: icinga: raid_handler improvements [puppet] - 10https://gerrit.wikimedia.org/r/321642 (https://phabricator.wikimedia.org/T149913) [22:32:42] (03CR) 10Madhuvishy: [C: 032] nfs: Make labstore1004 source for backups to secondary DC [puppet] - 10https://gerrit.wikimedia.org/r/333327 (owner: 10Madhuvishy) [22:35:00] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/rOPUP1125884b5bfc8b711d29ca33bc699bd47fc14ac5" [puppet] - 10https://gerrit.wikimedia.org/r/334209 (owner: 10Dzahn) [22:35:05] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/rOPUP1125884b5bfc8b711d29ca33bc699bd47fc14ac5" [puppet] - 10https://gerrit.wikimedia.org/r/334210 (owner: 10Dzahn) [22:36:15] 06Operations, 10Traffic: convert wikitech-static.wikimedia.org to use LE rather than GS certificate - https://phabricator.wikimedia.org/T156294#2970689 (10Dzahn) added an Icinga contact for Krenair. but contact is not in a group yet. [22:36:42] 06Operations, 10Traffic: convert wikitech-static.wikimedia.org to use LE rather than GS certificate - https://phabricator.wikimedia.org/T156294#2970693 (10Dzahn) @Robh wanna link the monitoring change [22:36:59] (03CR) 10Alex Monk: [C: 031] ssl: delete stream.wikimedia.org cert [puppet] - 10https://gerrit.wikimedia.org/r/334207 (owner: 10Dzahn) [22:38:02] (03PS4) 10Volans: icinga: raid_handler improvements [puppet] - 10https://gerrit.wikimedia.org/r/321642 (https://phabricator.wikimedia.org/T149913) [22:38:56] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [22:40:21] (03PS4) 10Chad: Rewrite extract2 to handle Article::getContent() disappearing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334196 [22:40:44] (03CR) 10Volans: [C: 032] icinga: raid_handler improvements [puppet] - 10https://gerrit.wikimedia.org/r/321642 (https://phabricator.wikimedia.org/T149913) (owner: 10Volans) [22:41:20] twentyafterfour, 
https://gerrit.wikimedia.org/r/#/c/334214/ [22:41:26] mutante: Sooo, the "cannot include" thingie with extract2? Specific to mwdebug1002 [22:41:35] mwdebug1001 seems fine (and my patch fixes my breakages) [22:43:22] (03CR) 10Chad: [C: 032] Rewrite extract2 to handle Article::getContent() disappearing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334196 (owner: 10Chad) [22:45:05] I still haven't got around to https://gerrit.wikimedia.org/r/#/c/298397/ [22:45:08] I really need to do that [22:45:13] Maybe in the midnight swat? [22:45:15] jouncebot, next [22:45:15] In 1 hour(s) and 14 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170126T0000) [22:45:56] (03Merged) 10jenkins-bot: Rewrite extract2 to handle Article::getContent() disappearing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334196 (owner: 10Chad) [22:46:08] ugh, I bet I can't safely deploy that without splitting it up due to the multiple files thing [22:47:02] If it wasn't something important like file storage I just say sync-dir and endure the brief storm [22:47:03] !log demon@tin Synchronized w/extract2.php: (no message) (duration: 00m 40s) [22:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:09] But breaking file storage mid-request could be bad [22:48:25] yeah [22:48:25] :) [22:48:27] (03CR) 10jenkins-bot: Rewrite extract2 to handle Article::getContent() disappearing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334196 (owner: 10Chad) [22:48:29] I'll split it [22:48:53] Krenair: hmmm... 
you could sync the rename of filebackend, then sync CommonSettings and finally sync-dir to clean up [22:49:12] but that might be easier to split for sure [22:49:30] between syncing of the rename and syncing of CommonSettings, everyone would get a fatal due to the old path no longer existing [22:50:01] no, because the old filebackend-production would not be removed until the sync-dir [22:50:10] oh right [22:50:23] Long as getRealmSpecificCrap() picks production over no extension :) [22:50:26] You should be fine [22:50:29] it's taking advantage of a bit of goofiness [22:50:41] it's not using getRealmSpecificCrap [22:51:01] yeah I can do this safely [22:51:40] bit of temporary messing around on the server-side only without going via the repo, but whatever [22:51:47] * ostriches writes a patch to rename that function to pickAFileAnyFile() [22:52:01] ostriches: just use rand() :) [22:52:14] you know its behaviour used to make sense to some of us [22:52:34] It makes sense, sure. But the idea of realm-specific config is ugly to begin with [22:52:41] cf: puppet [22:52:43] :D [22:53:22] I don't disagree [22:53:35] I'm just saying it wasn't random [22:53:36] 06Operations, 10Traffic: convert wikitech-static.wikimedia.org to use LE rather than GS certificate - https://phabricator.wikimedia.org/T156294#2970706 (10RobH) I linked this task in the commit. I thought it would show here post merge.... odd. I know it shows when bug:task# shows but should also post merge... [22:53:45] Krenair: I know, I'm just making fun of it all [22:53:54] This house of cards we've built [22:54:00] ok :) [22:54:11] Unwinding MWVersion took weeks.
That's how fragile this whole thing is [22:54:15] 06Operations, 10Traffic: convert wikitech-static.wikimedia.org to use LE rather than GS certificate - https://phabricator.wikimedia.org/T156294#2970707 (10RobH) https://gerrit.wikimedia.org/r/#/c/334180/ & https://gerrit.wikimedia.org/r/#/c/334177/ [22:54:17] One wrong move and it all falls over :D [22:55:22] a house of cards is an upgrade, it used to just be spinning plates! [22:55:38] * robh isn't being helpful. [22:55:49] I'm trying to put some tape on the cards' edges [22:55:55] So some parts can stay up [22:55:58] 06Operations, 10Traffic: convert wikitech-static.wikimedia.org to use LE rather than GS certificate - https://phabricator.wikimedia.org/T156294#2970708 (10Krenair) https://gerrit.wikimedia.org/r/#/c/334177/ https://gerrit.wikimedia.org/r/#/c/334180/ I think it doesn't like the whitespace between the task line... [22:55:59] I 'earned' my t-shirt by fixing a line ordering bug in scap. It's edge cases all the way down. [22:56:23] So, extract2 is fixed. Now for FlaggedRevs I guess... [22:56:31] * ostriches glares at Reedy [22:56:35] ostriches: it must be a more hidden diff, extract2 was the same, i actually copied both and ran diff [22:57:09] It's something wrong about mwdebug [22:57:10] 1002 [22:57:22] I'm betting one of the piles of symlinks is mis-chained [22:57:27] does anyone know why it was called extract2 instead of extract? [22:57:41] I think there was an extract.php a long time ago [22:57:51] then someone forked it? :) [22:57:59] Probably [22:58:09] Sounds like something someone around here would do [22:58:23] rsync /srv/ mwdebug1001 to 1002 ? [22:58:30] to "reset" it [22:58:41] I'm actually kind of curious [22:58:50] Maybe mwdebug1002 is actually *correct* and we're lucky right now? [22:58:58] Like, the whole thing *should* be broken?
[22:58:59] :D [22:59:55] so my plan for https://gerrit.wikimedia.org/r/#/c/298397/2 - take a copy of filebackend-production.php, apply commit, rename my copy to filebackend-production.php, sync-dir, remove my copy, sync-dir? [23:00:43] ostriches, I think it sounds like what someone around here might've done in 2010 :) [23:00:53] api upload is busted in latest branch [23:00:54] https://phabricator.wikimedia.org/P4807 [23:02:13] File not found: /srv/mediawiki/docroot/wikisource.org/w/../multiversion/MWMultiVersion.php in /srv/mediawiki/docroot/wikisource.org/w/favicon.php on line 2 [23:02:16] And not on mwdebug [23:02:18] That's concerning. [23:02:24] Is this all about to fall over? [23:02:30] <_joe_> ostriches: I guess so [23:03:51] What happened to my home directory on tin? [23:04:01] I used to have a load of files in here... [23:04:13] Krenair: I stole them [23:05:28] drwxr-xr-x 3 krenair wikidev 4096 Jan 25 19:16 . [23:05:39] 07Puppet, 06Labs, 10Labs-Infrastructure: Puppet failure on instance creation - https://phabricator.wikimedia.org/T156297#2970732 (10chasemp) @DeltaQuad ok thank you, @Andrew can you look at this when you get a second? I'm not sure what's going on at the moment in this project. [23:05:43] _joe_, any idea what happened there? [23:06:20] <_joe_> Krenair: tin was reimaged, those files should be in a backup [23:06:28] when was it last reimaged? [23:06:31] <_joe_> Krenair: ask moritzm tomorrow [23:06:40] <_joe_> Krenair: a few months back? [23:06:46] <_joe_> I don't remember [23:06:56] would've thought I'd've replaced my usual scripts since then [23:07:11] I don't need most of the files, I can recover the one I was planning to use from elsewhere [23:08:50] just thought it was weird they all disappeared. [23:09:42] ostriches, yep, that file (/srv/mediawiki/docroot/wikisource.org/multiversion/MWMultiVersion.php) doesn't exist anywhere [23:10:02] Yes, when you look at it that way, of course. 
[23:10:15] /srv/mediawiki/docroot/wikisource.org/multiversion is missing [23:10:24] It's not supposed to be there.... [23:10:46] The whole thing should be exploding...everywhere [23:10:52] (we don't want multiversion in the docroot) [23:11:01] (03PS1) 10Madhuvishy: labstore: Install package nethogs from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/334218 [23:11:25] This pile of symlinks is fragile. [23:11:39] so it's missing an extra ../../ to escape the docroot? [23:12:01] How are any of the entry points working? [23:12:41] krenair@tin:~$ cat /srv/mediawiki/docroot/wikisource.org/w/favicon.php [23:12:41] require_once __DIR__ . '/../multiversion/MWMultiVersion.php'; [23:13:22] same for index.php [23:13:24] so that is a very good question [23:13:34] This should be broken. Everywhere. [23:13:54] woah wait what? [23:13:55] hang on [23:14:11] krenair@tin:~$ ls -l /srv/mediawiki/docroot/wikisource.org/w/../multiversion/MWMultiVersion.php [23:14:11] -rw-r--r-- 1 mwdeploy mwdeploy 14053 Jan 18 18:34 /srv/mediawiki/docroot/wikisource.org/w/../multiversion/MWMultiVersion.php [23:14:11] krenair@tin:~$ ls -l /srv/mediawiki/docroot/wikisource.org/multiversion/MWMultiVersion.php [23:14:11] ls: cannot access /srv/mediawiki/docroot/wikisource.org/multiversion/MWMultiVersion.php: No such file or directory [23:14:38] w is a symlink to somewhere that has a parent directory containing multiversion/MWMultiVersion.php ? [23:14:38] F'ing symlinks... [23:14:48] Yep [23:14:59] docroot/foo/w symlinks w/ [23:15:02] lrwxrwxrwx 1 mwdeploy mwdeploy 16 Nov 17 16:49 /srv/mediawiki/docroot/wikisource.org/w -> /srv/mediawiki/w [23:16:24] well [23:16:26] that explains that then [23:17:18] except [23:17:27] https://wikisource.org/w/favicon.php [23:17:34] shows a fatal error [23:17:44] or did earlier, now it's gone [23:17:56] ah, it does *sometimes*. scary. 
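Note on the two `ls -l` results above: they differ because the kernel follows the `w` symlink before applying `..`, while a purely textual cleanup of the path cancels `w/..` without ever looking at the link. The sketch below rebuilds a toy copy of that layout in a temporary directory to show the difference. The directory names mirror the log, but the helper names `build_layout` and `compare` are made up for illustration, and nothing here models how HHVM itself resolves or caches paths. It assumes a POSIX system where creating symlinks is allowed.

```python
import os
import tempfile


def build_layout(root):
    """Create a toy version of the layout from the log under `root`.

    Returns the path <root>/docroot/wikisource.org/w/../multiversion/MWMultiVersion.php
    (illustrative only; not the real server tree).
    """
    os.makedirs(os.path.join(root, "w"))
    os.makedirs(os.path.join(root, "multiversion"))
    os.makedirs(os.path.join(root, "docroot", "wikisource.org"))
    with open(os.path.join(root, "multiversion", "MWMultiVersion.php"), "w") as f:
        f.write("<?php\n")
    # docroot/wikisource.org/w -> <root>/w, like the symlink shown by ls -l
    os.symlink(
        os.path.join(root, "w"),
        os.path.join(root, "docroot", "wikisource.org", "w"),
    )
    return os.path.join(
        root, "docroot", "wikisource.org", "w", "..",
        "multiversion", "MWMultiVersion.php",
    )


def compare(path):
    """Return (exists_via_kernel, exists_after_lexical_normalisation)."""
    # The kernel resolves the symlink first, so ".." is the parent of <root>/w.
    via_kernel = os.path.exists(path)
    # normpath cancels "w/.." as text, so the path stays inside the docroot.
    via_text = os.path.exists(os.path.normpath(path))
    return via_kernel, via_text


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(compare(build_layout(tmp)))
```

Any code that normalises the path as text before touching the filesystem would look for `docroot/wikisource.org/multiversion/`, which is the directory reported missing in the log.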
[23:18:25] (03PS2) 10Madhuvishy: labstore: Install package nethogs from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/334218 [23:18:26] It failed with server:mw1185.eqiad.wmnet [23:18:39] It was fine with server:mw1268.eqiad.wmnet [23:18:52] We really need to unwind some of these layers of symlinks further. [23:19:01] Each layer of indirection adds complexity [23:19:34] curl -H "Host: wikisource.org" http://mw1185/w/favicon.php [23:19:34] vs. [23:19:45] curl -H "Host: wikisource.org" http://mw1268/w/favicon.php [23:19:52] they are both consistent, but different from each other [23:20:54] My only thought is maybe a couple of apaches are weirdly out of sync [23:21:26] Like, does a scap pull fix 1185? [23:21:52] the thing is, '/srv/mediawiki/docroot/wikisource.org/w -> /srv/mediawiki/w' is the same on both of the hosts I looked at [23:23:17] and shell on mw1185 shows /srv/mediawiki/docroot/wikisource.org/w/../multiversion/MWMultiVersion.php definitely exists [23:23:33] so why does php on it return File not found: /srv/mediawiki/docroot/wikisource.org/w/../multiversion/MWMultiVersion.php [23:24:08] I have no freaking clue... [23:24:31] I do see different kernel versions but somehow I doubt that's it [23:28:14] both servers have var_dump( file_exists( '/srv/mediawiki/docroot/wikisource.org/w/../multiversion/MWMultiVersion.php' ) ) returning false, yet one of them is apparently OK with this? [23:29:04] is it possible this stuff can come from cache? [23:29:13] That's my guess just now [23:29:21] Something stuck in hhvm? [23:30:06] I'm not much of an hhvm expert but I have seen a .hhvm.hhbc in my home directory and assume there's an equivalent for www-data somewhere [23:30:25] I'm also curious...as a more permanent fix...could we move w/ to std-docroot directly rather than have it in the root? Then it'd be ../..
[23:30:31] Less ambiguous [23:31:55] krenair@mw1185:~$ ls -l /var/cache/hhvm/hhvm.hhbc [23:31:55] -rw-r--r-- 1 www-data www-data 4109312 Oct 5 12:03 /var/cache/hhvm/hhvm.hhbc [23:32:02] krenair@mw1268:~$ ls -l /var/cache/hhvm/hhvm.hhbc [23:32:02] -rw-r--r-- 1 www-data www-data 4110336 Jun 10 2016 /var/cache/hhvm/hhvm.hhbc [23:32:22] So both are months old [23:32:27] yes [23:32:54] ugh, how do I get service status without full sudo on these things [23:33:18] Good question [23:33:34] ps ? :P [23:33:45] yeah I was about to resort to that [23:33:51] privileges: ['ALL = (www-data,apache,mwdeploy,l10nupdate) NOPASSWD: ALL', [23:33:54] 61 'ALL = NOPASSWD: /sbin/restart hhvm', [23:33:56] 62 'ALL = NOPASSWD: /sbin/start hhvm', [23:34:22] anything as apache, but no "status" [23:34:23] mutante: those don't work in jessie, you could deploy https://gerrit.wikimedia.org/r/#/c/312705/ :) [23:34:31] hhvm start 24th Jan on mw1185 [23:34:52] 22nd on mw1268 [23:35:27] can't help but feel I'm shooting in the dark here [23:35:40] Restart hhvm on busted node and see if it fixes it? [23:35:42] Can't hurt. [23:35:48] ok [23:36:25] krenair@mw1185:~$ sudo /sbin/restart hhvm [23:36:25] We trust you have received the usual lecture from the local System [23:36:26] hm [23:36:46] * ebernhardson predicts next message: cannot access /sbin/restart: No such file or directory [23:37:02] * Krenair sighs [23:37:07] krenair@mw1185:~$ ls -l /sbin/restart [23:37:07] ls: cannot access /sbin/restart: No such file or directory [23:37:08] yup [23:37:17] i put a patch up in october, it just hasn't been merged *shrug* [23:37:18] Krenair: sudo -u? [23:37:19] service hhvm restart , want me to do it? [23:37:21] so we have the sudo line, it just allows us to do something that doesn't exist [23:37:42] ebernhardson, link? [23:37:50] Krenair: about 10 lines ^ [23:37:58] ah, ty [23:38:12] mutante, yes please [23:38:26] mutante, did you do it? 
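Note on the sudo rules quoted above: the rule granted `/sbin/restart hhvm`, but that binary did not exist on the host, so the permission was unusable. A rough way to catch this kind of drift is to pull the command path out of each rule and check it against the filesystem. The sketch below does that for rule strings shaped like the ones pasted in the log. The function names are hypothetical, the parsing is deliberately naive (it only handles a single `NOPASSWD:` command per rule), and it is not a sudoers parser.

```python
import os


def sudo_commands(privileges):
    """Yield the executable path from each rule of the form '... NOPASSWD: /path args'.

    Rules whose command is not an absolute path (for example 'ALL') are skipped.
    """
    for rule in privileges:
        _, sep, command = rule.partition("NOPASSWD:")
        words = command.split()
        if sep and words and words[0].startswith("/"):
            yield words[0]


def missing_commands(privileges, exists=os.path.exists):
    """Return the sorted command paths that the rules allow but `exists` rejects."""
    return sorted({cmd for cmd in sudo_commands(privileges) if not exists(cmd)})


if __name__ == "__main__":
    # Rule strings copied from the log; the check runs against the local machine.
    rules = [
        "ALL = (www-data,apache,mwdeploy,l10nupdate) NOPASSWD: ALL",
        "ALL = NOPASSWD: /sbin/restart hhvm",
        "ALL = NOPASSWD: /sbin/start hhvm",
    ]
    print(missing_commands(rules))
```

The `exists` parameter is injectable so the check can be exercised without depending on what is installed locally.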
[23:38:34] it looks like stuff on that node got fixed [23:38:55] !log mw1185, mw1268 - service hhvm restart [23:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:01] Yep, mw1185 looks good now [23:39:11] mw1268 was already OK [23:39:13] mwdebug1002 was busted earlier, should do that one [23:39:59] !log mwdebug1002 - service hhvm restart [23:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:10] And fixed [23:40:21] Ok, so it was something in hhvm that was badly cached [23:40:38] (03CR) 10Alex Monk: [C: 031] "Let's address Giuseppe's comments and get this done. It was needed today." [puppet] - 10https://gerrit.wikimedia.org/r/312705 (https://phabricator.wikimedia.org/T146656) (owner: 10EBernhardson) [23:41:18] (03PS6) 10EBernhardson: Update mwdeploy group sudo rights for jessie [puppet] - 10https://gerrit.wikimedia.org/r/312705 (https://phabricator.wikimedia.org/T146656) [23:41:24] ostriches, I did think about sudo -u btw, but figured none of the users we can run all commands under would be able to mess with services in that way [23:41:58] I'm wondering if we should do a rolling hhvm restart [23:42:04] only root, which we lacked the correct sudo line for that operating system for [23:42:36] (03CR) 10Alex Monk: [C: 031] Update mwdeploy group sudo rights for jessie [puppet] - 10https://gerrit.wikimedia.org/r/312705 (https://phabricator.wikimedia.org/T146656) (owner: 10EBernhardson) [23:42:50] (03CR) 10Dzahn: [C: 031] "link has been added, thanks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334157 (https://phabricator.wikimedia.org/T156281) (owner: 10Urbanecm) [23:43:23] Krenair: So...actionables? 
[23:43:35] 1) We should try and unwind some more of these symlinks, they're fragile and confusing [23:43:39] 06Operations, 10Traffic: convert wikitech-static.wikimedia.org to use LE rather than GS certificate - https://phabricator.wikimedia.org/T156294#2970836 (10Dzahn) It's the lack of "Bug: " [23:43:44] 2) Fix up the server restart sudo rules [23:43:50] 3) Do a rolling hhvm restart everywhere? [23:44:09] I don't have any more to add to that list [23:44:36] (2) already has a task and gerrit change. [23:44:39] I'll file a bug for (1) [23:45:09] I refreshed the page a bunch more [23:45:25] I don't think this particular issue is still hiding elsewhere in the servers assigned to user requests [23:45:55] doesn't mean there aren't other issues, or issues hiding in job runners, debug servers, or what have you [23:46:31] so (3) is worthy of consideration with ops [23:48:29] https://phabricator.wikimedia.org/T156319 for the mw-config stuff [23:49:04] I'm pretty sure we'd be seeing more of them if it were widespread [23:49:19] I'm a little worried these couple snuck by unnoticed, but I guess they were rare enough [23:49:54] (03PS1) 10Dzahn: icinga/wikitech-static: add contact_group for https monitor [puppet] - 10https://gerrit.wikimedia.org/r/334220 (https://phabricator.wikimedia.org/T156294) [23:51:03] (03CR) 10jerkins-bot: [V: 04-1] icinga/wikitech-static: add contact_group for https monitor [puppet] - 10https://gerrit.wikimedia.org/r/334220 (https://phabricator.wikimedia.org/T156294) (owner: 10Dzahn) [23:52:37] (03PS2) 10Dzahn: icinga/wikitech-static: add contact_group for https monitor [puppet] - 10https://gerrit.wikimedia.org/r/334220 (https://phabricator.wikimedia.org/T156294) [23:54:25] mutante, say aren't there existing wikitech-static monitors? [23:54:29] beyond https stuff? [23:54:35] like how up to date it is compared to wikitech? 
[23:54:43] yes, i checked that earlier [23:54:48] (03PS3) 10Dzahn: icinga/wikitech-static: add contact_group for https monitor [puppet] - 10https://gerrit.wikimedia.org/r/334220 (https://phabricator.wikimedia.org/T156294) [23:54:53] there was no host wikitech-static though [23:54:56] until rob added it today [23:55:04] so it never pinged it as a host [23:55:15] or had a HTTP(S) check [23:55:50] the check you are referring to is called "are wikitech and wt-static in sync" and that is applied on the wmf-side [23:55:50] right [23:55:55] yep [23:55:57] so on silver and labtestweb2001 [23:56:18] okay well labtestweb2001 probably shouldn't have that [23:56:21] (03CR) 10jerkins-bot: [V: 04-1] icinga/wikitech-static: add contact_group for https monitor [puppet] - 10https://gerrit.wikimedia.org/r/334220 (https://phabricator.wikimedia.org/T156294) (owner: 10Dzahn) [23:56:28] but you're saying if it failed today, that would go to the contacts for silver? [23:56:37] technically those should be in a single service group [23:56:53] yes [23:56:59] but "contacts for silver" = admins [23:57:02] = default for everything [23:57:03] right [23:57:15] admins also includes the IRC bots [23:57:18] I think we should consider moving that check to have the host wikitech-static [23:57:57] yes, well, since wikitech is always the source of the sync, right [23:58:20] well moving it should be easy now that there is a virtual host [23:58:25] yeah [23:59:55] btw I think jenkinsbot doesn't like the abusive whitespace your commit contains :)