[00:00:26] so Luke081515 my plan is to wait and try to get a local consensus on ur., and if not possible, to escalate the matter on a meta. RfC [00:00:33] yeah ok [00:00:43] was just a +1 for correct code ;) [00:06:54] 06Operations, 10Traffic, 06Zero, 13Patch-For-Review: Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two - https://phabricator.wikimedia.org/T124482#2237531 (10BBlack) This was merged around 2016-04-25 18:40 UTC, and legit caches that honor TTLs correctly should have all stopped handi... [00:20:59] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:21:43] (03CR) 10Thcipriani: [C: 04-1] "Problem inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 1020after4) [00:24:10] (03PS2) 10Dzahn: planet: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285307 [00:38:29] PROBLEM - puppet last run on mw2173 is CRITICAL: CRITICAL: puppet fail [01:06:01] RECOVERY - puppet last run on mw2173 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [01:09:00] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100% [01:09:40] !log reinstalling bast4001 w/ jessie [01:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:10:42] (03PS1) 10Faidon Liambotis: Revert "installserver: let bast4001 use carbon" [puppet] - 10https://gerrit.wikimedia.org/r/285323 [01:10:48] (03PS2) 10Faidon Liambotis: Revert "installserver: let bast4001 use carbon" [puppet] - 10https://gerrit.wikimedia.org/r/285323 [01:12:53] (03PS3) 10Faidon Liambotis: Revert "installserver: let bast4001 use carbon" [puppet] - 10https://gerrit.wikimedia.org/r/285323 [01:13:34] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Revert "installserver: let bast4001 use carbon" [puppet] - 10https://gerrit.wikimedia.org/r/285323 (owner: 10Faidon Liambotis) 
[01:18:00] PROBLEM - Apache HTTP on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:18:40] PROBLEM - HHVM rendering on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:19:09] PROBLEM - RAID on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:19:20] PROBLEM - SSH on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:19:36] (03PS1) 10Faidon Liambotis: network: kill bast4001 SLAAC from ::constants [puppet] - 10https://gerrit.wikimedia.org/r/285324 [01:19:40] PROBLEM - configured eth on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:19:40] PROBLEM - nutcracker port on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:20:02] PROBLEM - Check size of conntrack table on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:20:09] PROBLEM - dhclient process on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:20:10] PROBLEM - Disk space on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:20:30] PROBLEM - HHVM processes on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:20:40] PROBLEM - nutcracker process on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:20:40] PROBLEM - salt-minion processes on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:20:50] PROBLEM - DPKG on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:21:06] (03Abandoned) 10Faidon Liambotis: Remove IPv6 SLAAC addresses from network.pp etc. 
[puppet] - 10https://gerrit.wikimedia.org/r/214360 (owner: 10Faidon Liambotis) [01:30:40] RECOVERY - dhclient process on mw1131 is OK: PROCS OK: 0 processes with command name dhclient [01:30:49] RECOVERY - Check size of conntrack table on mw1131 is OK: OK: nf_conntrack is 0 % full [01:30:50] RECOVERY - Disk space on mw1131 is OK: DISK OK [01:31:10] RECOVERY - HHVM processes on mw1131 is OK: PROCS OK: 6 processes with command name hhvm [01:31:21] RECOVERY - salt-minion processes on mw1131 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:31:21] RECOVERY - nutcracker process on mw1131 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [01:31:40] RECOVERY - DPKG on mw1131 is OK: All packages OK [01:31:40] RECOVERY - HHVM rendering on mw1131 is OK: HTTP OK: HTTP/1.1 200 OK - 67790 bytes in 2.225 second response time [01:32:17] RECOVERY - RAID on mw1131 is OK: OK: no RAID installed [01:33:26] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.043 second response time [01:33:47] RECOVERY - puppet last run on mw1131 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [01:34:57] RECOVERY - SSH on mw1131 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [01:35:17] RECOVERY - configured eth on mw1131 is OK: OK - interfaces up [01:35:23] (03PS1) 10Faidon Liambotis: bastion: install mtr-tiny, cleanup [puppet] - 10https://gerrit.wikimedia.org/r/285327 [01:35:47] RECOVERY - nutcracker port on mw1131 is OK: TCP OK - 0.000 second response time on port 11212 [01:37:17] (03CR) 10Faidon Liambotis: [C: 032] bastion: install mtr-tiny, cleanup [puppet] - 10https://gerrit.wikimedia.org/r/285327 (owner: 10Faidon Liambotis) [01:37:39] (03PS2) 10Faidon Liambotis: network: kill bast4001 SLAAC from ::constants [puppet] - 10https://gerrit.wikimedia.org/r/285324 [01:45:07] RECOVERY - Host bast4001 is UP: PING OK - Packet loss = 0%, RTA = 74.81 
ms [01:49:38] PROBLEM - RAID on bast4001 is CRITICAL: Connection refused by host [01:49:57] PROBLEM - configured eth on bast4001 is CRITICAL: Connection refused by host [01:49:57] PROBLEM - salt-minion processes on bast4001 is CRITICAL: Connection refused by host [01:50:00] (03PS1) 10Faidon Liambotis: autoinstall/partman: switch bast4001 to lvm [puppet] - 10https://gerrit.wikimedia.org/r/285329 [01:50:16] PROBLEM - puppet last run on bast4001 is CRITICAL: Connection refused by host [01:50:37] PROBLEM - Check size of conntrack table on bast4001 is CRITICAL: Connection refused by host [01:50:46] (03CR) 10Faidon Liambotis: [C: 032 V: 032] autoinstall/partman: switch bast4001 to lvm [puppet] - 10https://gerrit.wikimedia.org/r/285329 (owner: 10Faidon Liambotis) [01:50:57] PROBLEM - dhclient process on bast4001 is CRITICAL: Timeout while attempting connection [01:51:06] PROBLEM - DPKG on bast4001 is CRITICAL: Timeout while attempting connection [01:51:17] PROBLEM - Disk space on bast4001 is CRITICAL: Timeout while attempting connection [01:53:26] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100% [02:12:33] RECOVERY - Host bast4001 is UP: PING WARNING - Packet loss = 37%, RTA = 75.29 ms [02:15:20] (03PS3) 10Faidon Liambotis: network: kill bast4001 SLAAC from ::constants [puppet] - 10https://gerrit.wikimedia.org/r/285324 [02:15:29] (03CR) 10Faidon Liambotis: [C: 032 V: 032] network: kill bast4001 SLAAC from ::constants [puppet] - 10https://gerrit.wikimedia.org/r/285324 (owner: 10Faidon Liambotis) [02:34:21] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2237817 (10faidon) [02:34:24] 06Operations, 13Patch-For-Review: reinstall bast4001 with jessie - https://phabricator.wikimedia.org/T123674#2237815 (10faidon) 05Open>03Resolved So, I noticed that bast4001's PXE ROM was hanging at either the DHCP or the TFTP step at random (the progress-character stopped 
moving, network interactions, DHC... [02:35:09] (03Abandoned) 10Faidon Liambotis: network: remove bast4001 SLAAC IPs [puppet] - 10https://gerrit.wikimedia.org/r/283361 (https://phabricator.wikimedia.org/T123674) (owner: 10Dzahn) [02:37:51] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.21) (duration: 19m 17s) [02:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:43:38] RECOVERY - Check size of conntrack table on bast4001 is OK: OK: nf_conntrack is 0 % full [02:43:49] RECOVERY - DPKG on bast4001 is OK: All packages OK [02:43:57] RECOVERY - Disk space on bast4001 is OK: DISK OK [02:43:57] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [02:44:39] RECOVERY - RAID on bast4001 is OK: OK: no RAID installed [02:44:58] RECOVERY - configured eth on bast4001 is OK: OK - interfaces up [02:45:07] RECOVERY - salt-minion processes on bast4001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:45:09] RECOVERY - dhclient process on bast4001 is OK: PROCS OK: 0 processes with command name dhclient [02:46:52] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Apr 26 02:46:52 UTC 2016 (duration 9m 1s) [02:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:35:41] 06Operations, 10DNS, 10Traffic, 13Patch-For-Review: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#2237848 (10yuvipanda) Yeah, I submitted a patch to upstream to bind to 127.0.0.1 instead, and things are all ok now for me. 
[03:47:45] (03PS4) 10Yuvipanda: labs: Add diamond collector for PDNS authoritative server [puppet] - 10https://gerrit.wikimedia.org/r/285206 [03:48:02] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: Add diamond collector for PDNS authoritative server [puppet] - 10https://gerrit.wikimedia.org/r/285206 (owner: 10Yuvipanda) [03:50:50] 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2237902 (10Dzahn) @Faidon regarding the puppet noise on newly upgraded install2001 that you mentioned. (due to ganglia aggregator being on jessie with systemd) fixed that wit... [03:57:01] PROBLEM - Disk space on elastic1001 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80761 MB (15% inode=99%) [04:00:35] 06Operations, 13Patch-For-Review: reinstall bast4001 with jessie - https://phabricator.wikimedia.org/T123674#2237943 (10Dzahn) Thank you very much Faidon! Yes, that was indeed just like with hooft. just that in esams i could work around it by simple picking another server that was idle and install that from th... 
[04:03:16] (03PS1) 10Yuvipanda: labs: Allow diamond to be root for pdns_control list [puppet] - 10https://gerrit.wikimedia.org/r/285330 [04:03:37] (03CR) 10jenkins-bot: [V: 04-1] labs: Allow diamond to be root for pdns_control list [puppet] - 10https://gerrit.wikimedia.org/r/285330 (owner: 10Yuvipanda) [04:03:45] (03PS2) 10Yuvipanda: labs: Allow diamond to be root for pdns_control list [puppet] - 10https://gerrit.wikimedia.org/r/285330 [04:05:16] (03CR) 10Yuvipanda: [C: 032] labs: Allow diamond to be root for pdns_control list [puppet] - 10https://gerrit.wikimedia.org/r/285330 (owner: 10Yuvipanda) [04:14:02] RECOVERY - Disk space on elastic1001 is OK: DISK OK [04:33:54] (03PS1) 10Yuvipanda: labs: Add diamond collector for pdns_recursor stats [puppet] - 10https://gerrit.wikimedia.org/r/285332 [04:34:25] (03PS2) 10Yuvipanda: labs: Add diamond collector for pdns_recursor stats [puppet] - 10https://gerrit.wikimedia.org/r/285332 [04:36:59] 06Operations: reinstall bast4001 with jessie - https://phabricator.wikimedia.org/T123674#2238066 (10Dzahn) [04:37:33] (03PS3) 10Yuvipanda: labs: Add diamond collector for pdns_recursor stats [puppet] - 10https://gerrit.wikimedia.org/r/285332 [04:39:52] (03PS4) 10Yuvipanda: labs: Add diamond collector for pdns_recursor stats [puppet] - 10https://gerrit.wikimedia.org/r/285332 [04:43:45] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: Add diamond collector for pdns_recursor stats [puppet] - 10https://gerrit.wikimedia.org/r/285332 (owner: 10Yuvipanda) [04:45:03] (03PS1) 10Dzahn: phragile: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285333 [04:47:02] (03PS2) 10Dzahn: phragile: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285333 [04:47:39] (03PS3) 10Dzahn: phragile: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285333 [04:50:30] 06Operations, 07Diamond, 07Upstream: Upstream our Diamond PowerDNSRecursorCollector - 
https://phabricator.wikimedia.org/T133643#2238070 (10yuvipanda) [04:57:39] (03PS1) 10Dzahn: testsystem: move role class to test::system [puppet] - 10https://gerrit.wikimedia.org/r/285334 [05:32:04] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 05codfw-rollout: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2238129 (10Smalyshev) 05Open>03Resolved a:03Smalyshev I think we won't do anything more with this one? [06:26:58] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:30:29] PROBLEM - puppet last run on mw1199 is CRITICAL: CRITICAL: puppet fail [06:30:49] PROBLEM - puppet last run on nobelium is CRITICAL: CRITICAL: puppet fail [06:30:49] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:59] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:32:19] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:29] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:49] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:32:59] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:33:19] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:04] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:44] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [06:35:52] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [06:35:52] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [06:37:14] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:40:32] PROBLEM - puppet last run on mw1133 is CRITICAL: CRITICAL: Puppet has 26 failures [06:42:03] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:42:03] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:42:12] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:50:33] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [06:51:53] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [06:52:02] !log restarting db2047 for reimaging [06:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:52:32] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [06:52:33] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [06:53:53] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Puppet has 2 failures [06:56:03] RECOVERY - puppet last run on nobelium is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:56:12] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:57:03] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:57:43] RECOVERY - puppet 
last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:43] RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:58:04] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:22] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:58:23] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:00:42] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:22] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:30] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:31] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:04:11] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [07:05:10] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave] [07:07:12] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [07:07:31] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [07:08:53] (03CR) 10Jcrespo: "This commit caused an outage." [puppet] - 10https://gerrit.wikimedia.org/r/285208 (owner: 10Faidon Liambotis) [07:09:11] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 [07:10:31] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:11:51] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [07:12:40] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [07:13:41] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:15:01] (03PS1) 10Giuseppe Lavagetto: mediawiki: remove zend/hhvm conditional from search.w.o [puppet] - 10https://gerrit.wikimedia.org/r/285339 [07:15:03] (03PS1) 10Giuseppe Lavagetto: mediawiki: remove zend/hhvm/conditional from www.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/285340 [07:18:59] (03PS1) 10Muehlenhoff: Remove access credentials for Moiz [puppet] - 10https://gerrit.wikimedia.org/r/285341 [07:19:00] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:20:11] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:20:20] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:20:21] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:21:10] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [07:21:18] (03CR) 10Muehlenhoff: [C: 04-1] "Only merge after the 29th" [puppet] - 10https://gerrit.wikimedia.org/r/285341 (owner: 10Muehlenhoff) [07:22:01] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [07:22:21] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [07:28:09] (03CR) 10Muehlenhoff: "Yep, that rsyncd was added in https://gerrit.wikimedia.org/r/#/c/283663/ which you reviewed six days ago :-)" [puppet] - 10https://gerrit.wikimedia.org/r/285160 (owner: 10Muehlenhoff) [07:28:40] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:29:51] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:30:50] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [07:31:02] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [07:32:00] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [07:38:18] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:38:28] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:39:08] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:40:38] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:51:57] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [07:55:27] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [07:55:36] gehel: bonjour! [07:55:54] about hhvm on terbium looks like my task https://phabricator.wikimedia.org/T133522 is a dupe of the one asking to upgrade hhvm terbium on https://phabricator.wikimedia.org/T132751 [07:56:28] so I am reopening the old one and marking mine as a dupe [07:57:32] 06Operations, 07HHVM: upgrade HHVM to 3.12.1 on terbium - https://phabricator.wikimedia.org/T132751#2238336 (10hashar) 05Resolved>03Open a:05Joe>03Gehel From T133522, HHVM hasn't been restarted and is stuck to some old version. Seems @Gehel is going to take care of it >>! In T133522#2234608, @Gehel wr... [07:57:45] 06Operations, 07HHVM: upgrade HHVM to 3.12.1 on terbium - https://phabricator.wikimedia.org/T132751#2238342 (10hashar) [07:58:14] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 2 others: upgrade HHVM to 3.12.1 on terbium (Elastica: missing curl_init_pooled method.) 
- https://phabricator.wikimedia.org/T132751#2238345 (10hashar) [07:58:27] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:58:28] done! [08:00:37] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [08:02:07] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [08:02:07] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [08:05:14] 06Operations, 10Traffic: Fix apache-2.4 + DHE ciphersuites issue - https://phabricator.wikimedia.org/T133217#2238375 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [08:06:49] !log CI jobs deadlocked due to castor being unavailable | https://phabricator.wikimedia.org/T133652 [08:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:08:27] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:08:37] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:08:37] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:11:18] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:16:48] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [08:17:07] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [08:17:07] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [08:17:27] (03CR) 10jenkins-bot: [V: 04-1] phragile: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285333 (owner: 10Dzahn) [08:18:43] (03CR) 10jenkins-bot: [V: 04-1] testsystem: move role class to test::system [puppet] - 10https://gerrit.wikimedia.org/r/285334 (owner: 10Dzahn) [08:19:32] (03PS2) 10Muehlenhoff: Add ferm service for debug proxy [puppet] - 10https://gerrit.wikimedia.org/r/283606 [08:19:39] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [08:26:08] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:27:22] !log stopping db2068 for cloning to db2047 [08:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:27:37] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:27:39] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:27:39] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:28:17] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [08:31:57] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [08:32:08] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [08:38:18] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:38:28] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:39:08] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:43:16] (03PS1) 10Jcrespo: Repool db1052 (old s1-master) with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285344 (https://phabricator.wikimedia.org/T125028) [08:43:18] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [08:43:39] (03CR) 10jenkins-bot: [V: 04-1] Repool db1052 (old s1-master) with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285344 (https://phabricator.wikimedia.org/T125028) (owner: 10Jcrespo) [08:44:38] !log restarting elasticsearch server elastic2007.codfw.wmnet - activating unicast (T110236) [08:44:39] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [08:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:45:21] !log Most of CI is down / deadlocked due to wmflabs being unresponsive T133654 [08:45:22] T133654: wmflabs OpenStack is deadlocked (can't boot or delete instances) - https://phabricator.wikimedia.org/T133654 [08:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:48:01] (03CR) 10Jcrespo: [C: 032 V: 032] Repool db1052 (old s1-master) with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285344 (https://phabricator.wikimedia.org/T125028) (owner: 10Jcrespo) [08:48:05] (03PS1) 10Gehel: Depooled wdqs1001 during reinstall [puppet] - 10https://gerrit.wikimedia.org/r/285345 (https://phabricator.wikimedia.org/T133566) [08:48:23] (03CR) 10jenkins-bot: [V: 04-1] Repool db1052 (old s1-master) with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285344 (https://phabricator.wikimedia.org/T125028) (owner: 10Jcrespo) [08:48:28] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] mediawiki: remove zend/hhvm conditional from search.w.o [puppet] - 10https://gerrit.wikimedia.org/r/285339 (owner: 10Giuseppe 
Lavagetto) [08:49:57] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:50:21] !log restarted nova-conductor & scheduler on labcontrol1001 for T133654 [08:50:22] T133654: wmflabs OpenStack is deadlocked (can't boot or delete instances) - https://phabricator.wikimedia.org/T133654 [08:50:27] hashar: ^ try now? [08:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:51:08] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [08:51:12] 29717 nova 20 0 2931476 2.605g 6284 R 99.5 5.5 0:47.17 nova-scheduler [08:51:17] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [08:51:17] interesting [08:51:18] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [08:51:31] (03PS2) 10Gehel: Depooled wdqs1001 during reinstall [puppet] - 10https://gerrit.wikimedia.org/r/285345 (https://phabricator.wikimedia.org/T133566) [08:51:48] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [08:51:59] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1052 (old s1-master) with low weight (duration: 00m 31s) [08:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:52:24] (03CR) 10Gehel: [C: 032 V: 032] Depooled wdqs1001 during reinstall [puppet] - 10https://gerrit.wikimedia.org/r/285345 (https://phabricator.wikimedia.org/T133566) (owner: 10Gehel) [08:52:29] (03CR) 10Mobrovac: [C: 04-1] Automate the generation deployment keys (keyholder-managed ssh keys) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 1020after4) [08:53:01] <_joe_> gehel: can I merge your patch? [08:53:11] _joe_: I was going to ask the same [08:53:17] yes, go ahead, it is trivial [08:53:22] <_joe_> gehel: done [08:53:27] thanks! 
[08:54:27] nova-scheduler has settled down [08:54:40] hashar: and I think I managed to successfully create a new instance [08:54:41] <_joe_> gehel: actually, it's doing a full git gc [08:54:57] <_joe_> now it's properly done [08:55:01] YuviPanda: yeah it is apparently back thank you ! [08:55:06] _joe_: You see, Java is not the only one... [08:55:15] YuviPanda: I think you can close https://phabricator.wikimedia.org/T133654 safely now! [08:55:27] hashar: np! thanks for bringing it to my notice! [08:55:52] _joe_: on a different subject, can I just restart HHVM on terbium? I don't really know what's using it... [08:56:23] !log depooling wdqs1001 from varnish for reinstall [08:56:28] heh, labcontrol is being hammered with new instance creation now [08:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:59:45] 06Operations, 10Continuous-Integration-Infrastructure, 06Labs, 10Labs-Infrastructure, 10Monitoring: Have a paging check for Nova API accessible - https://phabricator.wikimedia.org/T133656#2238563 (10yuvipanda) [09:00:04] hashar: I filed ^ to make a paging check for this so we notice sooner next time [09:00:54] (03PS2) 10Giuseppe Lavagetto: mediawiki: remove zend/hhvm/conditional from www.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/285340 [09:02:44] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] mediawiki: remove zend/hhvm/conditional from www.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/285340 (owner: 10Giuseppe Lavagetto) [09:02:58] * YuviPanda is going afk again [09:03:40] !log restarting db1069 s1 instance- one query is "stuck", creating lag on labs [09:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:04:16] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:05:05] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:06:16] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [09:06:45] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:06:56] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [09:08:36] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [09:09:32] !log citoid deployed 36c2bf02 [09:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:09:45] PROBLEM - puppet last run on mw1133 is CRITICAL: CRITICAL: puppet fail [09:11:53] !log CI is back up! [09:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:12:43] (03PS1) 10Gehel: Modify partitions to reflect new disk added in WDQS nodes [puppet] - 10https://gerrit.wikimedia.org/r/285353 (https://phabricator.wikimedia.org/T133566) [09:12:48] <_joe_> !log restarted hhvm on mw1133, almost OOM [09:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:15:56] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:17:36] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:18:56] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:19:26] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:20:15] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [09:24:28] 06Operations, 10Continuous-Integration-Infrastructure, 06Labs, 10Labs-Infrastructure, 10Monitoring: Have a paging check for Nova API accessible - https://phabricator.wikimedia.org/T133656#2238670 (10hashar) [09:26:36] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:30:53] (03PS2) 10Gehel: Modify partitions to reflect new disk added in WDQS nodes [puppet] - 10https://gerrit.wikimedia.org/r/285353 (https://phabricator.wikimedia.org/T133566) [09:31:05] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [09:31:06] RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [09:31:55] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [09:32:17] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [09:32:45] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [09:33:15] (03CR) 10Gehel: [C: 032] Modify partitions to reflect new disk added in WDQS nodes [puppet] - 10https://gerrit.wikimedia.org/r/285353 (https://phabricator.wikimedia.org/T133566) (owner: 10Gehel) [09:40:01] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 13Patch-For-Review: Reinstall and data reload of WDQS servers - https://phabricator.wikimedia.org/T133566#2238709 (10Gehel) >>! In T133566#2236346, @Smalyshev wrote: > Planned sequence: > > # [X] (day before) Send email to the wikidata lis... [09:41:45] !log uploaded apache 2.4.10-10+deb8u4+wmf1 to carbon [09:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:41:51] 06Operations, 10Traffic: Fix apache-2.4 + DHE ciphersuites issue - https://phabricator.wikimedia.org/T133217#2238726 (10MoritzMuehlenhoff) apache 2.4.10-10+deb8u4+wmf1 has been built against openssl 1.0.2 and uploaded to carbon. I'll update this bug once all existing jessie systems are upgraded. [09:42:34] (03PS1) 10Muehlenhoff: Update TODO [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/285358 [09:43:37] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:44:06] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:45:46] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [09:46:07] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [09:47:11] !log restarting elasticsearch server elastic2008.codfw.wmnet - activating unicast (T110236) [09:47:12] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [09:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:48:12] (03PS1) 10Dereckson: Switch to short syntax array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285361 [09:48:42] (03CR) 10jenkins-bot: [V: 04-1] Switch to short syntax array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285361 (owner: 10Dereckson) [09:50:10] !log starting reinstall of wdqs1001 (T133566) [09:50:11] T133566: Reinstall and data reload of WDQS servers - https://phabricator.wikimedia.org/T133566 [09:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:51:39] !log restarting db2068 for upgrade before returning from maintenance [09:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:52:19] (03PS2) 10ArielGlenn: pass through verbose option from caller to XmlDump jobs [dumps] - 10https://gerrit.wikimedia.org/r/285244 [09:53:25] PROBLEM - Host wdqs1001 is DOWN: PING CRITICAL - Packet loss = 100% [09:53:53] ^ disregard, forgot to disable alerting... [09:54:01] (03CR) 10ArielGlenn: [C: 032] pass through verbose option from caller to XmlDump jobs [dumps] - 10https://gerrit.wikimedia.org/r/285244 (owner: 10ArielGlenn) [09:55:06] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:55:57] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:56:25] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:56:45] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:57:38] (03PS1) 10Elukey: Modify the default Varnish error page to increase visibility of error messages. [puppet] - 10https://gerrit.wikimedia.org/r/285363 (https://phabricator.wikimedia.org/T76348) [09:57:40] (03PS1) 10Elukey: Add a maintenance flag to cache::misc directors. [puppet] - 10https://gerrit.wikimedia.org/r/285364 (https://phabricator.wikimedia.org/T76348) [09:58:05] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [09:58:29] (03PS1) 10ArielGlenn: get_stub_files expects an int arg for file part number [dumps] - 10https://gerrit.wikimedia.org/r/285365 [09:59:18] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [09:59:24] (03CR) 10jenkins-bot: [V: 04-1] Add a maintenance flag to cache::misc directors. [puppet] - 10https://gerrit.wikimedia.org/r/285364 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [10:01:07] (03PS2) 10ArielGlenn: get_stub_files expects an int arg for file part number [dumps] - 10https://gerrit.wikimedia.org/r/285365 [10:01:12] (03PS2) 10Elukey: Add a maintenance flag to cache::misc directors. [puppet] - 10https://gerrit.wikimedia.org/r/285364 (https://phabricator.wikimedia.org/T76348) [10:04:40] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:06:32] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:09:50] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [10:11:00] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [10:12:20] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [10:13:00] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [10:17:31] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:17:51] RECOVERY - Host wdqs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [10:18:31] PROBLEM - Apache HTTP on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:18:40] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:18:50] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:19:31] PROBLEM - HHVM rendering on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:19:40] PROBLEM - SSH on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:19:40] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:20:10] PROBLEM - RAID on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:20:13] (03PS4) 10Hashar: contint: drop integration/phpcs [puppet] - 10https://gerrit.wikimedia.org/r/285226 [10:20:21] PROBLEM - DPKG on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:20:40] someone working on 1143 ? [10:20:41] PROBLEM - nutcracker port on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:20:42] (03CR) 10Hashar: "Need to keep integration/phpunit which is needed by operations/mediawiki-config :(" [puppet] - 10https://gerrit.wikimedia.org/r/285226 (owner: 10Hashar) [10:20:50] PROBLEM - configured eth on mw1143 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[10:21:01] PROBLEM - puppet last run on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:21:20] PROBLEM - salt-minion processes on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:21:21] PROBLEM - HHVM processes on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:21:41] PROBLEM - Check size of conntrack table on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:21:50] PROBLEM - nutcracker process on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:22:30] RECOVERY - DPKG on mw1143 is OK: All packages OK [10:22:41] RECOVERY - nutcracker port on mw1143 is OK: TCP OK - 0.000 second response time on port 11212 [10:25:10] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [10:25:12] PROBLEM - Disk space on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:27:34] strange. network issue? mgmt too slow [10:27:36] (03PS4) 10Hashar: contint: decouple slave_scripts and composer [puppet] - 10https://gerrit.wikimedia.org/r/285235 (https://phabricator.wikimedia.org/T128092) [10:28:05] I can't access the serial console either, getting "Disconnected gtom UNKNOWN port 0" after a while [10:28:10] (03CR) 10Hashar: [C: 031] "I have moved PHPUnit to its own class as well." [puppet] - 10https://gerrit.wikimedia.org/r/285235 (https://phabricator.wikimedia.org/T128092) (owner: 10Hashar) [10:28:11] I can [10:28:19] but it took a lot of time [10:28:28] for me too [10:28:31] but everything seems normal [10:29:11] PROBLEM - dhclient process on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:29:11] PROBLEM - DPKG on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:29:24] (03PS3) 10Muehlenhoff: Add ferm service for debug proxy [puppet] - 10https://gerrit.wikimedia.org/r/283606 [10:29:31] PROBLEM - nutcracker port on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:29:41] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [10:29:50] I was wrong, login timeouts [10:29:51] jynus: could be the same weird issue that we saw recently for the jobrunners getting stuck somehow [10:30:09] well, not necessarily weird, an OOM [10:30:16] <_joe_> it's just an oom [10:30:32] :-) [10:30:32] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [10:30:41] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [10:30:46] all right :) [10:30:49] but can no longer ssh or com2 [10:31:00] me neither [10:31:03] _joe_, wanna do something or soft reboot? [10:31:14] <_joe_> reboot [10:31:18] doing [10:31:21] <_joe_> thanks [10:31:52] !log restarting mw1143 as it is no longer accessible [10:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:32:00] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:32:06] acking alerts too [10:32:19] (03PS1) 10Giuseppe Lavagetto: mediawiki::web: drop the HHVM/Zend conditionals from beta sites [puppet] - 10https://gerrit.wikimedia.org/r/285366 (https://phabricator.wikimedia.org/T126310) [10:32:21] (03PS1) 10Giuseppe Lavagetto: mediawiki::web: drop HHVM define/Zend conditionals in all vhosts [puppet] - 10https://gerrit.wikimedia.org/r/285367 (https://phabricator.wikimedia.org/T126310) [10:32:23] (03PS1) 10Giuseppe Lavagetto: mediawiki::web: drop HHVM define, explicitly block php [puppet] - 10https://gerrit.wikimedia.org/r/285368 (https://phabricator.wikimedia.org/T126310) [10:32:24] <_joe_> they'll recover anyway [10:32:42] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm service for debug proxy [puppet] - 10https://gerrit.wikimedia.org/r/283606 (owner: 10Muehlenhoff) [10:32:56] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [10:34:08] the host is not actually ackable, as ping still worked [10:35:27] RECOVERY - HHVM 
processes on mw1143 is OK: PROCS OK: 2 processes with command name hhvm [10:35:35] RECOVERY - nutcracker port on mw1143 is OK: TCP OK - 0.000 second response time on port 11212 [10:35:36] RECOVERY - DPKG on mw1143 is OK: All packages OK [10:35:46] RECOVERY - nutcracker process on mw1143 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [10:36:15] RECOVERY - HHVM rendering on mw1143 is OK: HTTP OK: HTTP/1.1 200 OK - 68219 bytes in 6.469 second response time [10:36:16] RECOVERY - Disk space on mw1143 is OK: DISK OK [10:36:33] running sync-common [10:36:37] RECOVERY - configured eth on mw1143 is OK: OK - interfaces up [10:36:37] RECOVERY - puppet last run on mw1143 is OK: OK: Puppet is currently enabled, last run 50 minutes ago with 0 failures [10:36:45] RECOVERY - RAID on mw1143 is OK: OK: no RAID installed [10:36:56] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.088 second response time [10:37:04] (I do not think anything was deployed during the issues, but anyway) [10:37:05] RECOVERY - SSH on mw1143 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [10:37:16] RECOVERY - Check size of conntrack table on mw1143 is OK: OK: nf_conntrack is 7 % full [10:37:16] RECOVERY - dhclient process on mw1143 is OK: PROCS OK: 0 processes with command name dhclient [10:37:25] RECOVERY - salt-minion processes on mw1143 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:38:09] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update TODO [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/285358 (owner: 10Muehlenhoff) [10:38:29] (03PS3) 10Muehlenhoff: Enable base::firewall for hassaleh/hassium [puppet] - 10https://gerrit.wikimedia.org/r/283607 [10:42:15] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:42:46] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:43:07] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:43:54] (03CR) 10Dereckson: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285361 (owner: 10Dereckson) [10:44:16] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [10:44:56] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [10:45:16] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [10:46:11] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall for hassaleh/hassium [puppet] - 10https://gerrit.wikimedia.org/r/283607 (owner: 10Muehlenhoff) [10:51:26] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:51:55] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:51:55] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:51:57] (03PS1) 10Muehlenhoff: Remove duplicate definition of service [puppet] - 10https://gerrit.wikimedia.org/r/285369 [10:52:39] (03CR) 10Muehlenhoff: [C: 032 V: 032] Remove duplicate definition of service [puppet] - 10https://gerrit.wikimedia.org/r/285369 (owner: 10Muehlenhoff) [10:53:05] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:53:15] PROBLEM - puppet last run on hassium is CRITICAL: CRITICAL: puppet fail [10:55:26] RECOVERY - puppet last run on hassium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:59:35] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [11:02:22] 06Operations, 06Research-and-Data-Backlog, 10Research-management, 06Revision-Scoring-As-A-Service, and 3 others: [Epic] Deploy Revscoring/ORES service in Prod - https://phabricator.wikimedia.org/T106867#2238896 (10mark) [11:02:35] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 13Patch-For-Review: Reinstall and data reload of WDQS servers - https://phabricator.wikimedia.org/T133566#2238897 (10Gehel) While rebuilding the RAID to add new disks, I realized wdqs1001 has 2x 300GB + 2x 150GB disks. I'm reinstalling anyw... [11:03:24] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [11:04:33] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [11:05:03] elukey, server board for memory usage: https://grafana.wikimedia.org/dashboard/db/server-board?panelId=14&fullscreen&from=1461662458186&to=1461668404858&var-server=mw1143&var-network=eth0 [11:06:25] jynus: thanks! 
[11:07:54] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [11:08:20] jynus: very nice ramp up https://grafana.wikimedia.org/dashboard/db/server-board?panelId=14&fullscreen&from=1461662458186&to=1461668404858&var-server=mw1143&var-network=eth0 [11:09:42] I can see the same behavior on mw1148, but not really the same on all the job runners [11:10:13] (brb lunch) [11:13:08] 06Operations, 10Monitoring, 05codfw-rollout, 03codfw-rollout-Jan-Mar-2016: Add RAID monitoring for HP servers - https://phabricator.wikimedia.org/T97998#2238921 (10jcrespo) [11:13:10] 06Operations, 10Monitoring: Refactor RAID checks (check-raid) - https://phabricator.wikimedia.org/T84050#2238920 (10jcrespo) [11:15:33] (03PS1) 10Dereckson: Reformatted comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285373 [11:16:54] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:33] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:43] 06Operations, 10DBA, 13Patch-For-Review: Reimage db2047 - check for hardware errors - https://phabricator.wikimedia.org/T132011#2238950 (10jcrespo) a:05jcrespo>03None After reimage, I do not see anything significantly bad: * RAID is OK ``` cciss_vol_status --verbose /dev/sg0 Controller: Smart Array P420... [11:19:55] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:21:43] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [11:23:14] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [11:23:57] (03PS2) 10Dereckson: Reformatted comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285373 [11:28:10] (03PS1) 10Dereckson: Enable action=credits on test and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285374 (https://phabricator.wikimedia.org/T130820) [11:28:33] (03CR) 10jenkins-bot: [V: 04-1] Enable action=credits on test and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285374 (https://phabricator.wikimedia.org/T130820) (owner: 10Dereckson) [11:30:03] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:30:33] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:30:53] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [11:35:30] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [11:35:40] (03PS2) 10Dereckson: Enable action=credits on test and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285374 (https://phabricator.wikimedia.org/T130820) [11:36:42] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:37:00] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [11:37:07] (03PS1) 10Muehlenhoff: Move ferm rules into the role [puppet] - 10https://gerrit.wikimedia.org/r/285375 [11:38:01] PROBLEM - puppet last run on mw1148 is CRITICAL: CRITICAL: Puppet has 2 failures [11:38:11] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:42:00] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:43:39] !log restarting elasticsearch server elastic2009.codfw.wmnet - activating unicast (T110236) [11:43:40] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [11:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:45:41] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:45:45] (03PS1) 10Dereckson: Clean InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285377 [11:46:11] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [11:46:51] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [11:47:21] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [11:47:42] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [11:51:50] PROBLEM - puppet last run on mw2101 is CRITICAL: CRITICAL: puppet fail [11:52:01] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Puppet has 1 failures [11:52:31] PROBLEM - puppet last run on mc1012 is CRITICAL: CRITICAL: puppet fail [11:52:31] PROBLEM - puppet last run on mw1078 is CRITICAL: CRITICAL: puppet fail [11:58:21] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:58:41] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:59:21] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:59:51] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:02:43] PROBLEM - salt-minion processes on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:03:21] PROBLEM - Apache HTTP on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:03:32] PROBLEM - HHVM processes on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:31] PROBLEM - HHVM rendering on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:04:32] PROBLEM - RAID on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:04:42] mw1145's memory usage looks like mw1143 - https://grafana.wikimedia.org/dashboard/db/server-board?panelId=14&fullscreen&from=1461662458186&to=1461668404858&var-server=mw1143&var-network=eth0 [12:04:52] PROBLEM - SSH on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:05:01] RECOVERY - puppet last run on mw1148 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:05:12] PROBLEM - Check size of conntrack table on mw1145 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:05:32] PROBLEM - DPKG on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:05:51] PROBLEM - Disk space on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:06:11] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [12:06:42] PROBLEM - configured eth on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:07:01] PROBLEM - dhclient process on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:07:21] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [12:07:33] PROBLEM - nutcracker port on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:07:49] checking mgmt on mw1145 [12:07:52] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [12:07:52] PROBLEM - nutcracker process on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:08:02] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [12:08:11] PROBLEM - puppet last run on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:09:35] !log mw1145 powercycled (root login timeout on the console) [12:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:10:01] <_joe_> elukey: thanks, I am opening a ticket [12:10:46] hey there. I am going to cut the wmf branch over the afternoon. No idea when it will start branching though. [12:11:32] RECOVERY - Check size of conntrack table on mw1145 is OK: OK: nf_conntrack is 0 % full [12:11:32] RECOVERY - HHVM processes on mw1145 is OK: PROCS OK: 2 processes with command name hhvm [12:11:42] RECOVERY - DPKG on mw1145 is OK: All packages OK [12:11:51] RECOVERY - nutcracker port on mw1145 is OK: TCP OK - 0.000 second response time on port 11212 [12:12:02] RECOVERY - nutcracker process on mw1145 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [12:12:02] RECOVERY - Disk space on mw1145 is OK: DISK OK [12:12:21] RECOVERY - puppet last run on mw1145 is OK: OK: Puppet is currently enabled, last run 48 minutes ago with 0 failures [12:12:23] RECOVERY - salt-minion processes on mw1145 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:12:52] RECOVERY - RAID on mw1145 is OK: OK: no RAID installed [12:13:01] RECOVERY - configured eth on mw1145 is OK: OK - interfaces up [12:13:12] RECOVERY - SSH on mw1145 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [12:13:12] RECOVERY - dhclient process on mw1145 is OK: PROCS OK: 0 processes with command name dhclient [12:13:21] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.109 second response time [12:13:51] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:14:32] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:14:51] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:14:52] RECOVERY - HHVM rendering on mw1145 is OK: HTTP OK: HTTP/1.1 200 OK - 68206 bytes in 0.313 second response time [12:17:16] (03PS1) 10Dereckson: Clean InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285381 [12:18:08] 06Operations, 10MediaWiki-API, 07Availability, 07HHVM: HHVM is leaking memory on the API appservers - https://phabricator.wikimedia.org/T133674#2239113 (10Joe) [12:18:12] RECOVERY - puppet last run on mc1012 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [12:18:22] RECOVERY - puppet last run on mw2101 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [12:18:51] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [12:18:52] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [12:19:22] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:20:11] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [12:20:22] RECOVERY - puppet last run on mw1078 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:20:51] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:23:03] Checking partitions on my wdqs1001 reinstall I see that my partman recipe has not been taken into account (swap partition created, where I declare no swap and /var/lib/wdqs not created). [12:24:04] Could anyone have a look at https://github.com/wikimedia/operations-puppet/blob/production/modules/install_server/files/autoinstall/partman/lvm-wdqs.cfg and tell me if you see something obviously wrong? 
[12:25:12] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:25:31] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:26:42] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:27:29] !log restarting elasticsearch server elastic2010.codfw.wmnet - activating unicast (T110236) [12:27:30] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [12:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:29:41] (03CR) 10Addshore: [C: 031] Enable action=credits on test and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285374 (https://phabricator.wikimedia.org/T130820) (owner: 10Dereckson) [12:31:02] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [12:31:41] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [12:32:01] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [12:32:26] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [12:32:44] what's with citoid? [12:36:56] (03CR) 10Faidon Liambotis: "Why/how? Can you fix properly? :)" [puppet] - 10https://gerrit.wikimedia.org/r/285208 (owner: 10Faidon Liambotis) [12:37:35] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:38:31] akosiaris/mobrovac: ^^^ [12:38:36] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:38:46] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:39:23] paravoid: veni, vidi, will file a phab task as that's because of an external site having trouble [12:39:57] ok [12:40:06] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: puppet fail [12:43:15] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [12:45:45] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:50] 06Operations, 10media-storage: authoritative copy of 'root' files for upload.wikimedia.org is only in swift - https://phabricator.wikimedia.org/T130709#2239205 (10fgiunchedi) definitely, I've removed #traffic since it got inherited from the parent and not relevant [12:48:26] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [12:49:56] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:58:55] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [13:00:34] 06Operations, 10Traffic: Fix apache-2.4 + DHE ciphersuites issue - https://phabricator.wikimedia.org/T133217#2239219 (10MoritzMuehlenhoff) Apache on all jessie systems has been upgraded and restarted. [13:00:49] 06Operations, 10Traffic: Fix apache-2.4 + DHE ciphersuites issue - https://phabricator.wikimedia.org/T133217#2239220 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff>03None [13:02:30] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:04:38] (03PS1) 10Gehel: WDQS - Smaller /var/lib/wdqs partition [puppet] - 10https://gerrit.wikimedia.org/r/285387 (https://phabricator.wikimedia.org/T133566) [13:05:32] (03PS2) 10ArielGlenn: mark closed wikis as such on the main index.html page for downloads [dumps] - 10https://gerrit.wikimedia.org/r/285187 (https://phabricator.wikimedia.org/T133112) [13:06:29] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [13:06:42] (03CR) 10ArielGlenn: [C: 032] get_stub_files expects an int arg for file part number [dumps] - 10https://gerrit.wikimedia.org/r/285365 (owner: 10ArielGlenn) [13:07:10] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [13:07:18] (03CR) 10ArielGlenn: [C: 032] mark closed wikis as such on the main index.html page for downloads [dumps] - 10https://gerrit.wikimedia.org/r/285187 (https://phabricator.wikimedia.org/T133112) (owner: 10ArielGlenn) [13:07:20] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [13:07:40] (03CR) 10Gehel: [C: 032] WDQS - Smaller /var/lib/wdqs partition [puppet] - 10https://gerrit.wikimedia.org/r/285387 (https://phabricator.wikimedia.org/T133566) (owner: 10Gehel) [13:11:06] !log restarting elasticsearch server elastic2011.codfw.wmnet - activating unicast (T110236) [13:11:07] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [13:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:13:00] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:13:49] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:14:11] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:18:09] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [13:18:21] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [13:18:59] (03PS1) 10Alexandros Kosiaris: servermon: Add managed_puppet_modules parameter [puppet] - 10https://gerrit.wikimedia.org/r/285390 [13:21:40] (03PS1) 10Aude: Enable arbitrary access on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285391 (https://phabricator.wikimedia.org/T98307) [13:23:07] 06Operations: upgrade netmon1001 to jessie - https://phabricator.wikimedia.org/T125020#2239329 (10faidon) a:03akosiaris @akosiaris said he would do this at our last monitoring meeting :) [13:23:40] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [13:23:49] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [13:24:47] (03CR) 10Hoo man: [C: 04-1] "" Set allowDataAccessInUserLanguage for commons (needs to be announced first, probably)"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285391 (https://phabricator.wikimedia.org/T98307) (owner: 10Aude) [13:24:53] (03CR) 10Aude: "in a follow up patch, think we can remove the arbitraryaccess.dblist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285391 (https://phabricator.wikimedia.org/T98307) (owner: 10Aude) [13:25:10] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:25:32] 06Operations: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#1936565 (10faidon) >>! In T123728#2186572, @fgiunchedi wrote: > should be fairly straightforward, though we'd need big disks (I'd suggest raid1 2x4TB or raid10 4x4TB as it is mostly cold data anyway) To do what? We shoul... 
[13:26:56] (03PS2) 10Aude: Enable arbitrary access on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285391 (https://phabricator.wikimedia.org/T98307) [13:26:59] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:27:29] 06Operations, 10Analytics-Cluster: Migrate titanium to jessie (archiva.wikimedia.org upgrade) - https://phabricator.wikimedia.org/T123725#2239350 (10faidon) I don't think archiva is a runtime dependency on anything — but fyi, @Ottomata. [13:28:41] ACKNOWLEDGEMENT - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. alexandros kosiaris eutils.ncbi.nlm.nih.gov is misbehaving again. Scheduled downtime of 5 hours and a force acknowledgment - The acknowledgement expires at: 2016-04-27 18:27:37. [13:28:41] ACKNOWLEDGEMENT - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. alexandros kosiaris eutils.ncbi.nlm.nih.gov is misbehaving again. Scheduled downtime of 5 hours and a force acknowledgment - The acknowledgement expires at: 2016-04-27 18:27:37. [13:28:41] ACKNOWLEDGEMENT - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. alexandros kosiaris eutils.ncbi.nlm.nih.gov is misbehaving again. Scheduled downtime of 5 hours and a force acknowledgment - The acknowledgement expires at: 2016-04-27 18:27:37. [13:28:41] ACKNOWLEDGEMENT - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. alexandros kosiaris eutils.ncbi.nlm.nih.gov is misbehaving again. Scheduled downtime of 5 hours and a force acknowledgment - The acknowledgement expires at: 2016-04-27 18:27:37. [13:29:20] how difficult is it to not get unresponsive when one of the external sites is down? 
:) [13:30:05] 06Operations, 10Analytics-Cluster: Migrate titanium to jessie (archiva.wikimedia.org upgrade) - https://phabricator.wikimedia.org/T123725#2239356 (10MoritzMuehlenhoff) I'm on this. There's already a replacement VM (meitnerium) with archiva installed, the next is the migration of /var/lib/archiva from the titan... [13:30:17] I have no idea. I would argue it's a mistake to use for tests external sites [13:31:51] (03PS1) 10Elukey: Add conf200[123] Zookeeper Service nodes in codfw. [puppet] - 10https://gerrit.wikimedia.org/r/285393 (https://phabricator.wikimedia.org/T131959) [13:33:05] !log Checking out MediaWiki 1.27.0-wmf.22 on tin [13:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:33:31] !log Checking out MediaWiki 1.27.0-wmf.22 on tin | T131556 [13:33:31] (03PS3) 10Aude: Enable arbitrary access on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285391 (https://phabricator.wikimedia.org/T98307) [13:33:32] T131556: MW-1.27.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T131556 [13:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:36:35] 06Operations, 10MediaWiki-API, 07Availability, 07HHVM: HHVM is leaking memory on the API appservers - https://phabricator.wikimedia.org/T133674#2239394 (10Joe) p:05Triage>03High [13:37:07] paravoid: akosiaris: the NIH db is extensively used for medical refs unfortunately [13:37:21] (this is not about our test checks per se) [13:38:00] mobrovac: not sure I understand what the extensive use of the NIH db for medical refs has to do with our tests [13:38:07] (03PS2) 10Elukey: Add conf200[123] Zookeeper Service nodes in codfw. 
[puppet] - 10https://gerrit.wikimedia.org/r/285393 (https://phabricator.wikimedia.org/T131959) [13:38:12] if you mean, we can not ditch it [13:38:37] then fine, but let's not return an error if it is not working reliably 2 days in a month [13:38:39] akosiaris: a lot of reqs use it, so when our checks come along, they cannot establish a conn because the slots are taken [13:39:14] reqs ? reqs inbound to citoid from users you mean ? [13:39:58] (03PS4) 10BBlack: ssl_ciphersuite refactoring, jessie+apache DHE support [puppet] - 10https://gerrit.wikimedia.org/r/284518 (https://phabricator.wikimedia.org/T133217) [13:41:11] (03CR) 10jenkins-bot: [V: 04-1] ssl_ciphersuite refactoring, jessie+apache DHE support [puppet] - 10https://gerrit.wikimedia.org/r/284518 (https://phabricator.wikimedia.org/T133217) (owner: 10BBlack) [13:43:16] (03CR) 10Elukey: [C: 04-1] "Open issue:" [puppet] - 10https://gerrit.wikimedia.org/r/285393 (https://phabricator.wikimedia.org/T131959) (owner: 10Elukey) [13:44:13] 06Operations, 10MediaWiki-API, 07Availability, 07HHVM: HHVM is leaking memory on the API appservers - https://phabricator.wikimedia.org/T133674#2239414 (10Joe) The admin module of HHVM has quite a few diagnostics on the status of the memory usage in hhvm; So what I found on one of the affected machines, w... [13:44:34] (03PS5) 10BBlack: ssl_ciphersuite refactoring, jessie+apache DHE support [puppet] - 10https://gerrit.wikimedia.org/r/284518 (https://phabricator.wikimedia.org/T133217) [13:44:59] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [13:45:30] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [13:48:39] 06Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#2239441 (10fgiunchedi) promising indeed! we don't need php5-fss I think if we no longer need to run zend php, if we do need i... 
[13:51:09] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2239458 (10elukey) Submitted a code review with my understanding of the change, but I -1 it since I have doubts. The cleanest way to proceed in my opinion would be to move the zookeeper v... [13:52:05] 06Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#2239459 (10hashar) If php5-fss is solely Zend, yeah I guess we can phase it out from mediawiki::packages::php5 when running o... [13:59:30] (03PS6) 10BBlack: ssl_ciphersuite refactoring, jessie+apache DHE support [puppet] - 10https://gerrit.wikimedia.org/r/284518 (https://phabricator.wikimedia.org/T133217) [14:03:43] (03CR) 10Hoo man: [C: 031] "Assuming this has been appropriately announced." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285391 (https://phabricator.wikimedia.org/T98307) (owner: 10Aude) [14:09:00] (03PS7) 10BBlack: ssl_ciphersuite refactoring, jessie+apache DHE support [puppet] - 10https://gerrit.wikimedia.org/r/284518 (https://phabricator.wikimedia.org/T133217) [14:10:07] (03CR) 10Jcrespo: "T133588 Please if you merge, babysit the changes afterwards, not just apply and forget." 
[puppet] - 10https://gerrit.wikimedia.org/r/285208 (owner: 10Faidon Liambotis) [14:11:40] PROBLEM - Apache HTTP on mw1137 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.007 second response time [14:11:59] PROBLEM - HHVM rendering on mw1137 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.004 second response time [14:13:00] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 631 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5175624 keys - replication_delay is 631 [14:13:20] <_joe_> this is me (mw1137) [14:13:35] I was about to ask :) [14:13:55] (03CR) 10BBlack: [C: 032] "compiler-verified on a variety of OS+nginx|apache, works as intended" [puppet] - 10https://gerrit.wikimedia.org/r/284518 (https://phabricator.wikimedia.org/T133217) (owner: 10BBlack) [14:15:12] (03CR) 10Aude: "@hoo I think there have been sufficient announcements:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285391 (https://phabricator.wikimedia.org/T98307) (owner: 10Aude) [14:16:01] 06Operations, 10Continuous-Integration-Config, 10Dumps-Generation, 13Patch-For-Review, 07WorkType-Maintenance: operations/dumps repo should pass flake8 - https://phabricator.wikimedia.org/T114249#2239536 (10ArielGlenn) 05Open>03Resolved Actually the hosts are configured and deployed now (except for m... [14:17:11] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5117836 keys - replication_delay is 0 [14:18:52] 06Operations, 10Traffic, 13Patch-For-Review: Fix apache-2.4 + DHE ciphersuites issue - https://phabricator.wikimedia.org/T133217#2239542 (10BBlack) 05Open>03Resolved a:03BBlack thanks @MoritzMuehlenhoff ! 
[14:19:49] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [14:20:00] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.050 second response time [14:20:20] RECOVERY - HHVM rendering on mw1137 is OK: HTTP OK: HTTP/1.1 200 OK - 68209 bytes in 0.515 second response time [14:21:16] (03PS47) 10Ladsgroup: ores: Scap3 deployment configurations [puppet] - 10https://gerrit.wikimedia.org/r/280403 [14:21:40] (03PS1) 10Hashar: Group0 to 1.27.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285398 [14:22:40] 06Operations, 10MediaWiki-API, 07Availability, 07HHVM: HHVM is leaking memory on the API appservers - https://phabricator.wikimedia.org/T133674#2239564 (10Joe) I can confirm that with heap profiling activated HHVM crashes when memory profiling is activated. [14:24:50] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: puppet fail [14:25:20] RECOVERY - cassandra-a CQL 10.64.48.138:9042 on restbase1015 is OK: TCP OK - 0.000 second response time on port 9042 [14:26:08] !log hashar@tin Started scap: testwiki to php-1.27.0-wmf.22 and rebuild l10n cache [14:26:10] PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: puppet fail [14:26:12] (03CR) 10Elukey: [C: 04-1] "Puppet compiler seems to fail for analytics1003.eqiad.wmne:" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/284506 (https://phabricator.wikimedia.org/T133198) (owner: 10Ottomata) [14:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:27:51] (03CR) 10Luke081515: "Maybe enable it at test2wiki too?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/285374 (https://phabricator.wikimedia.org/T130820) (owner: 10Dereckson) [14:28:30] it seems apache/nginx hosts that use ssl_ciphersuite() will fail 1x puppet run, then succeed on the next, thanks to my change there and puppet requiring 2x runs to use an updated parser function correctly :P [14:28:41] so, expect a few puppetfails + autorecoveries [14:29:13] (palladium, stat1001, krypton above all in that category) [14:29:58] (03CR) 10Filippo Giunchedi: "LGTM overall, mostly nits" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) (owner: 10BBlack) [14:30:26] ack bblack! Thanks [14:30:30] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:34:10] PROBLEM - puppet last run on ytterbium is CRITICAL: CRITICAL: puppet fail [14:38:19] 06Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#2239629 (10fgiunchedi) sounds good, thanks @hashar, see also {T131749} which seems to have some overlap with this (general th... 
[14:38:52] !log restarting elasticsearch server elastic2012.codfw.wmnet - activating unicast (T110236) [14:38:53] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [14:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:40:23] (03CR) 10BBlack: [C: 04-1] "If we're going to allow the data to set a custom message, it would be simpler and less-confusing to always require a message, and make it " [puppet] - 10https://gerrit.wikimedia.org/r/285364 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [14:45:39] PROBLEM - HHVM rendering on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:45:50] here we go again [14:46:09] PROBLEM - Apache HTTP on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:47:33] 06Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#2239659 (10faidon) php5-fss shouldn't be needed even on Zend nowadays — Zend >= 5.5's native strtr() was made fast enough to... 
[14:47:41] (03PS2) 10Dzahn: udp2log: Move ferm rules into the role [puppet] - 10https://gerrit.wikimedia.org/r/285375 (owner: 10Muehlenhoff) [14:48:04] (03CR) 10Dzahn: [C: 031] udp2log: Move ferm rules into the role [puppet] - 10https://gerrit.wikimedia.org/r/285375 (owner: 10Muehlenhoff) [14:48:57] (03CR) 10Thcipriani: "Added to puppet SWAT 2016-04-26" [puppet] - 10https://gerrit.wikimedia.org/r/282441 (https://phabricator.wikimedia.org/T110407) (owner: 10Mattflaschen) [14:50:30] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:51:49] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [14:53:56] 06Operations, 10Citoid, 06Services, 10VisualEditor: NIH db misbehaviour causing problems to Citoid - https://phabricator.wikimedia.org/T133696#2239664 (10mobrovac) [14:55:37] !log hashar@tin Finished scap: testwiki to php-1.27.0-wmf.22 and rebuild l10n cache (duration: 29m 28s) [14:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:59:30] (03PS1) 10Rush: dnsrecursor update module and template [puppet] - 10https://gerrit.wikimedia.org/r/285402 (https://phabricator.wikimedia.org/T124680) [14:59:47] (03PS2) 10Rush: dnsrecursor update module and template [puppet] - 10https://gerrit.wikimedia.org/r/285402 (https://phabricator.wikimedia.org/T124680) [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160426T1500). 
[15:00:17] 06Operations, 06Analytics-Kanban, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2239690 (10Nuria) p:05Triage>03High [15:00:38] Luke081515: for test2, I'm not sure: https://www.mediawiki.org/wiki/Manual:$wgActions states "Note Note: Unsetting core actions will probably cause things to complain loudly." [15:01:09] For the same rationale we want $wgActions credits on test, we currently probably want to keep a test site without credits too [15:01:10] Dereckson: Ok, then test "1" is enough ;) [15:01:54] (03CR) 10Luke081515: [C: 031] "Discussed at IRC, only at test "1" is enough." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285374 (https://phabricator.wikimedia.org/T130820) (owner: 10Dereckson) [15:02:00] RECOVERY - puppet last run on ytterbium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:02:01] 06Operations, 06Analytics-Kanban, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2197243 (10Nuria) @BBlack : can you confirm whether is OK with ops to deploy this domain to 1001? [15:02:55] (03PS3) 10Elukey: Add a maintenance flag to cache::misc directors. [puppet] - 10https://gerrit.wikimedia.org/r/285364 (https://phabricator.wikimedia.org/T76348) [15:03:10] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Additional diskspace of wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T119579#2239695 (10Gehel) [15:03:13] 06Operations, 10ops-eqiad, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: install two Intel 320 Series SSDSA2CW300G3 2.5" 300GB each in wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T120712#2239693 (10Gehel) 05Resolved>03Open [15:04:03] (03CR) 10Faidon Liambotis: "I did not forget -- I'm not very familiar with that whole sync thing so it's hard to babysit. 
This is why you were a reviewer in the first" [puppet] - 10https://gerrit.wikimedia.org/r/285208 (owner: 10Faidon Liambotis) [15:04:46] Dereckson: I can SWAT if you want someone else running this one (/me hadn't page refreshed in a bit, didn't see there were patches until just now) [15:05:14] 06Operations, 10ops-eqiad, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: install two Intel 320 Series SSDSA2CW300G3 2.5" 300GB each in wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T120712#1859742 (10Gehel) It seems that 300G disks were added to wdqs1002, but only 150G disks to wdqs1001: ``... [15:05:55] If you want. There is a convertion from array to short syntax, it should be no op but I'd suggest we try it on mw1107 to be absolutely sure all is fine. [15:06:05] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2239716 (10Ladsgroup) [15:06:11] 07Puppet, 10Beta-Cluster-Infrastructure, 06Revision-Scoring-As-A-Service, 03Scap3: deployment-((sca|aqs)01|ores-web) puppet failures due to scap3 errors - https://phabricator.wikimedia.org/T132267#2239713 (10Ladsgroup) 05Open>03Resolved a:03Ladsgroup [15:07:17] Dereckson: yup. looks fun :) OK, doing. [15:08:02] 07Puppet, 10Beta-Cluster-Infrastructure, 03Scap3: deployment-((sca|aqs)01|ores-web) puppet failures due to scap3 errors - https://phabricator.wikimedia.org/T132267#2239758 (10Ladsgroup) 05Resolved>03Open [15:09:39] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285361 (owner: 10Dereckson) [15:09:41] (03CR) 10Rush: [C: 032] dnsrecursor update module and template [puppet] - 10https://gerrit.wikimedia.org/r/285402 (https://phabricator.wikimedia.org/T124680) (owner: 10Rush) [15:09:54] (03CR) 10Faidon Liambotis: [C: 04-1] "Can we really push this logic outside of modules/ganglia?" 
[puppet] - 10https://gerrit.wikimedia.org/r/284912 (https://phabricator.wikimedia.org/T115330) (owner: 10Filippo Giunchedi) [15:10:10] (03Merged) 10jenkins-bot: Switch to short syntax array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285361 (owner: 10Dereckson) [15:11:09] hashar: can I reset the wikiversions.json file on tin? [15:11:16] thcipriani: yes please [15:11:24] the sync is complete [15:11:33] (03CR) 10Faidon Liambotis: [C: 031] "This looks okay superficially, but without digging in the details of this whole plan, it's hard to know for sure and be a solid reviewer. " [puppet] - 10https://gerrit.wikimedia.org/r/282160 (owner: 10Muehlenhoff) [15:11:38] I am warming up HHVM hitting test.wikipedia.org :-} [15:12:48] Dereckson: ok, should be live on mw1017 [15:15:02] test.wikipedia.org is very slow to generate pages, homepage then Special:Version, it's the time HHVM puts stuff in cache? [15:15:22] Dereckson: yeah I have deployed it minutes ago and HHVM bytecode caches are cold [15:15:28] (03CR) 10Muehlenhoff: "So it's spread across several Phab tasks and partly outdated, I'll write something up to date on wikitech in the next days." [puppet] - 10https://gerrit.wikimedia.org/r/282160 (owner: 10Muehlenhoff) [15:15:35] so they got to compile the whole mess which takes ~30-40 seconds [15:16:46] looks good [15:16:53] doing a sync-file php-1.27.0-wmf.22/includes/DefaultSettings.php "Set $wgVersion = 1.27.0-wmf.22" [15:17:03] 06Operations, 10ops-eqiad: ship single dell 500GB sata to ulsfo - https://phabricator.wikimedia.org/T133699#2239809 (10RobH) [15:17:40] !log hashar@tin Synchronized php-1.27.0-wmf.22/includes/DefaultSettings.php: Set = 1.27.0-wmf.22 (duration: 01m 04s) [15:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:18:10] hashar: all clear? (/me got some patches to SWAT) [15:18:16] thcipriani: yeah done [15:18:19] kk [15:18:23] Dereckson: alright, here goes. 
[15:20:49] !log thcipriani@tin Synchronized wmf-config: SWAT: Switch to short syntax array [[gerrit:285361]] (duration: 00m 36s) [15:20:53] ^ Dereckson done [15:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:21:30] The world still exists, we're fine. [15:21:50] :) [15:22:21] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285373 (owner: 10Dereckson) [15:22:36] !log 1.27.0-wmf.22 HHVM cache is warmed up | T131556 [15:22:37] T131556: MW-1.27.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T131556 [15:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:22:48] (03Merged) 10jenkins-bot: Reformatted comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285373 (owner: 10Dereckson) [15:24:32] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: Reformatted comment [[gerrit:285373]] (duration: 00m 28s) [15:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:24:41] !log restarting elasticsearch server elastic2013.codfw.wmnet - activating unicast (T110236) [15:24:42] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [15:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:25:00] PROBLEM - puppet last run on mw2193 is CRITICAL: CRITICAL: puppet fail [15:25:12] !log OS installation on graphite2002 [15:25:13] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285374 (https://phabricator.wikimedia.org/T130820) (owner: 10Dereckson) [15:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:25:37] (03CR) 10Filippo Giunchedi: [C: 04-1] "indeed that'd be better and should be already the case actually from 'standard':" [puppet] - 10https://gerrit.wikimedia.org/r/284912 
(https://phabricator.wikimedia.org/T115330) (owner: 10Filippo Giunchedi) [15:25:39] (03Merged) 10jenkins-bot: Enable action=credits on test and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285374 (https://phabricator.wikimedia.org/T130820) (owner: 10Dereckson) [15:27:30] (03PS1) 10Faidon Liambotis: ganglia: add codfw/esams/ulsfo under ganglia_clusters/cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/285409 [15:27:42] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable action=credits on test and beta [[gerrit:285374]] (duration: 00m 29s) [15:27:45] Testing. [15:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:55] thcipriani: works [15:28:03] Dereckson: thanks [15:28:24] (on test, for labs, that'll wait jenkins job to propagate config) [15:28:58] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285377 (owner: 10Dereckson) [15:29:24] (03CR) 10Faidon Liambotis: "So the issue is probably that there is no else clause to include the Ganglia decom class. Let's add that instead?" 
[puppet] - 10https://gerrit.wikimedia.org/r/284912 (https://phabricator.wikimedia.org/T115330) (owner: 10Filippo Giunchedi) [15:29:29] (03Merged) 10jenkins-bot: Clean InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285377 (owner: 10Dereckson) [15:29:52] addshore: enjoy http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page?action=credits [15:30:02] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285381 (owner: 10Dereckson) [15:30:48] (03CR) 10BBlack: letsencrypt module guts + acme-setup script (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) (owner: 10BBlack) [15:31:25] (03PS2) 10Thcipriani: Clean InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285381 (owner: 10Dereckson) [15:32:19] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285381 (owner: 10Dereckson) [15:32:47] (03Merged) 10jenkins-bot: Clean InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285381 (owner: 10Dereckson) [15:33:03] (03PS3) 10BBlack: apt|mirrors|ubuntu: puppetized LE certs [puppet] - 10https://gerrit.wikimedia.org/r/285196 (https://phabricator.wikimedia.org/T132450) [15:33:05] (03PS18) 10BBlack: letsencrypt module guts + acme-setup script [puppet] - 10https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) [15:33:59] (03CR) 10BBlack: [C: 031] Add a maintenance flag to cache::misc directors. 
[puppet] - 10https://gerrit.wikimedia.org/r/285364 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [15:34:24] (03CR) 10jenkins-bot: [V: 04-1] apt|mirrors|ubuntu: puppetized LE certs [puppet] - 10https://gerrit.wikimedia.org/r/285196 (https://phabricator.wikimedia.org/T132450) (owner: 10BBlack) [15:34:53] 06Operations, 10Continuous-Integration-Config, 10Dumps-Generation, 13Patch-For-Review, 07WorkType-Maintenance: operations/dumps repo should pass flake8 - https://phabricator.wikimedia.org/T114249#2239865 (10hashar) Neat! Well done Ariel [15:34:58] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Clean InitialiseSettings.php [[gerrit:285381]] (duration: 00m 27s) [15:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:38] (03PS4) 10BBlack: apt|mirrors|ubuntu: puppetized LE certs [puppet] - 10https://gerrit.wikimedia.org/r/285196 (https://phabricator.wikimedia.org/T132450) [15:35:45] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup new host graphite2002 - https://phabricator.wikimedia.org/T130938#2239868 (10Papaul) a:05RobH>03fgiunchedi I am getting the message below during installation, I chat with Filippo on IRC, he will be taking over the installation. No file system is s... [15:35:57] Dereckson: all sync'd, nice cleanup! [15:36:04] Thank you for the deployment and assistance with this spring cleaning. [15:36:21] :D [15:40:09] thcipriani: ohhhhh I just noticed https://gerrit.wikimedia.org/r/#/c/282441/ :) [15:40:16] (03PS1) 10Rush: dnsrecursor keep stats on last 1000 queries [puppet] - 10https://gerrit.wikimedia.org/r/285412 (https://phabricator.wikimedia.org/T124680) [15:40:40] (03Abandoned) 10Hashar: beta: update db script strip output on error [puppet] - 10https://gerrit.wikimedia.org/r/270902 (https://phabricator.wikimedia.org/T110407) (owner: 10Hashar) [15:40:48] I have abandonned my other patch [15:40:54] hashar: yup. 
Ran into again the other day, I knew there was a fix out there somewhere. [15:40:59] (03PS2) 10Rush: dnsrecursor keep stats on last 1000 queries [puppet] - 10https://gerrit.wikimedia.org/r/285412 (https://phabricator.wikimedia.org/T124680) [15:42:15] thcipriani: godog: coreyfloyd: I have proposed a bunch of patches for Puppet SWAT but can't stay around due to family duties (dinner, school etc). None have any impact on prod and all are already cherry picked on labs CI puppet master :-} [15:42:25] (03PS2) 10Filippo Giunchedi: ganglia: don't run ganglia-monitor in labs [puppet] - 10https://gerrit.wikimedia.org/r/284912 (https://phabricator.wikimedia.org/T115330) [15:43:23] (03CR) 10BBlack: [C: 031] "LGTM in compiler again" [puppet] - 10https://gerrit.wikimedia.org/r/285196 (https://phabricator.wikimedia.org/T132450) (owner: 10BBlack) [15:43:39] hashar: ok thanks! will it be always a bad time for you for puppet swat btw? anyways I'll probably merge [15:44:11] godog: yeah always that is 18:00-19:00 here and it is a terrible time with two kids ;-} [15:44:28] those little things tends to be hungry at that hour and are literally begging for food [15:44:43] feel free to skip anything which is not straightforward or for which you have doubt [15:45:24] eheh ok thanks [15:45:29] (03CR) 10BBlack: [C: 032] letsencrypt module guts + acme-setup script [puppet] - 10https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) (owner: 10BBlack) [15:45:53] (03CR) 10BBlack: [C: 032] "One way to find out!" [puppet] - 10https://gerrit.wikimedia.org/r/285196 (https://phabricator.wikimedia.org/T132450) (owner: 10BBlack) [15:46:05] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/2557/ shows no changes for cp1052,cp1044,cp1043 as expected since the flag is not enabled by default." 
[puppet] - 10https://gerrit.wikimedia.org/r/285364 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [15:46:40] (03CR) 10Jgreen: Enable two-factor authentication in sshd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/282160 (owner: 10Muehlenhoff) [15:47:54] (03PS3) 10Rush: dnsrecursor keep stats on last 1000 queries [puppet] - 10https://gerrit.wikimedia.org/r/285412 (https://phabricator.wikimedia.org/T124680) [15:48:16] (03CR) 10Filippo Giunchedi: [C: 031] "tested on a labs instance, noop in prod: https://puppet-compiler.wmflabs.org/2558/" [puppet] - 10https://gerrit.wikimedia.org/r/284912 (https://phabricator.wikimedia.org/T115330) (owner: 10Filippo Giunchedi) [15:48:26] 06Operations: API apache servers OOMing: mw1134 mw1132 mw1139 mw1138 - https://phabricator.wikimedia.org/T132845#2239919 (10Andrew) [15:48:28] 06Operations, 10MediaWiki-API, 07Availability, 07HHVM: HHVM is leaking memory on the API appservers - https://phabricator.wikimedia.org/T133674#2239920 (10Andrew) [15:49:55] (03PS4) 10Rush: dnsrecursor keep stats on last 1000 queries [puppet] - 10https://gerrit.wikimedia.org/r/285412 (https://phabricator.wikimedia.org/T124680) [15:50:24] (03CR) 10Rush: [C: 032 V: 032] dnsrecursor keep stats on last 1000 queries [puppet] - 10https://gerrit.wikimedia.org/r/285412 (https://phabricator.wikimedia.org/T124680) (owner: 10Rush) [15:50:48] (03CR) 10Dereckson: "Should be manually rebased per 0a416f988 commit, switching array syntax to short notation." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285008 (https://phabricator.wikimedia.org/T133440) (owner: 10Urbanecm) [15:51:13] (03PS2) 10Dereckson: Enable DynamicPageList extension on tewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285009 (https://phabricator.wikimedia.org/T133032) (owner: 10Urbanecm) [15:52:07] (03CR) 10Nuria: Modify the default Varnish error page to increase visibility of error messages. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/285363 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [15:53:00] (03PS3) 10Dereckson: Enable DynamicPageList extension on tewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285009 (https://phabricator.wikimedia.org/T133032) (owner: 10Urbanecm) [15:53:16] (03PS1) 10BBlack: LE: include rather than require sslcert [puppet] - 10https://gerrit.wikimedia.org/r/285416 (https://phabricator.wikimedia.org/T132812) [15:53:18] (03CR) 10Dereckson: "PS2: rebased, PS3: adding bug identifier." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285009 (https://phabricator.wikimedia.org/T133032) (owner: 10Urbanecm) [15:53:44] (03PS2) 10BBlack: LE: include rather than require sslcert [puppet] - 10https://gerrit.wikimedia.org/r/285416 (https://phabricator.wikimedia.org/T132812) [15:53:51] (03CR) 10BBlack: [C: 032 V: 032] LE: include rather than require sslcert [puppet] - 10https://gerrit.wikimedia.org/r/285416 (https://phabricator.wikimedia.org/T132812) (owner: 10BBlack) [15:55:00] RECOVERY - puppet last run on mw2193 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:05] (03CR) 10Elukey: Modify the default Varnish error page to increase visibility of error messages. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/285363 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [15:55:30] (03CR) 10Dereckson: [C: 031] Enable DynamicPageList extension on tewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285009 (https://phabricator.wikimedia.org/T133032) (owner: 10Urbanecm) [15:57:08] LE on carbon worked! [15:57:44] \o/ still amazing to see previously-paid-for bits now being free [15:58:38] (03PS1) 10Rush: labs dnsrecurser update settings [puppet] - 10https://gerrit.wikimedia.org/r/285418 (https://phabricator.wikimedia.org/T124680) [15:58:44] well done bblack ! 
time for a blog post ;-} [15:59:29] <_joe_> hashar: not really, there are thousands of implementations of LE automation out there [15:59:45] <_joe_> bblack does things that are _way_ cooler than that :) [16:00:02] <_joe_> and yes, he should write some blog posts [16:00:04] godog coreyfloyd: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160426T1600). Please do the needful. [16:00:04] hashar thcipriani: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:07] <_joe_> or a manual :) [16:00:10] (03PS1) 10BBlack: LE: fix "creates" path on first exec [puppet] - 10https://gerrit.wikimedia.org/r/285419 (https://phabricator.wikimedia.org/T132812) [16:00:29] <_joe_> bbl [16:00:45] o/ [16:00:46] (03PS2) 10BBlack: LE: fix "creates" path on first exec [puppet] - 10https://gerrit.wikimedia.org/r/285419 (https://phabricator.wikimedia.org/T132812) [16:01:00] (03CR) 10BBlack: [C: 032 V: 032] LE: fix "creates" path on first exec [puppet] - 10https://gerrit.wikimedia.org/r/285419 (https://phabricator.wikimedia.org/T132812) (owner: 10BBlack) [16:01:09] (03PS2) 10Rush: labs dnsrecurser update settings [puppet] - 10https://gerrit.wikimedia.org/r/285418 (https://phabricator.wikimedia.org/T124680) [16:01:33] alright thcipriani I'll start puppet swat with you https://gerrit.wikimedia.org/r/#/c/282441/ [16:01:41] godog: sounds good [16:02:19] thcipriani: still has an unaddressed comment from you tho? [16:02:51] (03CR) 10Rush: [C: 032] labs dnsrecurser update settings [puppet] - 10https://gerrit.wikimedia.org/r/285418 (https://phabricator.wikimedia.org/T124680) (owner: 10Rush) [16:02:54] yeah the LE puppetization we've done here is nothing new for the world in general. It's just nice to have it fully-automated in a way that works here, for at least some cases. 
[16:03:14] why are we importing openjdk into our repo instead of using backports again? [16:03:19] godog: yeah, not really worried about it, just a style thing. having output on failure is more important to me :) [16:03:21] is that you, moritzm? [16:03:38] thcipriani: imho the f.seek(0) should happen after p.wait() [16:04:23] (03PS1) 10Eevans: cassandra: add restbase1015-b [puppet] - 10https://gerrit.wikimedia.org/r/285420 (https://phabricator.wikimedia.org/T128107) [16:04:24] 06Operations, 10Traffic, 07HTTPS: Preload STS for wikimedia.org - https://phabricator.wikimedia.org/T132685#2239986 (10BBlack) [16:04:27] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2239987 (10BBlack) [16:04:29] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2239984 (10BBlack) 05Open>03Resolved a:03BBlack [16:04:35] i.e. it looks fine to me [16:05:12] ah, yeah, that could scramble some output. Yeah, looks fine. [16:05:31] paravoid: that was me, myself and moritzm talked about it but openjdk-8 isn't in testing yet, I'll upload to jessie-backports as soon as that happens [16:05:53] (03CR) 10Dereckson: [C: 04-1] "i. 
This is not only a translation, but a new namespace request, and so need to be discussed at https://diq.wikipedia.org/wiki/Wikipedia:Po" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284866 (owner: 10Raimond Spekking) [16:06:14] thcipriani hashar ok merging [16:06:42] (03PS2) 10Filippo Giunchedi: Fix Beta update.php error handling [puppet] - 10https://gerrit.wikimedia.org/r/282441 (https://phabricator.wikimedia.org/T110407) (owner: 10Mattflaschen) [16:06:49] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Fix Beta update.php error handling [puppet] - 10https://gerrit.wikimedia.org/r/282441 (https://phabricator.wikimedia.org/T110407) (owner: 10Mattflaschen) [16:08:10] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Sort out letsencrypt puppetization for simple public hosts - https://phabricator.wikimedia.org/T132812#2240008 (10BBlack) Status Update: `letsencrypt::cert::integrated` seems to work as expected, and is managing 3x LE certs on carbon with automatic prov... [16:08:32] (03PS2) 10Dereckson: Add namespace translation 'Portal' for diq [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284866 (https://phabricator.wikimedia.org/T133702) (owner: 10Raimond Spekking) [16:08:55] (03CR) 10jenkins-bot: [V: 04-1] Add namespace translation 'Portal' for diq [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284866 (https://phabricator.wikimedia.org/T133702) (owner: 10Raimond Spekking) [16:10:25] godog: gotta head to kid school. My patch should be fine. The only one that might be of a concern is https://gerrit.wikimedia.org/r/#/c/276346/ - hiera_lookup: recognize labs project and site . But it is not that important ;-} [16:10:49] hashar: ok! 
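The f.seek(0) / p.wait() point raised at 16:03:38 can be sketched as follows. This is a minimal sketch, not the code from the patch under review: `run_and_capture` is a hypothetical helper, and it assumes the script sends the child's output to a temporary file and reads it back afterwards (for example to print it on failure).

```python
import subprocess
import sys
import tempfile


def run_and_capture(cmd):
    """Run cmd with stdout/stderr sent to a temp file; return (exit code, output).

    Hypothetical helper for illustration only.
    """
    with tempfile.TemporaryFile(mode="w+") as f:
        # The child inherits the file descriptor, so parent and child share
        # one file offset for as long as the child is running.
        p = subprocess.Popen(cmd, stdout=f, stderr=subprocess.STDOUT)
        # Wait first: rewinding while the child is still writing would move
        # the shared offset and could scramble the captured output.
        returncode = p.wait()
        # The child has exited, so it is now safe to rewind and read.
        f.seek(0)
        return returncode, f.read()


if __name__ == "__main__":
    code, output = run_and_capture(
        [sys.executable, "-c", "import sys; print('ok'); sys.exit(3)"]
    )
    if code != 0:
        # Having the output on failure is the point made in the discussion.
        sys.stderr.write("command failed with exit code %d:\n%s" % (code, output))
```

Keeping the seek after the wait means the read always starts from a file the child has finished writing.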
[16:12:17] godog: ah ok [16:13:59] (03PS9) 10Filippo Giunchedi: contint: rsync server to hold jobs caches [puppet] - 10https://gerrit.wikimedia.org/r/253322 (https://phabricator.wikimedia.org/T116017) (owner: 10Hashar) [16:14:07] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] contint: rsync server to hold jobs caches [puppet] - 10https://gerrit.wikimedia.org/r/253322 (https://phabricator.wikimedia.org/T116017) (owner: 10Hashar) [16:15:20] !log restarting elasticsearch server elastic2014.codfw.wmnet - activating unicast (T110236) [16:15:22] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [16:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:16:29] (03PS5) 10Filippo Giunchedi: contint: drop integration/phpcs [puppet] - 10https://gerrit.wikimedia.org/r/285226 (owner: 10Hashar) [16:16:35] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] contint: drop integration/phpcs [puppet] - 10https://gerrit.wikimedia.org/r/285226 (owner: 10Hashar) [16:28:20] (03PS5) 10Filippo Giunchedi: contint: decouple slave_scripts and composer [puppet] - 10https://gerrit.wikimedia.org/r/285235 (https://phabricator.wikimedia.org/T128092) (owner: 10Hashar) [16:28:28] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] contint: decouple slave_scripts and composer [puppet] - 10https://gerrit.wikimedia.org/r/285235 (https://phabricator.wikimedia.org/T128092) (owner: 10Hashar) [16:32:15] (03PS2) 10Filippo Giunchedi: cassandra: add restbase1015-b [puppet] - 10https://gerrit.wikimedia.org/r/285420 (https://phabricator.wikimedia.org/T128107) (owner: 10Eevans) [16:32:23] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase1015-b [puppet] - 10https://gerrit.wikimedia.org/r/285420 (https://phabricator.wikimedia.org/T128107) (owner: 10Eevans) [16:33:31] 06Operations, 10ops-eqiad: ship single dell 500GB sata to ulsfo - https://phabricator.wikimedia.org/T133699#2240233 (10RobH) p:05Normal>03Low [16:33:32] 
(03PS29) 1020after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) [16:34:25] (03PS1) 10Filippo Giunchedi: cassandra: fix restbase1015-b indentation [puppet] - 10https://gerrit.wikimedia.org/r/285427 [16:35:09] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: fix restbase1015-b indentation [puppet] - 10https://gerrit.wikimedia.org/r/285427 (owner: 10Filippo Giunchedi) [16:36:54] (03PS2) 10Urbanecm: Add Subject namespace to hiwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285008 (https://phabricator.wikimedia.org/T133440) [16:39:09] (03CR) 10Urbanecm: "@Dereckson: Rebased." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285008 (https://phabricator.wikimedia.org/T133440) (owner: 10Urbanecm) [16:39:43] (03CR) 10jenkins-bot: [V: 04-1] Add Subject namespace to hiwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285008 (https://phabricator.wikimedia.org/T133440) (owner: 10Urbanecm) [16:42:50] (03PS3) 10Urbanecm: Add Subject namespace to hiwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285008 (https://phabricator.wikimedia.org/T133440) [16:43:10] !log start restbase1014-[ab] cleanup [16:43:12] (03CR) 10jenkins-bot: [V: 04-1] Add Subject namespace to hiwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285008 (https://phabricator.wikimedia.org/T133440) (owner: 10Urbanecm) [16:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:44:40] (03PS4) 10Urbanecm: Add Subject namespace to hiwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285008 (https://phabricator.wikimedia.org/T133440) [16:45:07] (03CR) 10jenkins-bot: [V: 04-1] Add Subject namespace to hiwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285008 (https://phabricator.wikimedia.org/T133440) (owner: 10Urbanecm) [16:45:27] !log restarting elasticsearch server 
elastic2015.codfw.wmnet - activating unicast (T110236) [16:45:28] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [16:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:47:00] (03PS5) 10Urbanecm: Add Subject namespace to hiwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285008 (https://phabricator.wikimedia.org/T133440) [16:48:38] _joe_: I honk there was a editing mistake on the wiki. I have nothing to deploy today. [16:48:57] (03CR) 10Dereckson: [C: 031] Add Subject namespace to hiwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285008 (https://phabricator.wikimedia.org/T133440) (owner: 10Urbanecm) [16:49:32] _joe_: I also think. Not just honk. [16:51:56] (03PS30) 1020after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) [16:57:06] <_joe_> coreyfloyd: cool [17:00:04] yurik gwicke cscott arlolra subbu: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160426T1700). [17:02:01] might be able to finish maps upgrade in time, but might have to postpone for tomorrow [17:02:31] especially considering that there might be a new depl (scap3) system involved [17:03:58] 07Blocked-on-Operations, 10OOjs-UI, 05Continuous-Integration-Scaling, 13Patch-For-Review, 07WorkType-NewFunctionality: Provide composer on the nodepool servers so OOjs UI can use it in the npm job - https://phabricator.wikimedia.org/T128092#2240335 (10hashar) I have rebuild the Trusty image applying http... 
[17:04:28] no parsoid deploy [17:08:38] PROBLEM - cassandra-b CQL 10.64.48.139:9042 on restbase1015 is CRITICAL: Connection refused [17:10:43] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.48.139:9042 on restbase1015 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping [17:11:53] (03PS31) 1020after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) [17:16:09] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 xe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave] [17:17:34] (03CR) 10Tim Landscheidt: "How will this display in the UI? For example, currently in Horizon I see "http://tools-checker-02.tools.eqiad.wmflabs:80" as "Backends" f" [puppet] - 10https://gerrit.wikimedia.org/r/285288 (https://phabricator.wikimedia.org/T133554) (owner: 10Alex Monk) [17:17:58] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0 xe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave] [17:27:09] (03CR) 10Alex Monk: "That will use an IP instead of a hostname, yes" [puppet] - 10https://gerrit.wikimedia.org/r/285288 (https://phabricator.wikimedia.org/T133554) (owner: 10Alex Monk) [17:28:18] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 690.13 seconds [17:30:15] mmm [17:31:31] ah, that's me [17:34:20] 06Operations: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#2240433 (10fgiunchedi) I'd assumed another machine to have in parallel, fluorine is old and its partitioning is a mess, no partman recipe either. Other than that the puppet manifest seem fairly straightforward. We could t... 
[17:38:08] !log restarting elasticsearch server elastic2016.codfw.wmnet - activating unicast (T110236) [17:38:09] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [17:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:39:19] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 264.52 seconds [17:52:29] 07Puppet, 10Beta-Cluster-Infrastructure, 03Scap3: deployment-((sca|aqs)01|ores-web) puppet failures due to scap3 errors - https://phabricator.wikimedia.org/T132267#2240491 (10mmodell) Puppet status: |status |host |detail| |{icon check color=green}|deployment-sca01 |Success| |{ico... [17:54:54] 06Operations, 10Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2240497 (10BBlack) [17:55:07] 06Operations, 10Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2240512 (10BBlack) [17:55:25] !log restarting elasticsearch server elastic2017.codfw.wmnet - activating unicast (T110236) [17:55:26] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [17:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:57:16] 06Operations, 10Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2240517 (10Southparkfan) [17:58:18] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:01:02] (03PS3) 10Dzahn: planet: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285307 [18:03:24] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:04:05] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [18:05:15] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [18:15:52] YuviPanda: does watroles work currently? or do i use it wrong again [18:16:08] mutante: it should work yeah [18:16:16] mutante: http://tools.wmflabs.org/watroles/role/role::puppet::self [18:16:20] 07Blocked-on-Operations, 10OOjs-UI, 05Continuous-Integration-Scaling, 13Patch-For-Review, 07WorkType-NewFunctionality: Provide composer on the nodepool servers so OOjs UI can use it in the npm job - https://phabricator.wikimedia.org/T128092#2240615 (10hashar) ``` jenkins@ci-trusty-wikimedia-84003:~$ whic... [18:16:35] YuviPanda: thank you. yes, i used it wrong again :p [18:16:46] :D [18:17:41] 07Blocked-on-Operations, 10OOjs-UI, 05Continuous-Integration-Scaling, 13Patch-For-Review, 07WorkType-NewFunctionality: Provide composer on the nodepool servers so OOjs UI can use it in the npm job - https://phabricator.wikimedia.org/T128092#2240621 (10Jdforrester-WMF) So… Resolved? [18:19:45] "3x LE certs on carbon with automatic provisioning and renewal (no humans are harmed in the making of these simple outputs of math functions)." haha, very nice! 
[18:29:54] (03CR) 10Dzahn: [C: 032] "changes only class name and motd and nothing uses the role in labs" [puppet] - 10https://gerrit.wikimedia.org/r/285307 (owner: 10Dzahn) [18:33:46] (03PS2) 10Dzahn: racktables: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285308 [18:39:32] stashbot, tell jouncebot to restart grrrit-wm [18:39:38] thx [18:40:17] (03PS2) 10Dzahn: planet: adjust debdeploy hiera after class renaming [puppet] - 10https://gerrit.wikimedia.org/r/285437 [18:40:22] (03CR) 10Dzahn: [C: 032] planet: adjust debdeploy hiera after class renaming [puppet] - 10https://gerrit.wikimedia.org/r/285437 (owner: 10Dzahn) [18:40:39] LOL [18:40:47] * legoktm quips [18:41:28] !log restarting HHVM on terbium to enable upgrade to 3.12.1 (T132751) [18:41:29] T132751: upgrade HHVM to 3.12.1 on terbium (Elastica: missing curl_init_pooled method.) - https://phabricator.wikimedia.org/T132751 [18:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:44:07] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 2 others: upgrade HHVM to 3.12.1 on terbium (Elastica: missing curl_init_pooled method.) 
- https://phabricator.wikimedia.org/T132751#2240715 (10Gehel) 05Open>03Resolved [18:44:19] (03PS3) 10Dzahn: ganglia: don't run ganglia-monitor in labs [puppet] - 10https://gerrit.wikimedia.org/r/284912 (https://phabricator.wikimedia.org/T115330) (owner: 10Filippo Giunchedi) [18:45:20] (03CR) 10Dzahn: [C: 031] "at some point in history there used to be http://ganglia.wmflabs.org/ but that is gone" [puppet] - 10https://gerrit.wikimedia.org/r/284912 (https://phabricator.wikimedia.org/T115330) (owner: 10Filippo Giunchedi) [18:48:27] (03CR) 10Dzahn: [C: 032] ganglia: don't run ganglia-monitor in labs [puppet] - 10https://gerrit.wikimedia.org/r/284912 (https://phabricator.wikimedia.org/T115330) (owner: 10Filippo Giunchedi) [18:49:42] (03CR) 10Raimond Spekking: "@dereckson: "This is not only a translation, but a new namespace request": My fault, I thought the portal ns is default for all WMF wkis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284866 (https://phabricator.wikimedia.org/T133702) (owner: 10Raimond Spekking) [18:50:42] (03PS1) 10BBlack: LE: add apache config/example [puppet] - 10https://gerrit.wikimedia.org/r/285440 (https://phabricator.wikimedia.org/T132812) [18:50:44] (03PS1) 10BBlack: ganglia: use LE cert [puppet] - 10https://gerrit.wikimedia.org/r/285441 (https://phabricator.wikimedia.org/T132812) [18:50:46] (03PS1) 10BBlack: ganglia: remove old cert absent line [puppet] - 10https://gerrit.wikimedia.org/r/285442 (https://phabricator.wikimedia.org/T132812) [18:54:32] (03PS2) 10Hashar: Group0 to 1.27.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285398 [18:56:38] jouncebot: [18:56:40] jouncebot: next [18:56:41] In 0 hour(s) and 3 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160426T1900) [18:56:46] (03PS4) 10Dzahn: Add cron job to synchronise AEAD files from primary auth server [puppet] - 10https://gerrit.wikimedia.org/r/285160 (owner: 10Muehlenhoff) [18:57:00] hashar: Just purged the 
only wmf.16 files. [18:57:11] any clue how Zerowiki ends up being in group0 ? [18:57:22] (03CR) 10jenkins-bot: [V: 04-1] ganglia: use LE cert [puppet] - 10https://gerrit.wikimedia.org/r/285441 (https://phabricator.wikimedia.org/T132812) (owner: 10BBlack) [18:57:24] (03CR) 10jenkins-bot: [V: 04-1] ganglia: remove old cert absent line [puppet] - 10https://gerrit.wikimedia.org/r/285442 (https://phabricator.wikimedia.org/T132812) (owner: 10BBlack) [18:57:30] hashar: Reasons that I can't remember offhand. [18:57:35] It was explicitly put in tho [18:57:36] kk [18:57:56] then I have /srv/mediawiki-staging/php pointing to php-1.27.0-wmf.21 [19:00:04] hashar ostriches: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160426T1900). [19:00:21] (03CR) 10Hashar: [C: 032] Group0 to 1.27.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285398 (owner: 10Hashar) [19:01:12] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/285160 (owner: 10Muehlenhoff) [19:01:36] (03CR) 10Dzahn: [C: 032] Add cron job to synchronise AEAD files from primary auth server [puppet] - 10https://gerrit.wikimedia.org/r/285160 (owner: 10Muehlenhoff) [19:02:42] (03Merged) 10jenkins-bot: Group0 to 1.27.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285398 (owner: 10Hashar) [19:02:49] (03PS2) 10BBlack: ganglia: remove old cert absent line [puppet] - 10https://gerrit.wikimedia.org/r/285442 (https://phabricator.wikimedia.org/T132812) [19:02:51] (03PS2) 10BBlack: ganglia: use LE cert [puppet] - 10https://gerrit.wikimedia.org/r/285441 (https://phabricator.wikimedia.org/T132812) [19:02:53] (03PS2) 10BBlack: LE: add apache config/example [puppet] - 10https://gerrit.wikimedia.org/r/285440 (https://phabricator.wikimedia.org/T132812) [19:04:51] !log hashar@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.27.0-wmf.22 [19:04:55] ostriches: that is really 
straightforward [19:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:05:19] (03CR) 10Dzahn: [C: 031] "times have changed, Letsencrypt just hit production and redirects should be possible again soon" [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder) [19:05:48] hashar: :) [19:05:57] And now I just keep an eye on the logstash for a bit :) [19:06:02] Make sure nothing jumps out [19:06:25] should I purge the l10n cache now ? [19:07:03] (03CR) 10Dzahn: "well, but i'm not sure if it should be symlinked like that, let's find out" [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder) [19:07:15] yeah [19:07:19] !log hashar@tin Purged l10n cache for 1.27.0-wmf.20 [19:07:22] 06Operations, 10Citoid, 06Services, 10VisualEditor: NIH db misbehaviour causing problems to Citoid - https://phabricator.wikimedia.org/T133696#2240837 (10Jdforrester-WMF) p:05Triage>03High [19:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:07:28] (03PS1) 10Eevans: update collector version [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/285445 (https://phabricator.wikimedia.org/T126629) [19:08:32] (03CR) 10Dzahn: "bblack, if you like we could use this as a test example for a redirect domain with LE, hasnt been used before but has been waiting to be a" [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder) [19:11:49] (03CR) 10Dzahn: "duh:) of course. 
confirmed on auth2001, the cron job has been created" [puppet] - 10https://gerrit.wikimedia.org/r/285160 (owner: 10Muehlenhoff) [19:12:14] (03PS3) 10BBlack: ganglia: remove old cert absent line [puppet] - 10https://gerrit.wikimedia.org/r/285442 (https://phabricator.wikimedia.org/T132812) [19:12:16] (03PS3) 10BBlack: ganglia: use LE cert [puppet] - 10https://gerrit.wikimedia.org/r/285441 (https://phabricator.wikimedia.org/T132812) [19:12:18] (03PS3) 10BBlack: LE: add apache config/example [puppet] - 10https://gerrit.wikimedia.org/r/285440 (https://phabricator.wikimedia.org/T132812) [19:13:35] (03CR) 10Dzahn: "also manuallay ran the command, works fine" [puppet] - 10https://gerrit.wikimedia.org/r/285160 (owner: 10Muehlenhoff) [19:14:04] (03CR) 10BBlack: "The redirects problem still needs some infrastructure/puppetization work, it's laid out in T133548" [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder) [19:14:29] !log restarting elasticsearch server elastic2018.codfw.wmnet - activating unicast (T110236) [19:14:30] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [19:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:18:37] ostriches: there is a bunch of log warnings about duplicate fetches from bags of stuff which is "normal" and part of hunting them https://phabricator.wikimedia.org/T128543 [19:19:13] ori: Krinkle: duplicate memcached gets are being logged now on group0 :-}  ( https://phabricator.wikimedia.org/T128543 ) [19:19:34] (03PS2) 10Dzahn: ganglia: add codfw/esams/ulsfo under ganglia_clusters/cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/285409 (owner: 10Faidon Liambotis) [19:20:32] (03CR) 10Dzahn: "yes, confirmed in compiler. 
this creates 2041.conf on install2001 and 4041.conf on bast4001" [puppet] - 10https://gerrit.wikimedia.org/r/285409 (owner: 10Faidon Liambotis) [19:21:48] (03CR) 10Dzahn: [C: 032] ganglia: add codfw/esams/ulsfo under ganglia_clusters/cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/285409 (owner: 10Faidon Liambotis) [19:23:27] 06Operations, 10MediaWiki-API, 07Availability, 07HHVM: HHVM is leaking memory on the API appservers - https://phabricator.wikimedia.org/T133674#2239113 (10Legoktm) > * APC is using a significant yet small amount of available memory Do we need to reduce the amount of things we're putting into APC? Or incre... [19:23:38] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [19:23:55] (03CR) 10Dzahn: "install2001: .. Aggregator::Instance[cache_misc_codfw]/Service[ganglia-monitor-aggregator@2041.service].." [puppet] - 10https://gerrit.wikimedia.org/r/285409 (owner: 10Faidon Liambotis) [19:25:20] paravoid: actually, is there a reason why https://www.wikipedia.org doesn't have a rel_canonical tag? [19:35:09] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2241038 (10Dzahn) >>! In T132521#2202415, @Chmarkine wrote: >>>! In T132521#2202254, @BBlack wro... 
[19:36:47] (03PS5) 10BBlack: cache_maps: add all sites in LVS [puppet] - 10https://gerrit.wikimedia.org/r/268238 (https://phabricator.wikimedia.org/T109162) [19:37:21] (03PS4) 10Eevans: [WIP]: Cassandra 2.2.5 config [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) [19:37:36] (03PS6) 10BBlack: cache_maps: add all sites in LVS [puppet] - 10https://gerrit.wikimedia.org/r/268238 (https://phabricator.wikimedia.org/T109162) [19:39:28] (03CR) 10Eevans: "I'm planning to cherry-pick this into deployment-prep at some point in the next couple of days, any help sanity checking in the meantime w" [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [19:45:24] (03PS1) 10Dzahn: RT: add letsencrypt::cert::integrated [puppet] - 10https://gerrit.wikimedia.org/r/285447 [19:45:27] (03CR) 10Gehel: [C: 032] cache_maps: add all sites in LVS [puppet] - 10https://gerrit.wikimedia.org/r/268238 (https://phabricator.wikimedia.org/T109162) (owner: 10BBlack) [19:46:52] (03CR) 10Gehel: [V: 032] cache_maps: add all sites in LVS [puppet] - 10https://gerrit.wikimedia.org/r/268238 (https://phabricator.wikimedia.org/T109162) (owner: 10BBlack) [19:52:57] !log restarting elasticsearch server elastic2019.codfw.wmnet - activating unicast (T110236) [19:52:57] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [19:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:54:37] !log restarting pybal on lvs4004 to enable new cache configuration for maps (T131880) [19:54:38] T131880: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880 [19:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:02:31] !log restarting pybal on lvs4002 to enable new cache configuration for maps (T131880) [20:02:32] T131880: 
codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880 [20:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:03:41] (03PS2) 10Dzahn: RT: add letsencrypt::cert::integrated [puppet] - 10https://gerrit.wikimedia.org/r/285447 [20:07:31] !log Killing Zuul entirely due to deadlock T128569 [20:07:32] T128569: Zuul deadlocks if unknown repo has activity in Gerrit - https://phabricator.wikimedia.org/T128569 [20:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:08:32] all changes pending in Zuul have been flushed ..... [20:08:55] 06Operations, 10Traffic, 13Patch-For-Review: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2241161 (10BBlack) [20:12:44] (03PS3) 10Mattflaschen: Beta Cluster: Use ExternalStore on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282440 (https://phabricator.wikimedia.org/T95871) [20:13:35] jynus, just wanted to ask you to look at https://gerrit.wikimedia.org/r/#/c/282440/ when you have a chance. 
[20:30:12] !log restarting pybal on lvs2005 to enable new cache configuration for maps (T131880) [20:30:13] T131880: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880 [20:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:32:51] !log restarting pybal on lvs2002 to enable new cache configuration for maps (T131880) [20:32:52] T131880: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880 [20:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:34:57] 07Puppet, 10Beta-Cluster-Infrastructure, 03Scap3: deployment-((sca|aqs)01|ores-web) puppet failures due to scap3 errors - https://phabricator.wikimedia.org/T132267#2241296 (10Krenair) I tried setting up the AQS deploy repository to fix aqs01 but it's missing .git/DEPLOY_HEAD? [20:35:56] !log restarting pybal on lvs3004 to enable new cache configuration for maps (T131880) [20:35:57] T131880: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880 [20:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:38:16] 07Blocked-on-Operations, 10OOjs-UI, 05Continuous-Integration-Scaling, 13Patch-For-Review, 07WorkType-NewFunctionality: Provide composer on the nodepool servers so OOjs UI can use it in the npm job - https://phabricator.wikimedia.org/T128092#2241317 (10hashar) @Jdforrester-WMF that is for Trusty but the N... 
[20:38:25] !log restarting pybal on lvs3002 to enable new cache configuration for maps (T131880) [20:38:26] T131880: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880 [20:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:41:17] (03CR) 10Dereckson: "This needs manual rebase after 8d4cf82c: we've switched to short array syntax to be coherent with MediaWiki core." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285391 (https://phabricator.wikimedia.org/T98307) (owner: 10Aude) [20:44:10] (03CR) 10Aude: "arghhh..... :(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285391 (https://phabricator.wikimedia.org/T98307) (owner: 10Aude) [20:46:12] aude: I can do it if you wish [20:46:24] Dereckson: go ahead [20:46:26] k [20:46:28] thanks [20:46:34] You're welcome. [20:47:26] i can do swat later (especially if it's only my patch or a few) [20:47:55] !log restarting elasticsearch server elastic2020.codfw.wmnet - activating unicast (T110236) [20:47:56] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [20:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:49:25] ah poor puppet [20:49:48] "Error: could not find a suitable provider for cron" [20:49:48] :D [20:54:29] (03PS6) 10Gehel: maps DNS 2/2: enable geodns routing [dns] - 10https://gerrit.wikimedia.org/r/268240 (https://phabricator.wikimedia.org/T109162) (owner: 10BBlack) [20:54:51] 07Puppet, 10Beta-Cluster-Infrastructure, 03Scap3: deployment-((sca|aqs)01|ores-web) puppet failures due to scap3 errors - https://phabricator.wikimedia.org/T132267#2241359 (10thcipriani) >>! In T132267#2241296, @Krenair wrote: > I tried setting up the AQS deploy repository to fix aqs01 but it's missing .git/... 
[20:56:05] (03PS3) 10Ori.livneh: stream.wm.o: rewrite / => /rcstream_status [puppet] - 10https://gerrit.wikimedia.org/r/284760 (https://phabricator.wikimedia.org/T132521) (owner: 10BBlack) [20:56:12] (03CR) 10Ori.livneh: [C: 032 V: 032] stream.wm.o: rewrite / => /rcstream_status [puppet] - 10https://gerrit.wikimedia.org/r/284760 (https://phabricator.wikimedia.org/T132521) (owner: 10BBlack) [20:59:05] (03PS4) 10Dereckson: Enable arbitrary access on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285391 (https://phabricator.wikimedia.org/T98307) (owner: 10Aude) [20:59:43] (03CR) 10Dereckson: "PS4: rebased" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285391 (https://phabricator.wikimedia.org/T98307) (owner: 10Aude) [20:59:53] Here you are. [21:00:13] thanks [21:01:06] (03PS7) 10Gehel: maps DNS 2/2: enable geodns routing [dns] - 10https://gerrit.wikimedia.org/r/268240 (https://phabricator.wikimedia.org/T109162) (owner: 10BBlack) [21:01:51] (03CR) 10BBlack: [C: 031] maps DNS 2/2: enable geodns routing [dns] - 10https://gerrit.wikimedia.org/r/268240 (https://phabricator.wikimedia.org/T109162) (owner: 10BBlack) [21:06:43] !log activating geodns for new varnish maps servers (T131880) [21:06:44] T131880: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880 [21:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:06:54] (03CR) 10Gehel: [C: 032] maps DNS 2/2: enable geodns routing [dns] - 10https://gerrit.wikimedia.org/r/268240 (https://phabricator.wikimedia.org/T109162) (owner: 10BBlack) [21:10:00] (03PS1) 1020after4: Fix multiple ssh::userkey resources per user [puppet] - 10https://gerrit.wikimedia.org/r/285519 [21:13:15] (03PS1) 10Papaul: DNS: Adding prod DNS for restbase200[7-9] Bug: T132976 [dns] - 10https://gerrit.wikimedia.org/r/285521 (https://phabricator.wikimedia.org/T132976) [21:13:53] 06Operations, 06Discovery, 10Kartotherian, 10Maps, and 2 others: 
Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#2241394 (10BBlack) 05Open>03Resolved [21:14:18] 06Operations, 06Discovery, 10Kartotherian, 10Maps, and 2 others: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#1542014 (10BBlack) [21:14:22] 06Operations, 06Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2241395 (10BBlack) 05Open>03Resolved [21:14:36] 06Operations, 06Discovery, 10Kartotherian, 10Maps, and 2 others: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#2241397 (10Gehel) Varnish Maps cluster now fully configured, some traffic can already be seen on https://grafana.wikimedia.org/dashboard/db/varnis... [21:15:32] * yurik gives a huge cookie to bblack [21:15:44] Gosh. Magic. :-) [21:16:20] * yurik gives another large cookie to gehel [21:16:26] (03PS2) 10Papaul: DNS: Adding prod DNS for restbase200[7-9] Bug: T132976 [dns] - 10https://gerrit.wikimedia.org/r/285521 (https://phabricator.wikimedia.org/T132976) [21:17:17] * gehel is looking in the mail box, no cookies... [21:17:53] gehel, it's a tracking cookie :) [21:17:57] gehel, why https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes ? [21:18:03] that's not a maps board [21:18:26] cache type = [text, maps, ...] [21:19:04] gehel, this graph hasn't changed though - https://grafana.wikimedia.org/dashboard/db/service-maps-varnish [21:19:09] yurik: it's a Varnish board, with a dropdown to make it a maps board... 
[21:19:38] have a look at the maps traffic for esams for example [21:20:19] yurik: https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?from=1461694798623&to=1461705598623&var-site=esams&var-cache_type=maps&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5 [21:21:20] gehel, awesome, i see it! [21:21:25] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2241404 (10Papaul) [21:22:00] yurik: that graph you linked hasn't changed because this was an outage-free transition :) [21:22:00] gehel, actually this graph also shows it - https://grafana.wikimedia.org/dashboard/db/service-maps-varnish [21:22:15] the sum of all incoming client requests is still what is it, they're just starting to use different servers [21:22:20] bblack, i realize it now. Well done! [21:22:28] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2215958 (10Papaul) [21:22:46] yurik: in any case, let me cut and paste from another channel here: [21:22:50] bblack, thanks for all the hard work, and sorry for being a pest :D [21:23:12] 21:17 < bblack> I don't think anyone's really made any kind of meta-task for something like "give maps service full 'production' status" or something like that [21:23:14] 21:17 < gehel> I haven't seen it... [21:23:17] 21:18 < bblack> which would've parented the stuff we just finished up, and also some task for configuring up the software and getting it running on the 2x4 karotherian servers currently being ordered in codfw+eqiad, which would parent their procurement tickets [21:23:21] 21:18 < bblack> and then there's some other little bits I'm sure. 
for one, actually monitoring the service with full alerting and all that [21:23:22] (03CR) 10Aaron Schulz: Set descriptionCacheExpiry for Commons repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283239 (owner: 10Aaron Schulz) [21:23:24] 21:19 < bblack> cache_maps doesn't have monitoring/paging currently, it's disabled because it's non-production and we don't want it waking us up [21:23:27] 21:19 < bblack> I'm sure there's some equivalent backend monitoring to turn on, too [21:23:30] 21:20 < bblack> someone should organize that up in phab so we have a more-concrete idea of when "maps is fully production-ified" [21:23:33] 21:21 < gehel> I'll create the task and ask around if [21:23:38] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2241407 (10Papaul) [21:23:44] (03PS2) 10Ori.livneh: Set descriptionCacheExpiry for Commons repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283239 (owner: 10Aaron Schulz) [21:23:50] (03CR) 10Ori.livneh: [C: 032] Set descriptionCacheExpiry for Commons repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283239 (owner: 10Aaron Schulz) [21:24:06] yurik: you might want to sync up with gehel on sorting out the sub-tasks there a bit, especially re: whatever setup/transition needs to take place on the new backend hw [21:24:19] (03Merged) 10jenkins-bot: Set descriptionCacheExpiry for Commons repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283239 (owner: 10Aaron Schulz) [21:24:44] bblack, yep, i will work with gehel to set it up. Thanks!!! [21:25:03] gehel, should i create an umbrella "switch maps to production status" task? [21:25:16] or have you already created one? [21:25:18] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2241409 (10ori) >>! 
In T96848#2231806, @BBlack wrote: > If you have time and want to do it (next week!), by all means go for it, I have lots else to keep me busy indefinitely :) My ba... [21:25:22] yurik: he'll probably do it, he probably has a better idea of the overall scope, he's just AFK at the moment [21:25:34] ok [21:25:52] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2241411 (10BBlack) Ok, np! [21:26:44] !log ori@tin Synchronized wmf-config/filebackend-production.php: Ia4434256c3: Set descriptionCacheExpiry for Commons repo (duration: 00m 35s) [21:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:27:02] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [21:27:33] bblack, do you know why ganglia is still showing all red for all maps caches? or is that ok? [21:27:35] https://ganglia.wikimedia.org/latest/?m=cpu_report&c=Maps+caches+ulsfo&h=&p=2&tab=m&vn=&hide-hf=false&hc=4&p=2 [21:27:39] https://ganglia.wikimedia.org/latest/?c=Maps%20caches%20ulsfo&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [21:27:53] it's ok [21:28:01] ganglia's often not in touch with reality :/ [21:29:37] it has something to do with machines moving between different cluster roles and aggregators not updating, or something [21:29:49] I hit things like this all the time, I've kinda given up wasting time on them :/ [21:30:17] I did a quick restart of the ganglia-monitor clients on all the cache_maps machines though. sometimes that helps. 
[21:31:15] fwiw, it has new clusters for "misc cache" in esams and ulsfo since like earlier today [21:31:50] heh [21:32:30] the misc cache started existing in esams ~ late November 2015 [21:32:41] https://ganglia.wikimedia.org/latest/?c=Maps%20caches%20ulsfo&h=cp4011.ulsfo.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [21:32:46] and ulsfo I guess, and codfw [21:33:04] 07Blocked-on-Operations, 10OOjs-UI, 05Continuous-Integration-Scaling, 13Patch-For-Review, 07WorkType-NewFunctionality: Provide composer on the nodepool servers so OOjs UI can use it in the npm job - https://phabricator.wikimedia.org/T128092#2241414 (10hashar) Still a WIP :) [21:33:35] bblack: yes, see the latest comments on https://gerrit.wikimedia.org/r/#/c/285409/ [21:34:16] checking the ulsfo aggregator host [21:34:20] ah [21:34:40] I do have cache_maps in there for all 4 though [21:35:15] 4051.conf: name = "Maps caches ulsfo" [21:35:20] ok, that exists [21:35:31] that's the aggregator config for that cluster [21:35:37] 06Operations, 06Discovery, 10Maps: Switch Maps to production status - https://phabricator.wikimedia.org/T133744#2241419 (10Gehel) [21:35:41] (03PS4) 10BBlack: LE: add apache config/example [puppet] - 10https://gerrit.wikimedia.org/r/285440 (https://phabricator.wikimedia.org/T132812) [21:35:48] (03CR) 10BBlack: [C: 032 V: 032] LE: add apache config/example [puppet] - 10https://gerrit.wikimedia.org/r/285440 (https://phabricator.wikimedia.org/T132812) (owner: 10BBlack) [21:36:21] 06Operations, 06Discovery, 10Kartotherian, 10Maps, and 2 others: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#2241433 (10Gehel) [21:36:23] 06Operations, 06Discovery, 10Maps: Switch Maps to production status - https://phabricator.wikimedia.org/T133744#2241432 (10Gehel) [21:36:58] 06Operations, 06Discovery, 10Maps: Switch Maps to production status - https://phabricator.wikimedia.org/T133744#2241419 (10Gehel) [21:37:00] 06Operations, 06Discovery, 10Maps, 
10hardware-requests: Maps back end hardware - https://phabricator.wikimedia.org/T131180#2241435 (10Gehel) [21:37:46] 06Operations, 06Discovery, 10Maps: Switch Maps to production status - https://phabricator.wikimedia.org/T133744#2241419 (10Gehel) [21:37:53] stream.wikimedia.org (rcstream), I can't reach from the outside world, but works fine inside [21:38:03] I thought it worked from the outside world when I was testing the other day... [21:38:16] also, no alerts on that? [21:38:18] (03PS2) 1020after4: Fix multiple ssh::userkey resources per user [puppet] - 10https://gerrit.wikimedia.org/r/285519 [21:38:26] maybe the change was bad, let me check [21:38:50] I can hit it fine from palladium, though [21:39:18] (03CR) 10jenkins-bot: [V: 04-1] Fix multiple ssh::userkey resources per user [puppet] - 10https://gerrit.wikimedia.org/r/285519 (owner: 1020after4) [21:39:54] !log restarting elasticsearch server elastic2021.codfw.wmnet - activating unicast (T110236) [21:39:55] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [21:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:40:10] * Immediate connect fail for 2620:0:861:ed1a::3:15: No route to host [21:40:40] (03PS3) 1020after4: Fix multiple ssh::userkey resources per user [puppet] - 10https://gerrit.wikimedia.org/r/285519 [21:40:55] yeah [21:41:39] maybe pybal depooled both backends [21:42:06] doesn't seem to have [21:42:24] TCP 208.80.154.249:80 sh -> 10.64.0.17:80 Route 10 4 11 -> 10.64.32.148:80 Route 10 0 14 [21:42:27] TCP 208.80.154.249:443 sh -> 10.64.0.17:443 Route 10 0 0 -> 10.64.32.148:443 Route 10 0 4 [21:42:36] they're there with weight 10 for both backends [21:43:16] (03PS4) 1020after4: Fix multiple ssh::userkey resources per user [puppet] - 10https://gerrit.wikimedia.org/r/285519 [21:43:29] something is off here, I just don't know what [21:43:56] (03CR) 1020after4: "This is tested on beta cluster." 
[puppet] - 10https://gerrit.wikimedia.org/r/285519 (owner: 1020after4) [21:44:45] (03PS5) 1020after4: Fix multiple ssh::userkey resources per user [puppet] - 10https://gerrit.wikimedia.org/r/285519 (https://phabricator.wikimedia.org/T132747) [21:45:06] bblack: I think I know what's wrong, it should work now [21:45:25] re: ganglia, i confirmed the aggregator config exists on bast4001, it's listening on that port, cp4019 can talk to it with netcat, it's not iptables. it works and receives data for another cluster, the "misc web caching cluster ulsfo", can see it with tcpdump.. just somehow doesnt get data from maps hosts [21:45:39] yes, it works [21:45:39] so what bblack said :p [21:45:40] (03PS3) 10Muehlenhoff: rcstream: Update source range [puppet] - 10https://gerrit.wikimedia.org/r/284161 [21:45:41] iptables [21:45:54] ? [21:46:21] (03CR) 10Ori.livneh: [C: 031] rcstream: Update source range [puppet] - 10https://gerrit.wikimedia.org/r/284161 (owner: 10Muehlenhoff) [21:46:47] moritzm: needs :443 too [21:46:53] yeah, the patch above wasn't merged yet. I had stopped ferm as an interim solution, but got re-enabled (probably by a change in some ferm definition) [21:47:01] (03PS1) 10Hashar: hhvm: log dir creation requires rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/285526 [21:47:10] right, let me fix that [21:47:26] thanks [21:48:50] (03PS4) 10Muehlenhoff: rcstream: Update source range [puppet] - 10https://gerrit.wikimedia.org/r/284161 [21:49:12] incoming jenkins -1 [21:49:22] because of the silly arrow-alignment lint check lol [21:49:37] stupid pplint... [21:49:55] (03CR) 10jenkins-bot: [V: 04-1] rcstream: Update source range [puppet] - 10https://gerrit.wikimedia.org/r/284161 (owner: 10Muehlenhoff) [21:50:14] (03PS5) 10Muehlenhoff: rcstream: Update source range [puppet] - 10https://gerrit.wikimedia.org/r/284161 [21:51:08] (03CR) 10Hashar: "That happens while building a fresh disk image for Nodepool and including contint::hhvm. 
rsyslog is not included and that explodes." [puppet] - 10https://gerrit.wikimedia.org/r/285526 (owner: 10Hashar) [21:51:48] (03CR) 10Muehlenhoff: [C: 032 V: 032] rcstream: Update source range [puppet] - 10https://gerrit.wikimedia.org/r/284161 (owner: 10Muehlenhoff) [21:51:59] the other thing odd here is lack of monitoring [21:52:05] I see in the manifest we just edited: check_ssl_http!stream.wikimedia.org', [21:52:16] yet searching for 'stream' in the icinga web UI turns up nothing useful [21:53:19] eh. let me try to find that [21:53:48] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=rcs1001&service=HTTPS [21:53:59] service HTTPS on host rcs1001 etc [21:54:11] includes cert expiry [21:54:15] SSL OK - Certificate stream.wikimedia.org valid until 2016-06-27 12:00:00 +0000 (expires in 61 days) [21:54:18] yeah and I guess unless we add it to catchpoint, we wouldn't have seen this with a real public IP check anyways [21:54:35] after all, it still worked from neon the whole time heh [21:55:04] yea, we can make it better in icinga [21:55:13] by adding a virtual host "stream.wikimedia.org" [21:55:16] and adding the checks on that [21:55:34] merged and I ran puppet von rcs100[12] [21:56:07] thanks! [21:57:17] thanks bblack / moritzm [22:00:35] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2241465 (10BBlack) >>! In T132521#2241038, @Dzahn wrote: >>>! In T132521#2202415, @Chmarkine wro... 
[22:03:01] (03PS1) 10Dzahn: icinga: add virtual host and ssl check for stream.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/285530 [22:03:19] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 06Labs, 10hardware-requests: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2241466 (10Deskana) [22:05:30] (03PS2) 10Dzahn: icinga: add virtual host and ssl check for stream.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/285530 [22:07:36] (03PS1) 10Jdlrobson: Enable lazy loaded references in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285531 (https://phabricator.wikimedia.org/T129693) [22:10:55] (03PS6) 1020after4: Fix multiple ssh::userkey resources per user [puppet] - 10https://gerrit.wikimedia.org/r/285519 (https://phabricator.wikimedia.org/T132747) [22:15:36] (03PS7) 1020after4: Fix multiple ssh::userkey resources per user [puppet] - 10https://gerrit.wikimedia.org/r/285519 (https://phabricator.wikimedia.org/T132747) [22:20:59] (03CR) 10Ori.livneh: [C: 031] "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/285530 (owner: 10Dzahn) [22:21:29] (03PS3) 10Dzahn: DNS: Adding prod DNS for restbase200[7-9] Bug: T132976 [dns] - 10https://gerrit.wikimedia.org/r/285521 (https://phabricator.wikimedia.org/T132976) (owner: 10Papaul) [22:23:57] (03CR) 10Dzahn: [C: 032] DNS: Adding prod DNS for restbase200[7-9] Bug: T132976 [dns] - 10https://gerrit.wikimedia.org/r/285521 (https://phabricator.wikimedia.org/T132976) (owner: 10Papaul) [22:23:59] (03PS2) 10Ori.livneh: hhvm: log dir creation requires rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/285526 (owner: 10Hashar) [22:25:06] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2241541 (10Krenair) [22:25:12] 07Puppet, 10Beta-Cluster-Infrastructure, 13Patch-For-Review, 03Scap3: deployment-((sca|aqs)01|ores-web) puppet failures due to scap3 errors - 
https://phabricator.wikimedia.org/T132267#2241539 (10Krenair) 05Open>03Resolved AQS is now fine. [22:28:51] (03PS1) 10Tim Starling: Remove obsolete ocwiki hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285538 [22:31:44] (03PS3) 10Mattflaschen: Change login cookies (for 'Remember me') to a one year expiry. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230954 (https://phabricator.wikimedia.org/T68699) [22:35:44] (03CR) 10jenkins-bot: [V: 04-1] Change login cookies (for 'Remember me') to a one year expiry. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230954 (https://phabricator.wikimedia.org/T68699) (owner: 10Mattflaschen) [22:43:51] PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100% [22:44:01] (03CR) 10Dzahn: [C: 032] icinga: add virtual host and ssl check for stream.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/285530 (owner: 10Dzahn) [22:44:54] icinga-wm: i disagree [22:46:01] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100% [22:51:31] RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 83.27 ms [22:52:22] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 84.25 ms [22:58:02] 06Operations, 10Citoid, 06Services, 10VisualEditor: NIH db misbehaviour causing problems to Citoid - https://phabricator.wikimedia.org/T133696#2241618 (10Mvolz) As I said in chat, we currently look up a pmid for every citation that has a doi. We could easily stop doing that and that would reduce the number... [23:00:04] RoanKattouw ostriches Krenair MaxSem Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160426T2300). [23:00:04] aude jdlrobson RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:12] I'll do it [23:00:15] ok [23:00:42] hmm, did i put mine on the wrong date? 
i have two patches [23:00:53] nope i got the right date, jouncebot just didn't update [23:02:35] ebernhardson: BTW that blank page on Special:Search bug reportedly also affects production [23:02:51] Or so I heard [23:03:02] maybe on test.wikipedia etc [23:03:16] or mediawiki.org [23:03:57] \o [23:04:18] jdlrobson: you unlock the achievement "introduce the first setting using from the start the short array syntax in the config". [23:04:19] (03Abandoned) 10Dzahn: RT: add letsencrypt::cert::integrated [puppet] - 10https://gerrit.wikimedia.org/r/285447 (owner: 10Dzahn) [23:04:50] (03CR) 10Catrope: [C: 032] Enable arbitrary access on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285391 (https://phabricator.wikimedia.org/T98307) (owner: 10Aude) [23:04:57] Dereckson: oh i did something good for a change? yay :) [23:05:13] RoanKattouw: oh? it shouldn't, the patch that caused it is https://gerrit.wikimedia.org/r/#/c/282217/ which was merged ~11am. checking the branch though [23:05:28] Aha [23:05:48] Yeah that patch says "included in master", not in a wmf branch [23:06:02] So I was probably wrong and it's not broken in prod [23:06:02] rvv? 
[23:06:38] (03PS5) 10Catrope: Enable arbitrary access on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285391 (https://phabricator.wikimedia.org/T98307) (owner: 10Aude) [23:06:57] Ugh who made mediawiki-config a fast-forward-only repo :( [23:07:23] (03CR) 10Catrope: [C: 032] Enable arbitrary access on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285391 (https://phabricator.wikimedia.org/T98307) (owner: 10Aude) [23:07:29] :/ [23:07:53] (03Merged) 10jenkins-bot: Enable arbitrary access on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285391 (https://phabricator.wikimedia.org/T98307) (owner: 10Aude) [23:07:55] It would be less annoying if Zuul didn't have this bug where rebased patches with a persisted +2 are ignored [23:09:12] it's a good idea to have the repo ff [23:09:15] !log catrope@tin Synchronized dblists/arbitraryaccess.dblist: Enable arbitrary access on Commons (duration: 00m 33s) [23:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:09:27] That allows to quickly revert any problematic change. [23:09:43] Fair enough [23:10:09] Hmm, does it though? IME FF-ness doesn't help/hurt with that especially [23:10:21] You can still easily end up in a situation where you have to revert B in order to be able to revert A [23:10:37] And frequently you just have to revert back to before A happened, meaning you revert A and everything after it [23:11:15] Okay. Yes, ff only help to allow to revert last change. [23:11:42] Yeah come to think of it, I did recently experience strange revert behavior in a non-FF repo [23:11:44] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable arbitrary access in user language on Commons and testwikidatawiki (duration: 00m 33s) [23:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:11:53] aude: That's both parts of yours done ---^^ [23:12:01] ok, testing... 
[23:12:29] (03CR) 10Catrope: [C: 032] Enable lazy loaded references in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285531 (https://phabricator.wikimedia.org/T129693) (owner: 10Jdlrobson) [23:12:31] git revert -m (1 or 2, take tbe bets! en.wikipedia is down! hurry up! 1 or 2?) [23:12:40] looks ok [23:12:42] Yeah that works [23:12:50] But I'm lazy and I use the revert button in Gerrit :) [23:12:55] That's the only advantage I can see for ff [23:13:01] (03PS2) 10Catrope: Enable lazy loaded references in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285531 (https://phabricator.wikimedia.org/T129693) (owner: 10Jdlrobson) [23:13:09] (03CR) 10Catrope: [C: 032] Enable lazy loaded references in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285531 (https://phabricator.wikimedia.org/T129693) (owner: 10Jdlrobson) [23:14:00] (03Merged) 10jenkins-bot: Enable lazy loaded references in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285531 (https://phabricator.wikimedia.org/T129693) (owner: 10Jdlrobson) [23:16:33] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable lazy-loaded references in mobile web beta on enwiki (duration: 00m 26s) [23:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:17:39] jdlrobson: ---^^ [23:19:18] (03PS1) 10Aude: Use wmgWikibaseAllowDataAccessInUserLanguage in Wikibase production settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285551 [23:19:28] RoanKattouw: i have a follow up ^ [23:19:47] (03CR) 10Catrope: [C: 032] Use wmgWikibaseAllowDataAccessInUserLanguage in Wikibase production settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285551 (owner: 10Aude) [23:19:52] hah, whoops [23:19:54] thanks [23:20:01] RoanKattouw: something is not right. May need to revert... currently investigating. 
[23:20:16] bblack: added https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=stream.wikimedia.org&service=HTTPS+stream.wikimedia.org [23:20:20] (03Merged) 10jenkins-bot: Use wmgWikibaseAllowDataAccessInUserLanguage in Wikibase production settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285551 (owner: 10Aude) [23:21:05] (03CR) 10Dzahn: "https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=stream.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/285530 (owner: 10Dzahn) [23:21:16] !log catrope@tin Synchronized php-1.27.0-wmf.21/extensions/WikimediaEvents: SWAT (duration: 00m 28s) [23:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:21:43] !log catrope@tin Synchronized php-1.27.0-wmf.22/extensions/WikimediaEvents: SWAT (duration: 00m 25s) [23:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:22:21] RoanKattouw: seems i've been completely confused by the train delay. The patch hasn't rolled out to Wikipedia on the train yet and won't until Thursday. [23:22:24] ebernhardson: That's your WikimediaEvents change ---^^ [23:22:32] So can either rollback or swat the change from MobileFrontend if you have time [23:22:46] jdlrobson: If the config doesn't hurt we can leave it in [23:22:51] (03PS3) 10Dzahn: RT: move role classes to autoloader layout, split labs [puppet] - 10https://gerrit.wikimedia.org/r/285310 [23:23:02] RoanKattouw: we should probably revert [23:23:07] i'll redo it on thursday. [23:23:09] OK, will do [23:23:12] (it does hurt unfortunately :/) [23:23:13] RoanKattouw: sweet, thanks. Will look in a few once the new js code starts going out [23:23:15] silly me [23:23:51] (03PS1) 10Catrope: Revert "Enable lazy loaded references in beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285552 [23:23:54] RoanKattouw: really sorry for wasting your time with that. 
[23:23:57] No worries [23:24:04] (03CR) 10Catrope: [C: 032] Revert "Enable lazy loaded references in beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285552 (owner: 10Catrope) [23:24:45] Oh FFS [23:24:53] ff-only evidently does NOT help with rebases [23:24:56] !log catrope@tin Synchronized wmf-config/Wikibase.php: Actually enable data access in user language for Commons and testwikidatawiki (duration: 00m 26s) [23:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:25:03] https://gerrit.wikimedia.org/r/#/c/285552/ refuses to merge because it needs a rebase :( [23:25:09] RoanKattouw: looks better :) [23:25:15] (03PS2) 10Catrope: Revert "Enable lazy loaded references in beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285552 [23:25:21] (03CR) 10Catrope: [C: 032] Revert "Enable lazy loaded references in beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285552 (owner: 10Catrope) [23:25:52] (03Merged) 10jenkins-bot: Revert "Enable lazy loaded references in beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285552 (owner: 10Catrope) [23:26:40] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Revert lazy-loaded references (duration: 00m 25s) [23:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:27:32] jdlrobson: Revert should be live now [23:28:49] (03PS4) 10Dzahn: RT: move role classes to autoloader layout, split labs [puppet] - 10https://gerrit.wikimedia.org/r/285310 [23:28:56] !log catrope@tin Synchronized php-1.27.0-wmf.22/extensions/Echo: SWAT (duration: 00m 31s) [23:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:29:08] RoanKattouw: confirmed fixed. 
Thank you [23:29:41] !log catrope@tin Synchronized php-1.27.0-wmf.22/extensions/Flow: SWAT (duration: 00m 45s) [23:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:30:23] (03PS1) 10Jdlrobson: Enable lazy loaded references in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285553 (https://phabricator.wikimedia.org/T129693) [23:30:36] (03CR) 10Jdlrobson: [C: 04-1] "Do not merge till Thursday" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285553 (https://phabricator.wikimedia.org/T129693) (owner: 10Jdlrobson) [23:30:40] (03CR) 10Dzahn: [C: 032] "no-op except class name / motd. checked that it's not used in labs with watroles. http://puppet-compiler.wmflabs.org/2567/" [puppet] - 10https://gerrit.wikimedia.org/r/285310 (owner: 10Dzahn) [23:31:46] oh come on, compiler said it's fine and then it's not [23:32:04] due to hiera something [23:33:30] well .. and now it is, some kind of delay between puppet and hiera update [23:37:27] (03PS1) 10EBernhardson: Increase curl pools on jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/285554 (https://phabricator.wikimedia.org/T133755) [23:42:53] (03CR) 10EBernhardson: "I ran puppet compiler against 2 job runner nodes and one normal mediawiki node, all 3 failed to compile: https://gerrit.wikimedia.org/r/28" [puppet] - 10https://gerrit.wikimedia.org/r/285554 (https://phabricator.wikimedia.org/T133755) (owner: 10EBernhardson) [23:43:21] (03CR) 10EBernhardson: "the puppet compiler link should have been: http://puppet-compiler.wmflabs.org/2568/" [puppet] - 10https://gerrit.wikimedia.org/r/285554 (https://phabricator.wikimedia.org/T133755) (owner: 10EBernhardson) [23:45:42] (03PS3) 10Dzahn: racktables: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285308 [23:47:10] (03CR) 10EBernhardson: "Puppet compiler errors were PEBKAC. 
Corrected puppet compiler output, looks to work as expected: http://puppet-compiler.wmflabs.org/2569/" [puppet] - 10https://gerrit.wikimedia.org/r/285554 (https://phabricator.wikimedia.org/T133755) (owner: 10EBernhardson) [23:48:15] (03CR) 10Dzahn: [C: 032] "no change except motd http://puppet-compiler.wmflabs.org/2570/krypton.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/285308 (owner: 10Dzahn)