[00:01:23] Operations, Release-Engineering-Team (Watching / External): rename mira to deploy2001 and reinstall with stretch - https://phabricator.wikimedia.org/T193916#4183539 (RobH) mira appears to be a ganeti instance at this point, not a physical machine. The 'mira' in racktables is a decommissioned host wmf5818.
[00:01:40] (CR) Dzahn: [C: 2] "no-op on terbium" [puppet] - https://gerrit.wikimedia.org/r/431047 (https://phabricator.wikimedia.org/T192092) (owner: Dzahn)
[00:01:54] PROBLEM - Check whether ferm is active by checking the default input chain on mw2193 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:13:35] PROBLEM - MD RAID on mw2191 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:21:03] Operations, Patch-For-Review: setup replacement for terbium (maintenance_server) on stretch - https://phabricator.wikimedia.org/T192092#4183555 (Dzahn) - mwmaint1001 is now up and running with stretch - the mediawiki_maintenance puppet role now supports PHP7/stretch -- (since https://gerrit.wikimedia.or...
[00:23:35] Operations, Release-Engineering-Team (Watching / External): rename mira to deploy2001 and reinstall with stretch - https://phabricator.wikimedia.org/T193916#4183562 (Dzahn) You are right, thank you.
It once was mira but correct is: **naos.codfw.wmnet**
[00:24:51] Operations, Release-Engineering-Team (Watching / External): rename naos to deploy2001 and reinstall with stretch - https://phabricator.wikimedia.org/T193916#4183564 (Dzahn)
[00:26:17] !log Purging values for navtiming 'mediaWikiLoadEnd' sub-properties from 2018-04-21 to 2018-05-04T04:40:00 (T160315)
[00:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:22] T160315: Remove mwLoadStart and track mwLoadEnd relative to fetchStart - https://phabricator.wikimedia.org/T160315
[00:35:46] !log Delete navtiming 'mediaWikiLoadComplete' metrics (T160315)
[00:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:50] T160315: Remove mwLoadStart and track mwLoadEnd relative to fetchStart - https://phabricator.wikimedia.org/T160315
[00:39:39] (PS1) Dzahn: mwmaint1001: ensure tendril crons are disabled [puppet] - https://gerrit.wikimedia.org/r/431054 (https://phabricator.wikimedia.org/T192092)
[00:48:02] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/11124/" [puppet] - https://gerrit.wikimedia.org/r/431054 (https://phabricator.wikimedia.org/T192092) (owner: Dzahn)
[00:48:31] (CR) Dzahn: [C: 2] "same for tendril crons: https://gerrit.wikimedia.org/r/#/c/431054/" [puppet] - https://gerrit.wikimedia.org/r/431047 (https://phabricator.wikimedia.org/T192092) (owner: Dzahn)
[00:52:02] 30 [2018-05-05 00:50:23.047] nc.c:189 run, rabbit run / dig that hole, forget the sun / and when at last the work is done / don't sit down / it's time to dig another one
[00:52:06] 31 [2018-05-05 00:50:23.057] nc_proxy.c:148 bind on p 43 to addr '/var/run/nutcracker/nutcracker.sock 0666' failed: No such file or directory
[00:52:19] lol, nice message in logs about the rabbit hole
[00:52:41] and followed by nutcracker fail ;) (stretch testing)
[00:55:33] Operations, Patch-For-Review: setup replacement for terbium (maintenance_server) on stretch - https://phabricator.wikimedia.org/T192092#4183661 (Dzahn) **why nutcracker fails:** 30 [2018-05-05 00:50:23.047] nc.c:189 run, rabbit run / dig that hole, forget the sun / and when at last the work is done / do...
[01:00:17] (CR) Faidon Liambotis: [C: 2] vim: don't use Stretch's default, infuriating mouse mode [puppet] - https://gerrit.wikimedia.org/r/430937 (owner: Andrew Bogott)
[01:10:45] Operations: provide proxysql for stretch, add package to puppet - https://phabricator.wikimedia.org/T193919#4183696 (Dzahn) p:Triage>High
[01:12:45] Operations: provide proxysql for stretch, add package to puppet - https://phabricator.wikimedia.org/T193919#4183711 (Dzahn) a:Dzahn>None
[01:29:04] (PS1) Dzahn: nutcracker: puppetize missing /var/run/nutcracker dir [puppet] - https://gerrit.wikimedia.org/r/431057 (https://phabricator.wikimedia.org/T192092)
[01:30:00] (PS2) Dzahn: nutcracker: puppetize missing /var/run/nutcracker dir [puppet] - https://gerrit.wikimedia.org/r/431057 (https://phabricator.wikimedia.org/T192092)
[01:33:53] (PS2) Dzahn: add mwmaint1001 to scap hosts [puppet] - https://gerrit.wikimedia.org/r/430521 (https://phabricator.wikimedia.org/T192092)
[01:34:50] (CR) Dzahn: [C: 2] add mwmaint1001 to scap hosts [puppet] - https://gerrit.wikimedia.org/r/430521 (https://phabricator.wikimedia.org/T192092) (owner: Dzahn)
[01:37:31] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2187.codfw.wmnet
[01:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:03:06] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2193.codfw.wmnet
[02:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:11:53] Operations, HHVM, Patch-For-Review, User-Elukey: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431#4183746 (Dzahn)
[02:12:26] Operations, HHVM, Patch-For-Review, User-Elukey: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431#3561778 (Dzahn) All (regular) codfw appservers are now on stretch.
[02:16:31] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2191.codfw.wmnet
[02:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:18:04] RECOVERY - mediawiki-installation DSH group on mw2191 is OK: OK
[02:42:30] RECOVERY - MegaRAID on labstore1003 is OK: OK: optimal, 5 logical, 34 physical
[03:25:52] (PS1) Ayounsi: Smokeping, remove Rigel [puppet] - https://gerrit.wikimedia.org/r/431066
[03:34:03] (PS2) Ayounsi: Smokeping, remove Rigel [puppet] - https://gerrit.wikimedia.org/r/431066
[03:35:20] (CR) Ayounsi: [C: 2] Smokeping, remove Rigel [puppet] - https://gerrit.wikimedia.org/r/431066 (owner: Ayounsi)
[03:36:12] !log commenting rigel out from smokeping
[03:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:54] (CR) Giuseppe Lavagetto: [C: -2] "The nutcracker dir gets created on boot via the systemd's tmpfiles. Our current systemd::tmpfiles puppet resource doesn't support creatin" [puppet] - https://gerrit.wikimedia.org/r/431057 (https://phabricator.wikimedia.org/T192092) (owner: Dzahn)
[08:21:21] Operations, ops-eqiad, Traffic, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4183902 (Vgutierrez) Right, from ethtool: ``` root@lvs1016:~# ethtool -i enp5s0f0 | grep firmware firmware-version: bc 7.14.10 root@lvs1016:~# ethtool -i enp4s0f0 |grep fir...
[08:46:15] Operations, ops-eqiad, Traffic, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4183923 (Vgutierrez) Checking kernel logs we've something weird going on with bnx2x fws across LVS servers: ```root@lvs1016:~# dmesg |grep direct-loading [ 15.744692] bnx...
[08:58:50] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[09:25:52] Operations, ops-eqiad, Traffic, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4183939 (Vgutierrez) oh, and it looks like it isn't consistent across reboots: ``` root@lvs1016:/var/log# grep direct-loading kern.log May 4 14:34:11 lvs1016 kernel: [...
[10:32:11] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[10:41:04] Operations, ops-eqiad, Traffic, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4183998 (BBlack) I don't think the firmware versions you see there in ethtool and/or dmesg are the whole story anyways. The ones visible from bios-level setup have like 6...
[10:53:23] Operations, ops-eqiad, Traffic, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4184003 (Vgutierrez) I've seen the VPD access failed error in other boxes as well, currently: ``` ===== NODE GROUP ===== (1) lvs2005.codfw.wmnet ----- OUTPUT of 'dmesg | gr...
[10:59:44] Operations, ops-eqiad, Traffic, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4184004 (BBlack) Yeah, lvs200x are all HPs as well, so they're "different" in many respects for better or worse, and getting replaced this quarter with something more-like...
[11:05:40] Operations, ops-eqiad, Traffic, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4184005 (Vgutierrez) right, so let's @Cmjohnson update firmware on lvs1016 NICs and I'll check MSI-X status after that :)
[14:19:31] PROBLEM - HHVM rendering on mw2216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:20:30] RECOVERY - HHVM rendering on mw2216 is OK: HTTP OK: HTTP/1.1 200 OK - 73355 bytes in 0.395 second response time
[17:08:02] PROBLEM - ensure kvm processes are running on labvirt1015 is CRITICAL: PROCS CRITICAL: 96 processes with regex args /usr/bin/kvm
[17:09:12] RECOVERY - ensure kvm processes are running on labvirt1015 is OK: PROCS OK: 95 processes with regex args /usr/bin/kvm
[18:16:55] (PS1) Andrew Bogott: nova: depool labvirt1015 [puppet] - https://gerrit.wikimedia.org/r/431106
[18:18:30] (CR) Andrew Bogott: [C: 2] nova: depool labvirt1015 [puppet] - https://gerrit.wikimedia.org/r/431106 (owner: Andrew Bogott)
[19:23:19] (PS1) Zhuyifei1999: maintain-kubeusers.systemd: get the project name from @labsproject [puppet] - https://gerrit.wikimedia.org/r/431110 (https://phabricator.wikimedia.org/T190893)
[19:58:30] (CR) Zhuyifei1999: maintain-kubeusers.systemd: get the project name from @labsproject (1 comment) [puppet] - https://gerrit.wikimedia.org/r/431110 (https://phabricator.wikimedia.org/T190893) (owner: Zhuyifei1999)
[20:25:14] (PS2) Zhuyifei1999: maintain-kubeusers.systemd: get the project name from @labsproject [puppet] - https://gerrit.wikimedia.org/r/431110 (https://phabricator.wikimedia.org/T190893)
[20:38:48] (CR) Zhuyifei1999: "Can I go ahead and merge this to proceed with the task?" [software/tools-webservice] - https://gerrit.wikimedia.org/r/430647 (https://phabricator.wikimedia.org/T190893) (owner: Zhuyifei1999)
[20:44:20] Operations, Commons, Wikimedia-SVG-rendering, media-storage, Patch-For-Review: Install Noto fonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T184664#3891777 (Verdy_p) So you just installed only a subset of Noto fonts for just a few *scripts* but not even all the...
[21:03:52] Operations, Commons, Wikimedia-SVG-rendering, media-storage, Patch-For-Review: Install Noto fonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T184664#4184457 (kaldari) @Verdy_p: See T184664#3892719. (Only `fonts-noto` is packaged for jessie, so we have to wait for...
[23:07:54] Operations, MediaWiki-Platform-Team, HHVM, TechCom-RFC (TechCom-Approved), User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#3625029 (Verdy_p) Note that PHP is still only marginally faster than HHVM, only starting since PHP 7.2. HHVM is still fast...