[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170119T0000).
[00:00:27] I might have a patch to deploy in a sec
[00:00:52] Operations, Electron-PDFs, TCB-Team, Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to dewiki - https://phabricator.wikimedia.org/T150942#2951090 (Addshore) a:Addshore
[00:00:55] Operations, Electron-PDFs, TCB-Team, Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to metawiki - https://phabricator.wikimedia.org/T150943#2951091 (Addshore) a:Addshore
[00:16:13] (PS1) Addshore: WIP Enable InterwikiSorting on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/332917
[00:16:49] (CR) Dzahn: [C: 1] Add missing comment for some Ganeti instances [dns] - https://gerrit.wikimedia.org/r/332710 (owner: Volans)
[00:17:03] (CR) Addshore: [C: -1] "#notyet I'll deploy this one" [mediawiki-config] - https://gerrit.wikimedia.org/r/332908 (owner: Addshore)
[00:17:05] (CR) Addshore: [C: -1] "#notyet I'll deploy this one" [mediawiki-config] - https://gerrit.wikimedia.org/r/332909 (owner: Addshore)
[00:17:08] (CR) Addshore: [C: -1] "#notyet I'll deploy this one" [mediawiki-config] - https://gerrit.wikimedia.org/r/332910 (owner: Addshore)
[00:17:11] (CR) Addshore: "#notyet I'll deploy this one" [mediawiki-config] - https://gerrit.wikimedia.org/r/332911 (owner: Addshore)
[00:17:34] (CR) Addshore: [C: -1] "#notyet I'll deploy this one" [mediawiki-config] - https://gerrit.wikimedia.org/r/332911 (owner: Addshore)
[00:24:22] !log maxsem@tin Synchronized php-1.29.0-wmf.8/extensions/Graph: SWAT https://gerrit.wikimedia.org/r/#/c/332916/1 (duration: 00m 40s)
[00:24:26] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:53] PROBLEM - puppet last run on ms-be1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:42:42] jouncebot nwo
[00:42:44] jouncebot now
[00:46:17] Operations, Patch-For-Review: Remote IPMI doesn't work for ~17% of the fleet - https://phabricator.wikimedia.org/T150160#2951190 (Volans) @Dzahn I've fixed all the ones that were fixable with this method. From a full run across the whole fleet I found some that are still failing IPMI and need more invest...
[00:56:13] PROBLEM - puppet last run on ms-be1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:56:30] Operations, Traffic, Wikipedia-iOS-App-Backlog, iOS-app-feature-Links: Fix universal link support in iOS when the OS requests the site association file from m.wikipedia.org - https://phabricator.wikimedia.org/T155504#2945413 (ema) We do have [[https://github.com/wikimedia/operations-puppet/blob/p...
[01:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170119T0100).
[01:01:23] PROBLEM - Host mw2098 is DOWN: PING CRITICAL - Packet loss = 100%
[01:01:53] RECOVERY - puppet last run on ms-be1013 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[01:02:18] Operations, ops-codfw: mw2098 drac offline - system unreachable - https://phabricator.wikimedia.org/T155688#2951211 (RobH)
[01:03:17] ACKNOWLEDGEMENT - Host mw2098 is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T155688 drac offline
[01:14:03] !log temp disable puppet on install1001 for papaul debugging
[01:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:16:41] !log switching mw2251 to trusty-installer for test
[01:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:17:37] Operations, ops-codfw: ms-be2002.codfw.wmnet has drac issues - https://phabricator.wikimedia.org/T155689#2951231 (RobH)
[01:21:04] Operations, ops-codfw: troubleshoot drac on ms-be2010.codfw.wmnet - https://phabricator.wikimedia.org/T155690#2951246 (RobH)
[01:23:53] !log install1001 - re-enable puppet - install2001 - same thing, temp disable and live-hack mw2251 to use trusty installer
[01:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:13] RECOVERY - puppet last run on ms-be1002 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[01:27:15] mutante: oh hai!
[01:27:39] Operations, Traffic: Evaluate Apache Traffic Server - https://phabricator.wikimedia.org/T96853#2951262 (ema) I've started adding some notes about ATS to wikitech: https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server.
[01:27:57] urandom: hello, debugging something else but i can totally merge that one if you want to bootstrap now
[01:28:00] Operations, ops-eqiad: es1019.eqiad.wmnet drac unresponsive - https://phabricator.wikimedia.org/T155691#2951263 (RobH)
[01:28:11] (CR) Eevans: [C: 1] Enable remaining restbase-dev* instances [puppet] - https://gerrit.wikimedia.org/r/332876 (https://phabricator.wikimedia.org/T153880) (owner: Eevans)
[01:28:25] mutante: ok, volans offered as well
[01:28:51] mutante: i think volans is going to get it
[01:29:20] mutante: I'm merging it, then if there is something to fix later maybe urandom will ping you ;)
[01:29:37] (CR) Volans: [C: 2] Enable remaining restbase-dev* instances [puppet] - https://gerrit.wikimedia.org/r/332876 (https://phabricator.wikimedia.org/T153880) (owner: Eevans)
[01:29:46] volans: everything will work perfectly the first time
[01:29:51] volans: it always does.
[01:30:08] TM
[01:30:27] cool! :) ok!
[01:30:46] merged, urandom it's all yours :)
[01:30:58] volans: awesome; thank you sir!
[01:34:37] Operations, ops-eqiad: ocg1001.eqiad.wmnet ipmi error - https://phabricator.wikimedia.org/T155692#2951278 (RobH)
[01:34:53] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[01:35:02] Operations, ops-eqiad: ocg1001.eqiad.wmnet ipmi error - https://phabricator.wikimedia.org/T155692#2951278 (RobH) Please note DRAC is responsive, but I didn't test its power capabilities since it would change the power state of the system.
[01:38:23] (PS1) Dzahn: install: make mw2251 use trusty installer (debug) [puppet] - https://gerrit.wikimedia.org/r/332930
[01:39:10] Operations, Patch-For-Review: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#2951295 (Dzahn)
[01:39:45] volans: thanks for IPMI "17% -> 2%" ^ :))
[01:39:49] volans: told ya; perfectly first time
[01:39:56] * urandom ducks
[01:40:17] (PS2) Dzahn: install: make mw2251 use trusty installer (debug) [puppet] - https://gerrit.wikimedia.org/r/332930
[01:40:30] urandom: lol!
[01:40:47] mutante: yeah got some progress, we have now just weird cases ;)
[01:41:21] (CR) Dzahn: [C: 2] install: make mw2251 use trusty installer (debug) [puppet] - https://gerrit.wikimedia.org/r/332930 (owner: Dzahn)
[01:41:24] (CR) Papaul: [C: 2] install: make mw2251 use trusty installer (debug) [puppet] - https://gerrit.wikimedia.org/r/332930 (owner: Dzahn)
[01:41:29] sorry i cannot resist making this joke, if we use trusty why cant we trust the servers for more than a minute
[01:42:35] we only use it for a minute to test if install issues are a (jessie) installer bug
[01:43:00] mutante god do you always have to ruin my jokes xD
[01:47:53] !log install1001/2001 - re-enabled, carbon is still DHCP for some rows
[01:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:50:49] !log install - if in private1-c-eqiad, private1-b-codfw, you are using install1001 for both DHCP and TFTP, if in other networks you still use carbon as DHCP but then also install1001 as TFTP
[01:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:09] (CR) Jforrester: "This is a beta feature, right? You'll need to add it to the beta feature whitelist…" [mediawiki-config] - https://gerrit.wikimedia.org/r/332908 (owner: Addshore)
[01:51:46] mutante what purpose does mw2251 have?
i cant seem to find it on wikitech
[01:52:19] Zppix: new appserver that has not been installed yet, problems with install
[01:52:22] (CR) Addshore: [C: -1] "https://gerrit.wikimedia.org/r/#/c/332904/" [mediawiki-config] - https://gerrit.wikimedia.org/r/332908 (owner: Addshore)
[01:53:14] mutante gotta love servers...
[01:55:13] (CR) Jforrester: "Things to fix first:" [mediawiki-config] - https://gerrit.wikimedia.org/r/332904 (https://phabricator.wikimedia.org/T150184) (owner: Addshore)
[01:55:23] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[01:55:30] (CR) Jforrester: "Ta, found it. :-)" [mediawiki-config] - https://gerrit.wikimedia.org/r/332908 (owner: Addshore)
[01:58:36] Operations, Traffic, Wikipedia-iOS-App-Backlog, iOS-app-feature-Links: Fix universal link support in iOS when the OS requests the site association file from m.wikipedia.org - https://phabricator.wikimedia.org/T155504#2951305 (Fjalapeno)
[01:59:24] be back later
[01:59:27] Operations, Traffic, Wikipedia-iOS-App-Backlog, iOS-app-feature-Links: Fix universal link support in iOS when the OS requests the site association file from m.wikipedia.org - https://phabricator.wikimedia.org/T155504#2945413 (Fjalapeno) @ema we only need to cover *.wikipedia.org - does that make...
[02:00:18] Operations, ops-codfw: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#2951307 (Papaul) During installation at the partition disks step I can the following error: ┌───────────────────────┤ [!!] Partition disks ├────────────────────────┐ │...
[02:02:47] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[02:15:30] Operations, ops-codfw, Patch-For-Review: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2951317 (RobH)
[02:15:33] Operations, ops-codfw, hardware-requests: codfw old mw app server decomission - https://phabricator.wikimedia.org/T135468#2951316 (RobH) Open>Resolved
[02:16:31] Operations, ops-codfw: rack/setup/install mw2251-mw2260 - https://phabricator.wikimedia.org/T152698#2951318 (RobH) a:RobH>Papaul
[02:17:06] Operations, ops-codfw: rack/setup/install mw2251-mw2260 - https://phabricator.wikimedia.org/T152698#2857513 (RobH)
[02:17:08] Operations, ops-codfw: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#2951324 (RobH)
[02:18:02] Operations, ops-codfw: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#2936454 (RobH)
[02:18:31] Operations, Monitoring: Switch Icinga from smsglobal - https://phabricator.wikimedia.org/T106589#2951341 (RobH)
[02:22:35] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[02:26:37] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:28:16] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.7) (duration: 07m 28s)
[02:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:40:35] just had a blip to phabRequest from 198.73.209.5 via cp4004 cp4004, Varnish XID 66871034
[02:40:36] Error: 503, Backend fetch failed at Thu, 19 Jan 2017 02:40:17 GMT
[02:55:37] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[03:11:35] (CR) Alex Monk: Keystone hooks: Set up default security groups for new projects.
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/332899 (https://phabricator.wikimedia.org/T136871) (owner: Andrew Bogott)
[03:15:37] PROBLEM - puppet last run on lvs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:18:16] (CR) Alex Monk: "I've cherry-picked this on deployment-puppetmaster02 and applied on -fluorine02, seems to be working. Will check on it again tomorrow." [puppet] - https://gerrit.wikimedia.org/r/313604 (https://phabricator.wikimedia.org/T123728) (owner: Filippo Giunchedi)
[03:27:36] (Abandoned) MaxSem: WIP: puppetize grants, kill grants.sql [puppet] - https://gerrit.wikimedia.org/r/299033 (owner: MaxSem)
[03:31:56] (CR) Alex Monk: "ugh, 'role' is used everywhere in this code... but let's give it a go" [puppet] - https://gerrit.wikimedia.org/r/332781 (owner: Andrew Bogott)
[03:32:19] (CR) Alex Monk: [C: 1] Horizon puppettab: display profiles as well as roles [puppet] - https://gerrit.wikimedia.org/r/332781 (owner: Andrew Bogott)
[03:42:32] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2951484 (Krenair)
[03:42:36] Puppet, Beta-Cluster-Infrastructure: deployment-eventlogging03 has puppet failure due to missing class - https://phabricator.wikimedia.org/T152842#2951482 (Krenair) Open>Resolved a:Ottomata
[03:44:37] RECOVERY - puppet last run on lvs1002 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[03:58:17] PROBLEM - puppet last run on conf1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:11:00] Operations, Puppet, Beta-Cluster-Infrastructure: Make deployment-prep puppetmaster more similar to Production puppetmaster - https://phabricator.wikimedia.org/T146627#2666629 (Krenair) #1 was done.
[04:20:38] (CR) Alex Monk: Consolidate database lists list in one place (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/331825 (owner: Dereckson)
[04:21:37] (PS1) Alex Monk: Read closed-labs as closed tag on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/332940 (https://phabricator.wikimedia.org/T115584)
[04:21:53] (CR) jerkins-bot: [V: -1] Read closed-labs as closed tag on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/332940 (https://phabricator.wikimedia.org/T115584) (owner: Alex Monk)
[04:27:17] RECOVERY - puppet last run on conf1001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[04:29:42] (PS3) Alex Monk: Consolidate database lists list in one place [mediawiki-config] - https://gerrit.wikimedia.org/r/331825 (owner: Dereckson)
[04:29:44] (PS2) Alex Monk: Read closed-labs as closed tag on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/332940 (https://phabricator.wikimedia.org/T115584)
[04:30:13] (CR) Alex Monk: [C: -1] "PS3 is a rebase, see my PS2 comment" [mediawiki-config] - https://gerrit.wikimedia.org/r/331825 (owner: Dereckson)
[04:30:50] (CR) jerkins-bot: [V: -1] Consolidate database lists list in one place [mediawiki-config] - https://gerrit.wikimedia.org/r/331825 (owner: Dereckson)
[04:31:16] (CR) jerkins-bot: [V: -1] Read closed-labs as closed tag on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/332940 (https://phabricator.wikimedia.org/T115584) (owner: Alex Monk)
[04:32:26] huh
[04:42:57] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:11:57] RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[05:16:47] PROBLEM - puppet last run on bast3001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[05:30:07] (CR) Andrew Bogott: Keystone hooks: Set up default security groups for new projects. (2 comments) [puppet] - https://gerrit.wikimedia.org/r/332899 (https://phabricator.wikimedia.org/T136871) (owner: Andrew Bogott)
[05:46:47] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:26:52] Operations, ops-codfw: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#2936454 (Dzahn) Yup, mw2251 switched to trusty, first manual then in puppet (https://gerrit.wikimedia.org/r/#/c/332930/), issue only happens in jessie, not trusty.
[06:32:27] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:34:03] (CR) Legoktm: Rewrite wmf-beta-autoupdate as a scap3 plugin (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/325875 (https://phabricator.wikimedia.org/T151519) (owner: Chad)
[06:36:27] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:52:37] PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:55:27] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:01:17] (PS1) Dzahn: openstack: fix lint warnings in service.pp [puppet] - https://gerrit.wikimedia.org/r/332953
[07:01:19] (PS1) Dzahn: openstack: instancersync not in autoload module layout [puppet] - https://gerrit.wikimedia.org/r/332954
[07:01:21] (PS1) Dzahn: openstack: designate/glance/keystone not in autoload module [puppet] - https://gerrit.wikimedia.org/r/332955
[07:01:23] (PS1) Dzahn:
kartotherian: optional parameter listed before required [puppet] - https://gerrit.wikimedia.org/r/332956
[07:01:25] (PS1) Dzahn: proxysql: optional parameter before required parameter [puppet] - https://gerrit.wikimedia.org/r/332957
[07:01:27] (PS1) Dzahn: labspuppetbackend: optional parameter before required [puppet] - https://gerrit.wikimedia.org/r/332958
[07:01:29] (PS1) Dzahn: interface: rps::modparams, aggregate_member not in autoload layout [puppet] - https://gerrit.wikimedia.org/r/332959
[07:14:57] !log Compressing enwikivoyage.text and shwiki.logging tables on db1038 - T154465
[07:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:01] T154465: Defragment db1038 - https://phabricator.wikimedia.org/T154465
[07:18:16] !log Compressing enwikivoyage.text and shwiki.logging tables on db1044 - T153826
[07:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:20] T153826: Defragment db1044 - https://phabricator.wikimedia.org/T153826
[07:20:56] (PS1) Dereckson: Fix throttle rule for KCES IMR edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/332960
[07:21:50] Hi.
There is a need for an emergency throttle rule fix for an ongoing event ^
[07:22:38] (CR) Dereckson: [C: 2] "Emergency fix" [mediawiki-config] - https://gerrit.wikimedia.org/r/332960 (owner: Dereckson)
[07:23:58] (Merged) jenkins-bot: Fix throttle rule for KCES IMR edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/332960 (owner: Dereckson)
[07:24:12] (CR) jenkins-bot: Fix throttle rule for KCES IMR edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/332960 (owner: Dereckson)
[07:24:59] Works on mwdebug1002, I'm going to push it live
[07:26:31] sync-apaches: 99% (ok: 296; fail: 0; left: 1)
[07:28:03] !log dereckson@tin Synchronized wmf-config/throttle.php: Fix throttle rule for KCES IMR edit-a-thon (duration: 02m 42s)
[07:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:15] ssh: connect to host mw2098.codfw.wmnet port 22: Connection timed out
[07:28:44] (the server not synced by scap sync-file)
[07:33:22] Actually, according Icinga, mw2098 has been down for 6 hours 31 minutes
[07:33:44] ah yes 01:03:17 < icinga-wm> ACKNOWLEDGEMENT - Host mw2098 is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T155688 drac offline
[07:33:44] T155688: mw2098 drac offline - system unreachable - https://phabricator.wikimedia.org/T155688
[07:37:46] (CR) Dereckson: "Fix for T154312" [mediawiki-config] - https://gerrit.wikimedia.org/r/332960 (owner: Dereckson)
[07:42:22] !log volans@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw2098.wmnet
[07:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:45] !log volans@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw2098.codfw.wmnet
[07:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:53] Dereckson: this should fix it ^^^
[07:46:49] I've also run puppet on tin
[07:49:50] ok
[08:06:47] RECOVERY - puppet last run on tin is
OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[08:16:27] Operations, ops-codfw: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#2951853 (MoritzMuehlenhoff) That happens after jessie point releases (and there was one last weekend). Until a while ago the Squid cache needed to be purged manually, but I commited some config tweaks a few...
[08:18:15] Operations, Operations-Software-Development: confctl: log to SAL even if the selection doesn't match any host - https://phabricator.wikimedia.org/T155705#2951854 (Volans)
[08:25:34] (PS2) Marostegui: m1,m2,m3,m4.hosts: Add new host files [software] - https://gerrit.wikimedia.org/r/332747
[08:34:50] (PS1) Marostegui: eventlogging: Enable gtid_domainid on eventlogging [puppet] - https://gerrit.wikimedia.org/r/332965 (https://phabricator.wikimedia.org/T149418)
[08:41:16] (CR) Marostegui: "Compiles fine and only changes db1046, which is the only host for eventlogging : https://puppet-compiler.wmflabs.org/5157/" [puppet] - https://gerrit.wikimedia.org/r/332965 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui)
[08:49:11] (PS1) Dereckson: Fix Portal talk namespace name on Sanskrit Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/332967
[08:49:27] RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK
[08:51:10] (PS2) Dereckson: Fix Portal talk namespace name on Sanskrit Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/332967 (https://phabricator.wikimedia.org/T101634)
[08:53:19] (PS8) Filippo Giunchedi: udp2log: move to service_unit and systemd [puppet] - https://gerrit.wikimedia.org/r/313604 (https://phabricator.wikimedia.org/T123728)
[08:55:30] (CR) Filippo Giunchedi: [C: 2] udp2log: move to service_unit and systemd [puppet] - https://gerrit.wikimedia.org/r/313604 (https://phabricator.wikimedia.org/T123728) (owner: Filippo Giunchedi)
[08:57:33] !log bounce udp2log
on fluorine after https://gerrit.wikimedia.org/r/313604
[08:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:49] (CR) Filippo Giunchedi: [C: 1] Prometheus JMX exporter deploy repository [software/prometheus_jmx_exporter] - https://gerrit.wikimedia.org/r/332542 (https://phabricator.wikimedia.org/T155120) (owner: Eevans)
[09:04:35] (PS6) Elukey: kafka: fix Unrecognized escape sequence '\.' [puppet] - https://gerrit.wikimedia.org/r/331451 (owner: Hashar)
[09:05:39] (CR) TTO: Fix Portal talk namespace name on Sanskrit Wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/332967 (https://phabricator.wikimedia.org/T101634) (owner: Dereckson)
[09:09:30] (CR) Filippo Giunchedi: "LGTM modulo the scap::dsh::groups comment" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/332535 (https://phabricator.wikimedia.org/T155120) (owner: Eevans)
[09:10:31] (PS6) Giuseppe Lavagetto: Generalize entities definitions [software/conftool] - https://gerrit.wikimedia.org/r/288609
[09:10:37] RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK
[09:11:23] elukey: good morning :)
[09:11:41] elukey: looks like Andrew Otto likes the fixes ;}
[09:11:53] again, I am not sure what kind of magic puppet might end up doing when applying the change
[09:12:07] maybe that will cause a refresh/restart of some kafka service :(
[09:12:57] I am running pcc, I don't think that it will cause problems.. I'll also disable puppet and merge one at the time :)
[09:13:10] https://puppet-compiler.wmflabs.org/5159/ looks good
[09:13:48] \O/
[09:20:12] Operations, Parsoid, User-Joe, User-mobrovac: Parsoid timing out or failing when trying to parse specific user page - https://phabricator.wikimedia.org/T155618#2951910 (elukey) >>! In T155618#2949470, @mobrovac wrote: > > I will blacklist this specific title in RESTBase for now. Marco is it som...
[09:20:20] Operations, ops-codfw: ms-be2002.codfw.wmnet has drac issues - https://phabricator.wikimedia.org/T155689#2951922 (fgiunchedi) Host can be taken down at any time with a clean `shutdown` to make sure all services are stopped
[09:20:30] Operations, ops-codfw: troubleshoot drac on ms-be2010.codfw.wmnet - https://phabricator.wikimedia.org/T155690#2951246 (fgiunchedi) Host can be taken down at any time with a clean `shutdown` to make sure all services are stopped
[09:23:50] hashar: I am still puzzled why this thing is needed
[09:24:16] got some draft notes to send to ops list to give all the context
[09:24:33] but in short the idea is "puppet parser validate" emits a a warning on such issue
[09:24:39] which causes it to exit 1
[09:24:42] Operations, Parsoid, User-Joe, User-mobrovac: Parsoid timing out or failing when trying to parse specific user page - https://phabricator.wikimedia.org/T155618#2951927 (Joe) @elukey apparently this needs a code deploy (!!) which means accepting a pull request on github(!!!) where not everyone fro...
[09:25:08] <_joe_> hashar: which change are you talking about?
[09:25:26] _joe_ https://gerrit.wikimedia.org/r/331451
[09:25:53] _joe_: kafka: fix Unrecognized escape sequence '\.' [puppet] - https://gerrit.wikimedia.org/r/331451
[09:26:54] <_joe_> hashar: I'm not sure what that change would fix
[09:27:15] <_joe_> seems like \\. would've done the trick
[09:28:46] hashar: can we try with the \\. ?
[09:30:31] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2951958 (Marostegui) Hi, I have been doing some tests with the gerritdb to sum up all the stuff that have been discussed here...
[09:31:48] Operations, Wikimedia-General-or-Unknown: Increase $wgHTTPImportTimeout to a higher value on WMF wikis - https://phabricator.wikimedia.org/T155209#2951959 (Nemo_bis) I tried to import 45k+ revisions of [[w:en:George W. Bush]] on https://test.wikipedia.org/w/index.php?title=Special:Import (which is a clea...
[09:32:57] !log upgrading firejail on image scalers
[09:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:27] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK
[09:34:50] _joe_: elukey: yeah you can try \\. it looks a bit strange to me though
[09:35:02] though my intermediate variable is even more awkward ;}
[09:36:50] Operations, Wikimedia-General-or-Unknown: Increase $wgHTTPImportTimeout to a higher value on WMF wikis - https://phabricator.wikimedia.org/T155209#2951969 (Joe) @Nemo_bis a blank page usually means something different than a timeout has happened. Probably a memory limit was hit; if we want to be able to...
[09:45:21] (CR) Filippo Giunchedi: [C: -1] "See inline comments" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/331623 (owner: Alexandros Kosiaris)
[09:47:19] (PS9) Marostegui: Reporting tests with the private data script [puppet] - https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680)
[09:53:27] RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK
[10:06:38] (PS1) Muehlenhoff: Switch app servers in codfw to systemd-timesyncd [puppet] - https://gerrit.wikimedia.org/r/332976 (https://phabricator.wikimedia.org/T150257)
[10:14:34] !log Compressing templatelinks tables on db1015 - T153739
[10:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:37] T153739: Defragment db1015 - https://phabricator.wikimedia.org/T153739
[10:19:31] Operations, ops-codfw: decommission mw2075-2089 to make room for new systems - https://phabricator.wikimedia.org/T154621#2952060 (MoritzMuehlenhoff) mw2075 was still shown in servermon, I ran "puppet node clean mw2075.codfw.wmnet" and "puppet node deactivate mw2075.codfw.wmnet" to fix it.
[10:25:27] (PS1) Elukey: Add aqs100[789] to the AQS partman recipe [puppet] - https://gerrit.wikimedia.org/r/332980 (https://phabricator.wikimedia.org/T155654)
[10:27:04] (CR) Elukey: [C: 2] Add aqs100[789] to the AQS partman recipe [puppet] - https://gerrit.wikimedia.org/r/332980 (https://phabricator.wikimedia.org/T155654) (owner: Elukey)
[10:28:38] (PS1) Hashar: (WIP) test: add xmlrpc gem for ruby 2.4 [puppet] - https://gerrit.wikimedia.org/r/332981
[10:28:40] Operations, ops-eqiad: rack and set up aqs100[7-9] - https://phabricator.wikimedia.org/T155654#2952070 (elukey) p:Triage>Normal
[10:29:10] (PS3) Muehlenhoff: hhvm: Restrict to domain networks [puppet] - https://gerrit.wikimedia.org/r/316550
[10:30:13] (CR) jerkins-bot: [V: -1] (WIP) test: add xmlrpc gem for ruby 2.4 [puppet] - https://gerrit.wikimedia.org/r/332981 (owner: Hashar)
[10:30:37] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:31:27] (CR) Muehlenhoff: [C: 2] hhvm: Restrict to domain networks [puppet] - https://gerrit.wikimedia.org/r/316550 (owner: Muehlenhoff)
[10:33:52] (PS2) Hashar: test: remove Gemfile.lock + add xmlrpc for ruby 2.4 [puppet] - https://gerrit.wikimedia.org/r/332981
[10:36:37] Operations, ops-codfw, DBA: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T153740#2952106 (Marostegui) Hey @Papaul can disk be replaced someday this week or early next week? Thank you!
[10:42:47] PROBLEM - puppet last run on wtp1015 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[10:47:28] jouncebot: next
[10:47:29] In 3 hour(s) and 12 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170119T1400)
[10:54:18] Operations, Wikimedia-General-or-Unknown: Increase $wgHTTPImportTimeout to a higher value on WMF wikis - https://phabricator.wikimedia.org/T155209#2952246 (Nemo_bis) >>! In T155209#2951969, @Joe wrote: > @Nemo_bis a blank page usually means something different than a timeout has happened. True. On the o...
[10:59:57] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[11:03:09] Operations, Wikimedia-General-or-Unknown: Increase $wgHTTPImportTimeout to a higher value on WMF wikis - https://phabricator.wikimedia.org/T155209#2952256 (TTO) >>! In T155209#2952246, @Nemo_bis wrote: >> Probably a memory limit was hit; if we want to be able to import tens of thousands of revisions we m...
[11:03:51] (Abandoned) Muehlenhoff: Add es-tool also on jessie [puppet] - https://gerrit.wikimedia.org/r/314695 (owner: Muehlenhoff)
[11:04:49] (PS3) Muehlenhoff: Provide a systemd override unit for memcached [puppet] - https://gerrit.wikimedia.org/r/319820
[11:09:07] jouncebot: Nemo_bis
[11:09:10] jouncebot: next
[11:09:10] In 2 hour(s) and 50 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170119T1400)
[11:11:47] RECOVERY - puppet last run on wtp1015 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[11:16:27] PROBLEM - puppet last run on mw1294 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [11:19:37] (03CR) 10Juniorsys: [C: 031] openstack: fix lint warnings in service.pp [puppet] - 10https://gerrit.wikimedia.org/r/332953 (owner: 10Dzahn) [11:21:19] (03CR) 10Juniorsys: [C: 031] openstack: instancersync not in autoload module layout [puppet] - 10https://gerrit.wikimedia.org/r/332954 (owner: 10Dzahn) [11:21:54] (03PS7) 10Juniorsys: geowiki module: Lint changes + modes/umask quoting [puppet] - 10https://gerrit.wikimedia.org/r/332101 (https://phabricator.wikimedia.org/T93645) [11:22:20] (03PS6) 10Juniorsys: mediawiki module: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/332103 (https://phabricator.wikimedia.org/T93645) [11:22:28] (03PS6) 10Juniorsys: postgresql module: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/332104 (https://phabricator.wikimedia.org/T93645) [11:22:36] (03PS6) 10Juniorsys: puppetmaster module: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/332105 (https://phabricator.wikimedia.org/T93645) [11:29:10] (03CR) 10Juniorsys: [C: 04-1] role analytics_cluster: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/332106 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [11:29:35] (03PS6) 10Juniorsys: toollabs role modules: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/332110 (https://phabricator.wikimedia.org/T93645) [11:29:48] (03PS6) 10Juniorsys: toollabs module: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/332111 [11:41:53] !log Compressing dewiki db1045 - T155399 [11:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:58] T155399: db1045 disk space - compression needed - https://phabricator.wikimedia.org/T155399 [11:46:27] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [12:06:38] (03PS1) 10Elukey: WIP - Add base Redis instance if no MW shard is configured. 
[puppet] - 10https://gerrit.wikimedia.org/r/332983 (https://phabricator.wikimedia.org/T137345) [12:06:43] (03CR) 10jerkins-bot: [V: 04-1] WIP - Add base Redis instance if no MW shard is configured. [puppet] - 10https://gerrit.wikimedia.org/r/332983 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [12:09:22] (03PS2) 10Elukey: WIP - Add base Redis instance if no MW shard is configured. [puppet] - 10https://gerrit.wikimedia.org/r/332983 (https://phabricator.wikimedia.org/T137345) [12:10:14] (03CR) 10jerkins-bot: [V: 04-1] WIP - Add base Redis instance if no MW shard is configured. [puppet] - 10https://gerrit.wikimedia.org/r/332983 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [12:11:47] Jenkins doesn't like me but mc1019 could get a shiny Redis instance with --^ https://puppet-compiler.wmflabs.org/5162/mc1019.eqiad.wmnet/ [12:12:14] still figuring out if this is the right approach or not [12:14:31] (03PS3) 10Elukey: WIP - Add base Redis instance if no MW shard is configured. [puppet] - 10https://gerrit.wikimedia.org/r/332983 (https://phabricator.wikimedia.org/T137345) [12:18:13] (03CR) 10Alex Monk: [C: 04-1] "Giuseppe thinks we should first two to merge a couple of these before doing it on this scale" [puppet] - 10https://gerrit.wikimedia.org/r/322425 (owner: 10Alex Monk) [12:18:32] (03CR) 10Alex Monk: [C: 04-1] "s/two to //" [puppet] - 10https://gerrit.wikimedia.org/r/322425 (owner: 10Alex Monk) [12:23:05] (03CR) 10Alex Monk: "does compute.pp need to be updated?" 
[puppet] - 10https://gerrit.wikimedia.org/r/332954 (owner: 10Dzahn) [12:23:23] (03CR) 10Alex Monk: [C: 031] openstack: fix lint warnings in service.pp [puppet] - 10https://gerrit.wikimedia.org/r/332953 (owner: 10Dzahn) [12:25:56] 06Operations, 10Analytics: Move cloudera packages to a separate archive section - https://phabricator.wikimedia.org/T155726#2952404 (10MoritzMuehlenhoff) [12:39:19] (03CR) 10Alex Monk: [C: 031] openstack: designate/glance/keystone not in autoload module [puppet] - 10https://gerrit.wikimedia.org/r/332955 (owner: 10Dzahn) [12:48:34] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/331856 (https://phabricator.wikimedia.org/T78342) (owner: 10Hashar) [12:49:51] (03PS1) 10Muehlenhoff: Add ladsgroup to analytics-privatedata-users for Hadoop access [puppet] - 10https://gerrit.wikimedia.org/r/332986 (https://phabricator.wikimedia.org/T155303) [12:55:47] (03PS2) 10Addshore: Enable TwoColConflict on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332908 (https://phabricator.wikimedia.org/T155716) [12:56:39] (03PS3) 10Addshore: Enable TwoColConflict on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332908 (https://phabricator.wikimedia.org/T155716) [12:57:12] (03PS2) 10Addshore: Enable TwoColConflict on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332909 (https://phabricator.wikimedia.org/T155717) [13:02:15] (03CR) 10Muehlenhoff: [C: 032] Add ladsgroup to analytics-privatedata-users for Hadoop access [puppet] - 10https://gerrit.wikimedia.org/r/332986 (https://phabricator.wikimedia.org/T155303) (owner: 10Muehlenhoff) [13:05:47] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 2 minutes ago with 10 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service] [13:06:48] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Request to access hadoop (stat1004) for Ladsgroup - https://phabricator.wikimedia.org/T155303#2952535 (10MoritzMuehlenhoff) 05Open>03Resolved Hi Amir, I've enabled your access. You should now be able to log into stat1004.eqiad.wmnet. If you run int... [13:07:30] 06Operations, 10DNS, 10Traffic, 07Beta-Cluster-reproducible, 07Upstream: Ferm/DNS library weirdness on deployment-mediawiki boxes - https://phabricator.wikimedia.org/T153468#2952537 (10MoritzMuehlenhoff) p:05Triage>03Normal [13:11:04] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Request to access hadoop (stat1004) for Ladsgroup - https://phabricator.wikimedia.org/T155303#2952538 (10Ladsgroup) Thanks! Should I make another phab card for hadoop ldap group (to access hue.wikimedia.org)? 
[13:11:46] (03PS4) 10Addshore: Enable TwoColConflict on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332908 (https://phabricator.wikimedia.org/T155716) [13:11:48] (03PS3) 10Addshore: Enable TwoColConflict on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332909 (https://phabricator.wikimedia.org/T155717) [13:11:50] (03PS2) 10Addshore: Enable TwoColConflict on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332910 (https://phabricator.wikimedia.org/T155721) [13:11:52] (03PS2) 10Addshore: Enable TwoColConflict on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332911 (https://phabricator.wikimedia.org/T155721) [13:13:10] 06Operations, 06TCB-Team, 10Two-Column-Edit-Conflict-Merge, 13Patch-For-Review, and 2 others: Deploy TwoColConflict extension to production - https://phabricator.wikimedia.org/T150184#2952543 (10Addshore) [13:14:03] 06Operations: Reimage achernar and amacar to jessie - https://phabricator.wikimedia.org/T155411#2952553 (10MoritzMuehlenhoff) p:05Triage>03Normal [13:16:17] 06Operations, 06TCB-Team, 10Two-Column-Edit-Conflict-Merge, 13Patch-For-Review, and 2 others: Deploy TwoColConflict extension to production - https://phabricator.wikimedia.org/T150184#2952592 (10Addshore) a:03Addshore [13:16:56] 06Operations, 10MediaWiki-extensions-InterwikiSorting, 10Wikidata, 10Wikimedia-Extension-setup, and 2 others: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183#2952603 (10Addshore) a:03Addshore [13:20:59] !log Compressing revision,pagelinks and templatelinks tables on db1035 - T110504 [13:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:03] T110504: defragment db1015, db1035 and db1027 - https://phabricator.wikimedia.org/T110504 [13:24:27] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service] [13:32:57] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [13:43:42] (03PS4) 10Alex Monk: Consolidate database lists list in one place [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331825 (owner: 10Dereckson) [13:44:55] (03CR) 10jerkins-bot: [V: 04-1] Consolidate database lists list in one place [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331825 (owner: 10Dereckson) [13:52:37] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [13:56:44] (03CR) 10Zfilipin: [C: 04-1] "Could not rebase in Gerrit: "The change could not be rebased due to a conflict during merge."" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331841 (https://phabricator.wikimedia.org/T155162) (owner: 10Odder) [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170119T1400). Please do the needful. [14:00:04] Odder and Dereckson: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:16] Is it though. [14:00:48] o/ [14:00:50] zeljkof: just rebase them manually ? :] [14:01:49] hashar: well, the author (odder) also gave it a -1 until community consensus is reached [14:02:15] so done :] [14:02:34] (03PS2) 10Odder: Add noratelimit user right to translation admins on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331841 (https://phabricator.wikimedia.org/T155162) [14:02:35] odder: there is consensus I think for that now, isn't there? [14:03:16] In that there are no oppose votes, yes. 
[14:03:16] odder: so, what should I do with 331841? deploy or skip for now? [14:03:29] Frankly, no one appears to care about this at all. [14:03:32] (03CR) 10jerkins-bot: [V: 04-1] Add noratelimit user right to translation admins on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331841 (https://phabricator.wikimedia.org/T155162) (owner: 10Odder) [14:03:45] zeljkof: Happy to have it deployed. [14:03:52] Yes, that's well what I thought: Commons notified, no objection [14:04:14] Failed syntax though, just give me a moment. [14:04:15] odder, Dereckson: ok, in that case, deploying [14:04:36] 06Operations, 10Analytics: Move cloudera packages to a separate archive section - https://phabricator.wikimedia.org/T155726#2952404 (10Ottomata) +1 I like this idea. [14:04:42] 06Operations, 10Analytics, 10Analytics-Cluster: Move cloudera packages to a separate archive section - https://phabricator.wikimedia.org/T155726#2952704 (10Ottomata) [14:04:44] hashar: should I do SWAT today? [14:04:58] I can do it if you're both busy. [14:05:22] I am skipping [14:05:28] Dereckson: go ahead, since one of the commits is yours :) [14:05:31] k [14:05:38] but i am lurking in case you need me ;] [14:06:00] I'm around too, but I doubt anybody would need my help ;) [14:07:24] odder: 14:02:42 PHP Parse error: syntax error, unexpected ''translationadmin'' (T_CONSTANT_ENCAPSED_STRING), expecting ']' in wmf-config/InitialiseSettings.php on line 7810 [14:07:55] Yup. [14:08:10] ;( [14:08:12] before resubmitting the next change: `php -l wmf-config/InitialiseSettings.php` [14:08:39] Yup :) [14:08:57] (03PS1) 10Filippo Giunchedi: site: provision roles for mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/332996 (https://phabricator.wikimedia.org/T153361) [14:09:15] If you have Arcanist installed, you can also do `arc lint`. ostriches: where did you publish your Git hook to do an arc lint on commit? [14:10:36] (03CR) 10TTO: [C: 04-1] "See comment."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/332967 (https://phabricator.wikimedia.org/T101634) (owner: 10Dereckson) [14:11:44] (03PS3) 10Odder: Add noratelimit user right to translation admins on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331841 (https://phabricator.wikimedia.org/T155162) [14:11:47] tto: thanks, good catch [14:13:22] Sorry about that, Dereckson [14:13:32] (03PS3) 10Dereckson: Fix Portal talk namespace name on Sanskrit Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332967 (https://phabricator.wikimedia.org/T101634) [14:14:22] (03CR) 10TTO: [C: 031] Fix Portal talk namespace name on Sanskrit Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332967 (https://phabricator.wikimedia.org/T101634) (owner: 10Dereckson) [14:14:49] (03CR) 10Tim Landscheidt: [C: 04-1] "> does compute.pp need to be updated?" [puppet] - 10https://gerrit.wikimedia.org/r/332954 (owner: 10Dzahn) [14:16:10] odder: on mwdebug1002, I tested print_r($groupOverrides['translationadmin']) -> Notice: Undefined index: translationadmin [14:16:22] so that's fine for a non plus array [14:16:36] We can merge it. 
[14:17:03] 06Operations, 10Analytics, 10Analytics-Cluster: Move cloudera packages to a separate archive section - https://phabricator.wikimedia.org/T155726#2952729 (10MoritzMuehlenhoff) [14:17:07] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331841 (https://phabricator.wikimedia.org/T155162) (owner: 10Odder) [14:18:11] (03CR) 10Filippo Giunchedi: [C: 032] site: provision roles for mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/332996 (https://phabricator.wikimedia.org/T153361) (owner: 10Filippo Giunchedi) [14:18:17] (03PS2) 10Filippo Giunchedi: site: provision roles for mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/332996 (https://phabricator.wikimedia.org/T153361) [14:18:24] (03Merged) 10jenkins-bot: Add noratelimit user right to translation admins on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331841 (https://phabricator.wikimedia.org/T155162) (owner: 10Odder) [14:18:38] (03PS7) 10Hashar: kafka: fix Unrecognized escape sequence '\.' [puppet] - 10https://gerrit.wikimedia.org/r/331451 [14:18:41] (03CR) 10jenkins-bot: Add noratelimit user right to translation admins on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331841 (https://phabricator.wikimedia.org/T155162) (owner: 10Odder) [14:19:53] (03CR) 10Ottomata: [C: 032] "Ok, seeing as we have a puppet 3.8 running in Trusty on labcontrol1001, this should make a difference. Going to try it. It *shouldn't* h" [puppet] - 10https://gerrit.wikimedia.org/r/332853 (owner: 10Ottomata) [14:19:58] (03PS2) 10Ottomata: Fix 'invalid byte sequence in US-ASCII' on non Jessie passenger puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/332853 [14:20:11] Dereckson: Thanks, looks OK. 
[14:21:17] (03PS3) 10Ottomata: Fix 'invalid byte sequence in US-ASCII' on non Jessie passenger puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/332853 [14:21:21] (03CR) 10Ottomata: [V: 032 C: 032] Fix 'invalid byte sequence in US-ASCII' on non Jessie passenger puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/332853 (owner: 10Ottomata) [14:21:37] PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:22:34] odder: live on mwdebug1002 [14:22:34] (03CR) 10Hashar: [V: 031 C: 031] "After a discussion on IRC I dropped the intermediate variable and just double escape the slash" [puppet] - 10https://gerrit.wikimedia.org/r/331451 (owner: 10Hashar) [14:22:46] Looks fine. [14:23:08] hashar: running pcc now :) [14:23:18] elukey: I have updated the unrecognized escape sequence change for kafka https://gerrit.wikimedia.org/r/#/c/331451/7/modules/role/manifests/kafka/main/mirror.pp [14:23:23] elukey: oh I did ] [14:23:45] (03PS8) 10Hashar: puppet parse validate from rake [puppet] - 10https://gerrit.wikimedia.org/r/331239 (https://phabricator.wikimedia.org/T154915) [14:24:41] hashar: everything looks good, does it resolve the warning? [14:25:16] (03PS8) 10Elukey: kafka: fix Unrecognized escape sequence '\.' 
[puppet] - 10https://gerrit.wikimedia.org/r/331451 (owner: 10Hashar) [14:25:19] odder: okay, syncing [14:25:55] !log dereckson@tin Synchronized wmf-config: Add noratelimit user right to translation admins on Commons (T155162) (duration: 00m 42s) [14:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:59] T155162: Add (noratelimit) flag to translation administrator usergroup on Commons - https://phabricator.wikimedia.org/T155162 [14:26:12] 06Operations, 10hardware-requests: eqiad: (1) Mediawiki log host to replace fluorine - https://phabricator.wikimedia.org/T153008#2952748 (10fgiunchedi) [14:26:14] 06Operations, 13Patch-For-Review: setup/install mwlog1001/WMF4724 - https://phabricator.wikimedia.org/T153361#2952744 (10fgiunchedi) 05Open>03Resolved mwlog1001 has its roles applied now, resolving and following up in T123728 [14:26:16] (03PS4) 10Dereckson: Fix Portal talk namespace name on Sanskrit Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332967 (https://phabricator.wikimedia.org/T101634) [14:26:28] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332967 (https://phabricator.wikimedia.org/T101634) (owner: 10Dereckson) [14:27:45] (03Merged) 10jenkins-bot: Fix Portal talk namespace name on Sanskrit Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332967 (https://phabricator.wikimedia.org/T101634) (owner: 10Dereckson) [14:28:03] (03CR) 10jenkins-bot: Fix Portal talk namespace name on Sanskrit Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332967 (https://phabricator.wikimedia.org/T101634) (owner: 10Dereckson) [14:28:34] Live on mwdebug1002. [14:32:34] PROBLEM - puppet last run on elastic1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:33:33] Works. 
[14:34:58] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Fix Portal talk namespace name on Sanskrit Wikipedia (T101634) (duration: 00m 39s) [14:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:02] T101634: Correction of namespace names in Sanskrit - https://phabricator.wikimedia.org/T101634 [14:36:58] !log restarting apache/puppetmaster on labcontrol1001 to try to fix 'invalid byte sequence in US-ASCII' puppet error [14:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:56] (03PS1) 10Marostegui: db-codfw.php: Depool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332997 (https://phabricator.wikimedia.org/T153300) [14:40:02] (03PS1) 10Dzahn: Revert "install: make mw2251 use trusty installer (debug)" [puppet] - 10https://gerrit.wikimedia.org/r/332998 [14:40:13] !log `mwscript namespaceDupes.php sawiki --fix` (T101634) [14:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:16] T101634: Correction of namespace names in Sanskrit - https://phabricator.wikimedia.org/T101634 [14:40:22] !log EU SWAT done [14:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:35] (03PS2) 10Dzahn: Revert "install: make mw2251 use trusty installer (debug)" [puppet] - 10https://gerrit.wikimedia.org/r/332998 [14:40:44] (03CR) 10Dzahn: [C: 032] Revert "install: make mw2251 use trusty installer (debug)" [puppet] - 10https://gerrit.wikimedia.org/r/332998 (owner: 10Dzahn) [14:41:04] 06Operations, 07Puppet, 05Puppet-infrastructure-modernization: Make the WMF puppet tree compile equally under puppet 3.4 and 3.8 - https://phabricator.wikimedia.org/T141242#2491255 (10Ottomata) For posterity: I've merged https://gerrit.wikimedia.org/r/#/c/332853/3, which was sort of related to this. 
[14:41:12] For namespaceDupes, everything related to the namespace is okay, but there are 4 links bugged with a side issue, I'm reporting that on the task [14:43:02] (03CR) 10Dzahn: [C: 032] openstack: fix lint warnings in service.pp [puppet] - 10https://gerrit.wikimedia.org/r/332953 (owner: 10Dzahn) [14:43:08] (03PS2) 10Dzahn: openstack: fix lint warnings in service.pp [puppet] - 10https://gerrit.wikimedia.org/r/332953 [14:45:54] elukey: yeah that resolves the warning :] I have confirmed [14:46:15] goooood! merging [14:46:25] (03PS9) 10Elukey: kafka: fix Unrecognized escape sequence '\.' [puppet] - 10https://gerrit.wikimedia.org/r/331451 (owner: 10Hashar) [14:46:40] thanks a ton [14:50:34] RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [14:50:40] (03CR) 10Elukey: [C: 032] kafka: fix Unrecognized escape sequence '\.' [puppet] - 10https://gerrit.wikimedia.org/r/331451 (owner: 10Hashar) [14:50:43] (03PS4) 10Elukey: WIP - Add base Redis instance if no MW shard is configured.
[puppet] - 10https://gerrit.wikimedia.org/r/332983 (https://phabricator.wikimedia.org/T137345) [14:50:51] (03PS2) 10Dzahn: openstack: instancersync not in autoload module layout [puppet] - 10https://gerrit.wikimedia.org/r/332954 [14:55:51] (03PS3) 10Dzahn: deployment-prep: Fully qualify hostnames [puppet] - 10https://gerrit.wikimedia.org/r/328455 (https://phabricator.wikimedia.org/T153608) (owner: 10Tim Landscheidt) [14:56:32] (03CR) 10Dzahn: [C: 032] "yep, all those are resolving (and to the same IP without "deployment-prep")" [puppet] - 10https://gerrit.wikimedia.org/r/328455 (https://phabricator.wikimedia.org/T153608) (owner: 10Tim Landscheidt) [14:58:37] (03PS3) 10Dzahn: trebuchet: Fully qualify hostname [puppet] - 10https://gerrit.wikimedia.org/r/328457 (https://phabricator.wikimedia.org/T153608) (owner: 10Tim Landscheidt) [14:59:18] (03CR) 10Dzahn: [C: 031] trebuchet: Fully qualify hostname [puppet] - 10https://gerrit.wikimedia.org/r/328457 (https://phabricator.wikimedia.org/T153608) (owner: 10Tim Landscheidt) [14:59:46] (03CR) 10Dzahn: "like this?" [puppet] - 10https://gerrit.wikimedia.org/r/332954 (owner: 10Dzahn) [15:00:25] 06Operations, 10MediaWiki-Vagrant, 06Release-Engineering-Team, 07Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#2952820 (10Gilles) [15:00:40] 06Operations, 13Patch-For-Review: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#2952822 (10fgiunchedi) mwlog[12]001 have been provisioned with jessie and are up and running. `udp2log-mw` runs as a systemd unit and so does `xenon-log` now. Next steps: [] send mediawiki udp2log t... [15:01:34] RECOVERY - puppet last run on elastic1031 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:03:57] Thanks for the help, Dereckson! [15:04:01] * odder off now [15:04:58] bd808: I was looking for scholarships udp2log logs on fluorine but couldn't find any in /a/mw-log, is it possible nothing was logged? 
[15:23:49] (03CR) 10Tim Landscheidt: [C: 031] "Untested, but should work." [puppet] - 10https://gerrit.wikimedia.org/r/332954 (owner: 10Dzahn) [15:25:39] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 13Patch-For-Review: Set up monitoring for ORES redis database - https://phabricator.wikimedia.org/T155482#2952898 (10Halfak) p:05Triage>03High [15:41:47] (03PS2) 10Filippo Giunchedi: Add new mandatory config value for SVG engine in Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/330867 (https://phabricator.wikimedia.org/T150754) (owner: 10Gilles) [15:41:49] (03CR) 10Andrew Bogott: [C: 031] "lgtm, I'm giving the puppet compiler a run right now." [puppet] - 10https://gerrit.wikimedia.org/r/332105 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [15:42:07] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster module: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/332105 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [15:42:55] (03CR) 10Filippo Giunchedi: [C: 032] Add new mandatory config value for SVG engine in Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/330867 (https://phabricator.wikimedia.org/T150754) (owner: 10Gilles) [15:44:50] dcausse: thanks for the CirrusSearch +2 :] [15:45:13] dcausse: that one kept me quite busy. It is definitely an issue in MediaWiki core itself though.
[15:45:20] hashar: he, sorry for the problem in the first place :/ [15:45:25] not your fault [15:45:33] most would have done the exact same thing [15:45:48] the fault is MediaWiki test suite not resetting the MEdiaWiki services [15:46:12] \O/ [15:46:14] oh ok, yes test suites should be isolated [15:46:36] there is a php feature to run each tests in an isolated PHP [15:46:40] but afaik that breaks a lot of tests [15:46:47] and makes the whole run even slower than it is now :( [15:47:01] (03PS7) 10Andrew Bogott: puppetmaster module: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/332105 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [15:47:03] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 0.1.32 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/330866 (owner: 10Gilles) [15:47:43] well, I have a base test class in the cirrus tests, I think I'll add more cleanups there [15:49:34] godog: the scholarships app doesn't log very often. really just when something goes wrong. [15:50:29] bd808: ah ok! I'll send a code review your way to move that traffic to mwlog1001 as per T123728 [15:50:29] T123728: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728 [15:57:34] PROBLEM - DPKG on thumbor1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:59:34] RECOVERY - DPKG on thumbor1001 is OK: All packages OK [16:02:31] that was me ^ [16:03:03] 06Operations, 10DBA, 10Monitoring: Create a check/calendar alert for MariaDB TLS certs - https://phabricator.wikimedia.org/T152427#2953089 (10Dzahn) Hi, i can take a shot at this. Did it for other certs before. where are the certs located please. I looked in files/ssl/ in puppet repo. Where do they get insta... [16:09:08] 06Operations, 10DBA, 10Monitoring: Create a check/calendar alert for MariaDB TLS certs - https://phabricator.wikimedia.org/T152427#2953095 (10Marostegui) Hey @Dzahn help is welcomed!! 
They get installed here: ``` /etc/mysql/ssl ``` [16:17:28] (03PS3) 10Hashar: Introduce linters using rake [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/331328 (https://phabricator.wikimedia.org/T154894) [16:17:42] (03CR) 10Hashar: "check experimental" [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/331328 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [16:18:01] (03CR) 10Hashar: "I have removed the Gemfile.lock file which is not needed." [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/331328 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [16:18:24] (03PS2) 10Hashar: Introduce linters using rake [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/331327 (https://phabricator.wikimedia.org/T154894) [16:18:33] (03CR) 10Hashar: "I have removed the Gemfile.lock file which is not needed." [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/331327 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [16:18:39] (03CR) 10Hashar: "check experimental" [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/331327 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [16:19:01] (03PS2) 10Hashar: Introduce linters using rake [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/331330 (https://phabricator.wikimedia.org/T154894) [16:19:10] (03CR) 10Hashar: "I have removed the Gemfile.lock file which is not needed." 
[puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/331330 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [16:19:16] (03CR) 10Hashar: "check experimental" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/331330 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [16:19:19] (03CR) 10Ottomata: [C: 031] "We *could* alter the use of namenode_opts in the cdh module so that it would auto populate the -Xmx value based on namenode heapsize, buuu" [puppet] - 10https://gerrit.wikimedia.org/r/330154 (https://phabricator.wikimedia.org/T88640) (owner: 10Elukey) [16:19:57] (03PS2) 10Hashar: Introduce linters using rake [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/331332 (https://phabricator.wikimedia.org/T154894) [16:20:02] (03CR) 10Hashar: "I have removed the Gemfile.lock file which is not needed." [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/331332 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [16:20:14] (03CR) 10Hashar: "check experimental" [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/331332 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [16:20:30] 06Operations, 10DBA, 10Monitoring: Create a check/calendar alert for MariaDB TLS certs - https://phabricator.wikimedia.org/T152427#2953129 (10jcrespo) @Dzahn, ideally, the check should be done connecting to the servers. The files could be there, but not loaded into memory after a restart, and files are not l... [16:22:51] (03PS2) 10Hashar: Introduce linters using rake [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/331329 (https://phabricator.wikimedia.org/T154894) [16:23:07] (03CR) 10Hashar: "I have removed the Gemfile.lock which is not needed." 
[puppet/mariadb] - 10https://gerrit.wikimedia.org/r/331329 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [16:23:16] (03CR) 10Hashar: "check experimental" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/331329 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [16:23:21] 06Operations, 10DBA, 10Monitoring: Create a check/calendar alert for MariaDB TLS certs - https://phabricator.wikimedia.org/T152427#2953136 (10faidon) @Jcrespo is correct, files on disk aren't the right way to monitor this. `check_ssl` should work for this use case, has been explicitly been made to work with... [16:23:43] (03PS6) 10Hashar: Introduce linters using rake [puppet/cdh] - 10https://gerrit.wikimedia.org/r/331312 (https://phabricator.wikimedia.org/T154894) [16:24:46] (03CR) 10Hashar: "I have removed Gemfile.lock which is not needed." [puppet/cdh] - 10https://gerrit.wikimedia.org/r/331312 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [16:24:57] (03CR) 10Hashar: "check experimental" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/331312 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [16:28:40] (03PS1) 10Faidon Liambotis: nagios: fix check_ssl_http_on_port misnomer [puppet] - 10https://gerrit.wikimedia.org/r/333010 [16:28:42] (03PS1) 10Faidon Liambotis: exim: kill exim4.minimal.labtest.erb unused [puppet] - 10https://gerrit.wikimedia.org/r/333011 (https://phabricator.wikimedia.org/T148717) [16:30:18] (03PS1) 10Hashar: test: puppet-syntax now fails on deprecation notices [puppet] - 10https://gerrit.wikimedia.org/r/333012 (https://phabricator.wikimedia.org/T154915) [16:30:59] (03CR) 10Hashar: [C: 04-1] "Will fail due to use still having "import realm.pp" in site.pp T154915" [puppet] - 10https://gerrit.wikimedia.org/r/333012 (https://phabricator.wikimedia.org/T154915) (owner: 10Hashar) [16:31:34] PROBLEM - puppet last run on analytics1037 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:31:36] 06Operations, 07Puppet, 10Continuous-Integration-Config: Get rid of "import realm.pp" in manifests/site.pp - https://phabricator.wikimedia.org/T154915#2953176 (10hashar) [16:31:50] 06Operations, 07Puppet, 10Continuous-Integration-Config: Get rid of "import realm.pp" in manifests/site.pp - https://phabricator.wikimedia.org/T154915#2928315 (10hashar) [16:42:59] (03CR) 10Andrew Bogott: [C: 032] exim: kill exim4.minimal.labtest.erb unused [puppet] - 10https://gerrit.wikimedia.org/r/333011 (https://phabricator.wikimedia.org/T148717) (owner: 10Faidon Liambotis) [16:47:53] (03PS4) 10Andrew Bogott: Horizon puppettab: display profiles as well as roles [puppet] - 10https://gerrit.wikimedia.org/r/332781 [16:57:22] (03CR) 10jerkins-bot: [V: 04-1] test: puppet-syntax now fails on deprecation notices [puppet] - 10https://gerrit.wikimedia.org/r/333012 (https://phabricator.wikimedia.org/T154915) (owner: 10Hashar) [16:57:55] (03CR) 10Andrew Bogott: [C: 032] Horizon puppettab: display profiles as well as roles [puppet] - 10https://gerrit.wikimedia.org/r/332781 (owner: 10Andrew Bogott) [16:58:25] (03PS5) 10Hashar: contint: move from /mnt to /srv [puppet] - 10https://gerrit.wikimedia.org/r/312523 (https://phabricator.wikimedia.org/T146381) [16:59:05] (03CR) 10Hashar: [C: 04-1] "Rebased on deployment-prep and integration puppetmasters." [puppet] - 10https://gerrit.wikimedia.org/r/312523 (https://phabricator.wikimedia.org/T146381) (owner: 10Hashar) [17:00:05] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170119T1700). Please do the needful. [17:00:05] Pchelolo and Krenair: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[17:00:10] (03PS2) 10Faidon Liambotis: exim: kill exim4.minimal.labtest.erb unused [puppet] - 10https://gerrit.wikimedia.org/r/333011 (https://phabricator.wikimedia.org/T148717) [17:00:21] (03CR) 10Faidon Liambotis: [V: 032] exim: kill exim4.minimal.labtest.erb unused [puppet] - 10https://gerrit.wikimedia.org/r/333011 (https://phabricator.wikimedia.org/T148717) (owner: 10Faidon Liambotis) [17:01:34] RECOVERY - puppet last run on analytics1037 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [17:11:14] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:12:02] elukey, ping? [17:12:04] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [17:13:49] Krenair: hello :) [17:14:11] 06Operations, 10Analytics, 10netops, 13Patch-For-Review: Open temporary access from analytics vlan to new-labsdb one - https://phabricator.wikimedia.org/T155487#2953282 (10Nuria) [17:14:34] PROBLEM - DPKG on thumbor1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:15:48] that's me ^ [17:16:29] elukey, hey [17:16:34] RECOVERY - DPKG on thumbor1002 is OK: All packages OK [17:16:48] is someone doing puppet swat? [17:17:29] sorry Krenair I didn't see jouncebot's ping, yes I'm looking at the patches [17:17:35] ok [17:18:57] mmm I don't get the "So we can add beta suffixes later" [17:19:21] (03PS3) 10Filippo Giunchedi: Graphoid: Install all fonts available to mediawiki. [puppet] - 10https://gerrit.wikimedia.org/r/328316 (https://phabricator.wikimedia.org/T153726) (owner: 10Ppchelko) [17:19:33] can you be a bit more specific? [17:20:16] elukey, look at the dependent patch [17:21:04] (03CR) 10Filippo Giunchedi: [C: 032] Graphoid: Install all fonts available to mediawiki.
[puppet] - 10https://gerrit.wikimedia.org/r/328316 (https://phabricator.wikimedia.org/T153726) (owner: 10Ppchelko) [17:21:34] Pchelolo: ^ merged [17:26:29] !log restarting and upgrading mariadb on labsdb1004 to 10.0.29 [17:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:26] elukey: I'm assuming you are looking at Krenair's patch? [17:29:34] (03PS2) 10Filippo Giunchedi: Switch Thumbor to swift loader [puppet] - 10https://gerrit.wikimedia.org/r/330869 (https://phabricator.wikimedia.org/T151441) (owner: 10Gilles) [17:30:23] godog: I am in a meeting now, trying to check it in the meantime but I could use some help.. the following patches are a bit dense :) [17:31:35] the following patches are not on the list for this swat [17:32:19] Krenair: mind adding a PCC run too to the patch? [17:33:07] Krenair: sure but it doesn't make sense to merge this one if the others have problems [17:33:25] (03CR) 10Filippo Giunchedi: [C: 032] Switch Thumbor to swift loader [puppet] - 10https://gerrit.wikimedia.org/r/330869 (https://phabricator.wikimedia.org/T151441) (owner: 10Gilles) [17:33:29] Not sure if I'm still allowed to do that godog [17:33:33] Let's see [17:34:44] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service] [17:34:57] hm, apparently I am [17:35:21] elukey, why not? 
[17:35:45] because this one adds templates that are not templates, but simple files [17:36:13] okay [17:36:17] This is getting ridiculous [17:36:38] I try to split these patches up to make your life easier [17:37:07] It should be perfectly safe to do this without the dependencies [17:37:12] without the dependents* [17:37:21] if those have issues, we can identify and fix those later, before they are merged [17:38:55] if I understood the comment right, the commit message could be clearer on what's going to happen after the patch is merged [17:39:16] Krenair: I am ONE reviewer and I am trying to check your work to ensure that we don't break anything, plus this does not seem to be a super urgent thing. I only merge when I am sure about a change and this is not to block anybody, but again to make sure that we do the right thing. [17:39:32] this is why I am asking and taking time [17:39:45] so please do not say "This is getting ridiculous" [17:40:01] we are all trying to do the right thing, it is only a matter of collaboration [17:40:04] you think this change is going to break things, or the other patches that I'm not asking you to fully review? [17:41:13] this is not a great attitude when you ask other people to help, just saying [17:42:28] Your change is fine, and probably PCC agrees. What I am trying to say is that if the other patches will not get accepted (not by me but maybe others etc..) 
we'll end up in having templates without any variables [17:42:30] I can change "So we can add beta suffixes later" to "So that in future commits we can add beta variants to the domains of the VHosts in these files" [17:42:32] so pointless [17:42:44] (03PS1) 10Addshore: Add wmde ldap group to grafana [puppet] - 10https://gerrit.wikimedia.org/r/333024 [17:42:50] (it was referring to my prev sentence) [17:43:22] Krenair: godog is reviewing your changes too, the +2 is not only from me [17:44:04] 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and set up ms-fe100[5-7] - https://phabricator.wikimedia.org/T155095#2953423 (10Cmjohnson) @fgiunchedi ms-fe1005 and 1006 are both on asw2-a5 and the switch is not seeing them. They are both setup exactly the same as 1007 and 1008 and using the same cables... [17:44:06] I was just trying to help because you pinged [17:44:34] hmmmm, i'm not sure i have login credentials anymore. can anybody check on video scalers if there are ffmpeg jobs eating tons of memory? trying to track down whether they're dying because of time or memory limit [17:44:49] ganglia? 
[17:45:01] ganglia shows me totals but not per-process [17:45:07] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Video%2520scalers%2520eqiad&tab=m&vn=&hide-hf=false [17:45:09] Ah [17:45:09] i'm wondering if anything's approaching 4gb by itself [17:46:27] 06Operations, 10ops-codfw: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2953444 (10Papaul) [17:47:11] !log Reattach Zlazstadpieroniebomiurwieszkabelodinternetu CentralAuth account (T155184) [17:47:13] If puppet reviewers are more concerned about how I might later build on patches like this, I can squash these chains of commits into big patches [17:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:14] T155184: Reattach global account deleted by mistake - https://phabricator.wikimedia.org/T155184 [17:47:26] Big patches are very hard to get reviewed [17:47:30] They are much more dangerous [17:47:51] 06Operations, 10ops-codfw: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2953474 (10Papaul) [17:48:11] brion: I'm on $randomscaler watching top and I don't see any ffmpegs using more than 1.1% memory (of 16G) [17:48:18] so far anyways [17:48:19] 06Operations, 10ops-codfw: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2953479 (10RobH) [17:48:30] apergos: thx [17:48:52] Krenair: sure, and you did a very good thing in splitting. 
But the first one highly depends on the subsequent ones, so I was checking [17:48:57] yw [17:49:01] not really [17:49:16] ok i'm fixing my ssh config ;) can get in now \o/ [17:50:00] you chose the same random scaler I did [17:50:04] if this gets accepted without the future ones it's a pretty minor, almost stylistic, issue [17:50:19] brion: apergos: and we've big 1G videos 720p/1080p in the queue currently [17:50:35] (03PS3) 10Giuseppe Lavagetto: Add schema support [software/conftool] - 10https://gerrit.wikimedia.org/r/288881 [17:50:59] Dereckson: maybe you can let us/him know when they get to the top of the queue so we can check what happens [17:50:59] so < 200 Mo per processus seems a fair assumption [17:51:11] <_joe_> brion: whenever you want to activate the double queue for videos, we need to first configure the videoscalers to run them [17:51:14] Dereckson: yeah, a lot of the highest-res are failing with exitcode 137 [17:51:18] <_joe_> and other jobrunners not to happen [17:51:19] (03CR) 10jerkins-bot: [V: 04-1] Add schema support [software/conftool] - 10https://gerrit.wikimedia.org/r/288881 (owner: 10Giuseppe Lavagetto) [17:51:28] 06Operations, 10ops-codfw: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2953504 (10RobH) So we could start putting some of these mc systems in racks with mw systems, which would eliminate the need to have any two mc systems in the same rack. That may be overkill however, as with... [17:51:57] _joe_: does https://gerrit.wikimedia.org/r/#/c/331668/ make sense for start? [17:52:13] it's probably wildly incorrect ;) [17:52:20] <_joe_> brion: heh, I guess so, didn't notice it yet [17:52:22] but i think makes a stab [17:52:23] cool :D [17:52:27] !log rolling restart and upgrade of labsdb1009/10/11 to mariadb 10.1.21 [17:52:28] Krenair: puppet guidelines say that you should not have a template without any variable inside, because it is a file. 
Not a minor style imho [17:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:31] <_joe_> I was knee deep into writing tests [17:52:35] :) [17:52:39] anyhow, godog will review your change in a bit [17:52:53] 06Operations, 10ops-codfw: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2953512 (10Papaul) [17:52:55] As I said, I can squash them [17:53:11] <_joe_> brion: so, I will merge that change tomorrow morning [17:53:14] <_joe_> if it's correct [17:53:22] _joe_: thx. feel free to throw it back for fixes :D [17:53:36] should be safe to deploy (once correct) before the php changes in mw [17:53:40] <_joe_> so that you're good to go for enabling the two queues whenever you feel like it [17:53:45] awesome [17:54:27] Krenair: my comment about the commit message was about the fact that from a casual reviewer (e.g. puppet swat) looking at your patch the plan forward isn't obvious at all, the only comment I could find is https://phabricator.wikimedia.org/T1256#2809598 [17:54:46] Krenair: e.g. what will happen to the templates? what variable will be put in? 
etc [17:55:29] It seems my patches are either so small that they temporarily violate style guides, or so big that they are dangerous and difficult to review [17:56:17] the future patch will introduce <%= @domain_suffix % [17:56:22] (03CR) 10Andrew Bogott: [C: 032] puppetmaster module: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/332105 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [17:56:23] in place of 'org' [17:56:31] (03PS8) 10Andrew Bogott: puppetmaster module: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/332105 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [17:56:50] defaulting to org [17:58:07] the compiler result for this patch is at https://puppet-compiler.wmflabs.org/5170/ [17:59:04] RECOVERY - Host mw2098 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [18:00:04] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170119T1800). [18:00:45] Krenair: ok, thanks, IMO some context on what's going to happen next was needed [18:01:08] well [18:01:12] it's literally too late now [18:01:55] PROBLEM - nutcracker process on mw2098 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (nutcracker), command name nutcracker [18:01:55] PROBLEM - Check systemd state on mw2098 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:01:55] PROBLEM - puppet last run on mw2098 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:02:04] PROBLEM - nutcracker port on mw2098 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [18:02:04] PROBLEM - Check whether ferm is active by checking the default input chain on mw2098 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [18:02:18] and this was with either 8 hours or 2 months notice, depending on how you look at it [18:02:44] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [18:02:47] 06Operations, 10ops-codfw: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2953531 (10Papaul) p:05Triage>03Normal [18:04:36] (03PS2) 10Andrew Bogott: ldaplist: remove host lookup [puppet] - 10https://gerrit.wikimedia.org/r/332645 [18:04:39] ok so i think my ffmpeg processes are dying early because ulimit handles cpu time, not wall clock time, and they're churning away with some threading for ~175% cpu usage [18:04:54] PROBLEM - cassandra-b CQL 10.192.16.159:9042 on restbase-test2003 is CRITICAL: connect to address 10.192.16.159 and port 9042: Connection refused [18:04:54] PROBLEM - cassandra-b SSL 10.192.16.159:7001 on restbase-test2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [18:05:00] thus the 8-hour limit is reached in 4-6 hours on the big/long files [18:05:04] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
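A rough sketch of the arithmetic behind the ffmpeg observation above: if the limit counts CPU time rather than wall-clock time, a multi-threaded process burns through it faster than the clock suggests. The function name below is hypothetical and the numbers are only the ones mentioned in the channel (8-hour limit, ~175% CPU); this is an illustration, not how the transcode limit is actually enforced.

```python
def wall_clock_hours(cpu_limit_hours: float, cpu_utilization: float) -> float:
    """Estimate wall-clock hours before a CPU-time limit is exhausted.

    cpu_utilization is average CPU use as a fraction of one core,
    so 1.75 means 175% (a process keeping ~1.75 cores busy).
    """
    if cpu_utilization <= 0:
        raise ValueError("cpu_utilization must be positive")
    return cpu_limit_hours / cpu_utilization


if __name__ == "__main__":
    # Figures from the discussion: 8-hour CPU limit at roughly 175% CPU.
    print(f"8h limit:  {wall_clock_hours(8, 1.75):.2f} wall-clock hours")
    # Doubling the limit, assuming the same utilization.
    print(f"16h limit: {wall_clock_hours(16, 1.75):.2f} wall-clock hours")
```

At a steady 175% this gives a bit over four and a half hours of wall time for an 8-hour CPU limit, which is in line with the 4-6 hour range reported for the big files; lower average utilization pushes the figure toward the upper end.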
[18:05:18] Krenair: ultimately what helps getting patches merged is discussing the changes themselves before sending them blindly for review without context [18:05:54] PROBLEM - Disk space on ms-be2018 is CRITICAL: Return code of 255 is out of bounds [18:06:07] And they were available for that for a long, long time before the scheduled time [18:06:14] PROBLEM - dhclient process on ms-be2018 is CRITICAL: Return code of 255 is out of bounds [18:06:38] (03CR) 10Andrew Bogott: [C: 032] ldaplist: remove host lookup [puppet] - 10https://gerrit.wikimedia.org/r/332645 (owner: 10Andrew Bogott) [18:06:54] PROBLEM - swift-object-replicator on ms-be2018 is CRITICAL: Return code of 255 is out of bounds [18:07:04] PROBLEM - puppet last run on ms-be2018 is CRITICAL: Return code of 255 is out of bounds [18:07:04] PROBLEM - swift-container-server on ms-be2018 is CRITICAL: Return code of 255 is out of bounds [18:07:14] RECOVERY - dhclient process on ms-be2018 is OK: PROCS OK: 0 processes with command name dhclient [18:07:24] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:07:54] RECOVERY - Disk space on ms-be2018 is OK: DISK OK [18:07:55] RECOVERY - swift-object-replicator on ms-be2018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [18:08:03] if we needed more context it should've been asked for either shortly after upload, or after I gave notice that I'd be scheduling the request today [18:08:04] RECOVERY - swift-container-server on ms-be2018 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [18:08:14] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:08:24] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [18:09:04] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [18:09:04] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [18:09:55] instead we ran out the clock on the window [18:10:08] someone took a knee? [18:10:12] ignore me... [18:10:54] I'll take a look at mw2098 and ms-be2018 [18:10:58] yeah it doesn't really work [18:11:34] PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:11:44] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [18:11:54] PROBLEM - Disk space on ms-be2018 is CRITICAL: Return code of 255 is out of bounds [18:11:54] PROBLEM - swift-object-replicator on ms-be2018 is CRITICAL: Return code of 255 is out of bounds [18:11:58] Krenair: I'm dropping the ball but I think you are overreacting, e.g. https://gerrit.wikimedia.org/r/#/c/322601/ was fine [18:12:03] godog: mw2098 has issues with the drac, see the related task robh opened yesterday [18:12:04] PROBLEM - swift-container-server on ms-be2018 is CRITICAL: Return code of 255 is out of bounds [18:12:12] to this specific one, maybe [18:12:21] indeed mw2098 is now offline =[ [18:12:29] we have a task in for papaul to troubleshoot it [18:12:29] (03PS1) 10Brion VIBBER: Double $wgTranscodeBackgroundTimeLimit to compensate for threading [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333035 (https://phabricator.wikimedia.org/T155750) [18:12:37] volans: is it bitching about being in dsh? 
[18:12:54] robh: I fixed it setting it to inactive before going to bed [18:12:54] RECOVERY - Disk space on ms-be2018 is OK: DISK OK [18:12:54] RECOVERY - swift-object-replicator on ms-be2018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [18:13:03] volans: so inactive is the one that pulls from dsh? [18:13:04] RECOVERY - swift-container-server on ms-be2018 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [18:13:05] ok, I'm downtiming mw2098 for a bit [18:13:07] cool, will know in future! [18:13:10] I'm more irritated about the pattern than this [18:13:25] godog: indeed, i thought i had ack'd the problems yesterday [18:13:30] robh: yes, but you need to run puppet on tin (or mira) to make that effective immediately [18:13:49] or set confctl from there in the first place? [18:14:00] or needs puppet run either way? [18:14:13] something something straws and camels [18:14:20] or was that an idiom [18:15:34] PROBLEM - Host ms-be2010 is DOWN: PING CRITICAL - Packet loss = 100% [18:15:50] robh: AFAIK puppet is needed because I think it runs the script that updates dsh groups, you can run it manually I guess [18:16:01] ahhh, ok yes that makes sense [18:16:27] i assumed that the confctl wrapper maybe pulled those directly but that would be messy i suppose [18:17:47] (03PS9) 10Andrew Bogott: puppetmaster module: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/332105 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [18:18:03] (03PS5) 10Dzahn: dumps: switch to Letsencrypt for TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/332543 (https://phabricator.wikimedia.org/T154940) [18:18:19] (03CR) 10Dzahn: dumps: switch to Letsencrypt for TLS cert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/332543 (https://phabricator.wikimedia.org/T154940) (owner: 10Dzahn) [18:18:54] RECOVERY - nutcracker process on mw2098 is OK: PROCS OK: 1 process with UID = 110 
(nutcracker), command name nutcracker [18:18:54] RECOVERY - Check systemd state on mw2098 is OK: OK - running: The system is fully operational [18:18:54] RECOVERY - puppet last run on mw2098 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [18:19:04] RECOVERY - nutcracker port on mw2098 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [18:19:04] RECOVERY - Check whether ferm is active by checking the default input chain on mw2098 is OK: OK ferm input default policy is set [18:19:11] papaul: ms-be2010 is you? [18:19:22] !log restbase updating firejail in production [18:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:38] something odd is going on btw, I just got booted out of ms-be2018 and on reconnection I get ssh host key mismatch [18:20:44] papaul: ^ [18:20:54] PROBLEM - swift-object-replicator on ms-be2018 is CRITICAL: Return code of 255 is out of bounds [18:21:04] PROBLEM - swift-container-server on ms-be2018 is CRITICAL: Return code of 255 is out of bounds [18:21:14] PROBLEM - dhclient process on ms-be2018 is CRITICAL: Return code of 255 is out of bounds [18:21:54] RECOVERY - swift-object-replicator on ms-be2018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [18:22:04] RECOVERY - swift-container-server on ms-be2018 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [18:22:14] RECOVERY - dhclient process on ms-be2018 is OK: PROCS OK: 0 processes with command name dhclient [18:22:32] templates/wmnet:ms-be2018 1H IN A 10.192.16.160 [18:22:33] templates/wmnet:mw2255 1H IN A 10.192.16.160 [18:22:44] papaul: please shut mw2255 [18:23:34] godog: ms-2010 is me [18:24:47] godog: :( [18:24:55] 06Operations: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761#2953657 (10Volans) [18:25:02] robh: mw2255 down [18:25:05] this is for you godog ^^^ 
:) [18:25:11] godog: mw2055 down [18:25:29] ? [18:25:34] PROBLEM - Host betelgeuse is DOWN: PING CRITICAL - Packet loss = 100% [18:25:36] robh: not you [18:25:55] jgage: ^ betelgeuse page [18:25:59] frack? [18:25:59] bah [18:26:04] jeff green [18:26:06] not gage, heh [18:26:09] indeed [18:26:23] more like Jeff_Green [18:26:23] Jeff_Green: ^^^ [18:26:39] he is aware [18:26:43] yup [18:26:59] papaul: ok thanks! it got the same ip address as ms-be2018 hence the breakage [18:27:28] godog: yes saw that sorry [18:27:58] ahh [18:28:04] been there done that! [18:28:16] (03CR) 10Dzahn: [C: 04-1] ":p nginx: [emerg] directive "include" is not terminated by ";" in /etc/nginx/sites-enabled/dumps:8" [puppet] - 10https://gerrit.wikimedia.org/r/332543 (https://phabricator.wikimedia.org/T154940) (owner: 10Dzahn) [18:28:20] then you develop an extreme case of dns paranoia [18:28:23] papaul: ditto mw2254, please shut it [18:28:24] which is appropriate ;] [18:28:29] dont shut them down [18:28:31] pull the mgmt cable [18:28:37] shutting down doesnt release the iP [18:28:50] pull mgmt cable on the incorrect new hosts, and change the ip in drac afterwards [18:29:02] that way it doesnt stop mgmt traffic to the correctly ip'd hosts [18:29:12] we're talking about the prod addresses not mgmt [18:29:16] ohhh [18:29:18] (03PS6) 10Dzahn: dumps: switch to Letsencrypt for TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/332543 (https://phabricator.wikimedia.org/T154940) [18:29:22] disregard me =P [18:29:25] godog: sorry. [18:29:26] (03PS2) 10Andrew Bogott: toollabs: Don't use wikitech API to find labs instances in tools-clush-generator [puppet] - 10https://gerrit.wikimedia.org/r/329021 (https://phabricator.wikimedia.org/T104575) (owner: 10Alex Monk) [18:29:51] thats much much worse =P [18:29:55] robh: no worries! figures I was convinced the dns lint checks would also check for duplicate ips [18:30:05] nope! [18:30:12] godog: mw2254 down [18:30:18] it would be nice eh? 
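The duplicate-IP check wished for above could be sketched roughly as follows. This is a minimal illustration only: it assumes zone lines look like the grep output quoted earlier (`ms-be2018 1H IN A 10.192.16.160`), the function name `find_duplicate_ips` is hypothetical, and the third sample host is made up for contrast. A real lint job for the DNS repo would need to handle its actual template syntax.

```python
import re
from collections import defaultdict

# Matches lines such as "ms-be2018 1H IN A 10.192.16.160" (TTL optional).
A_RECORD = re.compile(
    r"^(\S+)\s+(?:\S+\s+)?IN\s+A\s+(\d{1,3}(?:\.\d{1,3}){3})\s*(?:;.*)?$"
)


def find_duplicate_ips(zone_text):
    """Return {ip: [hosts]} for every IPv4 address used by more than one A record."""
    hosts_by_ip = defaultdict(list)
    for raw_line in zone_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and zone-file comments
        match = A_RECORD.match(line)
        if match:
            host, ip = match.groups()
            hosts_by_ip[ip].append(host)
    return {ip: hosts for ip, hosts in hosts_by_ip.items() if len(hosts) > 1}


# First two lines are from the log; the third host is hypothetical.
SAMPLE = """\
ms-be2018 1H IN A 10.192.16.160
mw2255 1H IN A 10.192.16.160
example-host 1H IN A 10.192.16.161
"""

if __name__ == "__main__":
    for ip, hosts in find_duplicate_ips(SAMPLE).items():
        print(f"duplicate A records for {ip}: {', '.join(hosts)}")
```

A CI job along these lines could exit non-zero whenever the returned mapping is non-empty, so the clash is caught at DNS review time rather than when a host gets booted off the network.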
[18:30:31] papaul: nice, thanks! [18:30:33] would be nice indeed [18:30:43] meh, the weather is shit and i dont wanna drive in it. im delaying my ulsfo day from today to tomorrow. [18:31:20] (03CR) 10Andrew Bogott: [C: 032] toollabs: Don't use wikitech API to find labs instances in tools-clush-generator [puppet] - 10https://gerrit.wikimedia.org/r/329021 (https://phabricator.wikimedia.org/T104575) (owner: 10Alex Monk) [18:31:36] papaul: there might be more duplicate ips like that, I'm checking [18:32:03] yes i am stoping the installs checking too [18:32:15] ok betelgeuse is back online. for some reason it didn't survive a reboot, console just showed a line of non-alpha characters. second reboot was uneventful [18:32:47] also I don't know why it paged so many times, it was supposed to be down-for-maintenance [18:33:02] (03CR) 10Dzahn: [C: 031] "tested nginx config change incl the challenge-nginx snippet" [puppet] - 10https://gerrit.wikimedia.org/r/332543 (https://phabricator.wikimedia.org/T154940) (owner: 10Dzahn) [18:33:34] RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational [18:33:40] papaul: yeah all duplicates the last mw batch [18:33:48] (03PS7) 10Dzahn: dumps: switch to Letsencrypt for TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/332543 (https://phabricator.wikimedia.org/T154940) [18:33:54] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active [18:34:07] !log aborting rolling restart on labsdb1010, labsdb1011 due to package bug to be fixed on 10.1.21-2 [18:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:01] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/5169/" [puppet] - 10https://gerrit.wikimedia.org/r/332954 (owner: 10Dzahn) [18:35:14] RECOVERY - Host betelgeuse is UP: PING OK - Packet loss = 0%, RTA = 36.45 ms [18:35:55] godog: ok will fix that thanks [18:36:04] RECOVERY - puppet last 
run on ms-be2018 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [18:37:43] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#2953719 (10yuvipanda) Update: Since I'll be travelling on the 25th, I'm going to push this out to early February instead. I'll ping @jcrespo when he's... [18:38:36] (03CR) 10Dzahn: [C: 04-1] "*meep* another issue when compiling puppet "Error: Duplicate declaration: Monitoring::Service[https] is already declared "" [puppet] - 10https://gerrit.wikimedia.org/r/332543 (https://phabricator.wikimedia.org/T154940) (owner: 10Dzahn) [18:38:54] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service] [18:40:11] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 12 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2953730 (10Jdforrester-WMF) [18:41:50] (03PS8) 10Dzahn: dumps: switch to Letsencrypt for TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/332543 (https://phabricator.wikimedia.org/T154940) [18:42:04] RECOVERY - Host ms-be2010 is UP: PING OK - Packet loss = 0%, RTA = 36.75 ms [18:42:55] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2953741 (10demon) >>! In T145885#2951958, @Marostegui wrote: > That will convert ALL tables - ie: tables with charset binary to u... [18:42:59] papaul: no worries! 
I think it warrants an incident report though, I'm mostly interested in having an action item to check for duplicate ips at review time [18:43:05] dns review time [18:43:38] jouncebot: next [18:43:38] In 0 hour(s) and 16 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170119T1900) [18:44:24] so that doesnt happen very often, but it has in the past [18:44:30] and it was much more painful when it happened then [18:44:53] i recall it being due to one of the hosts that had ip dupes has a more oddball host [18:44:56] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#2953763 (10jcrespo) +1, let's meet before to clarify impact. [18:45:03] like logging host or something... was years and years and years ago... [18:45:06] * brion hmms: has anybody ever gotten swatted during a swat deploy [18:45:24] every way we can have crashed our cluster has happened in the past decade =P [18:45:44] brion: like people getting swatted on twitch.tv because viewers called the cops on them? [18:45:55] precisely [18:46:00] brion: uh, ive been in a swat where the cluster crashed [18:46:04] so the cluster was swatted down. [18:46:06] that count? [18:46:10] "VIOLATION OF CODING STANDARDS! 
HANDS UP GET DOWN ON THE FLOOOOOR" [18:46:17] heh [18:46:18] (there are reasons we dont do apache config changes in puppet swat these days ;) [18:46:23] (03PS1) 10Chad: Group2 to wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333042 [18:46:50] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/5172/" [puppet] - 10https://gerrit.wikimedia.org/r/332543 (https://phabricator.wikimedia.org/T154940) (owner: 10Dzahn) [18:46:54] (03CR) 10Chad: [C: 04-2] "Prep for later" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333042 (owner: 10Chad) [18:46:58] robh: heh, granted for production it is easy to catch when it happens but preventing it should be easy too [18:47:01] (03PS9) 10Dzahn: dumps: switch to Letsencrypt for TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/332543 (https://phabricator.wikimedia.org/T154940) [18:47:42] even if its a script that just checks for duplicate A entries in the forward file [18:47:53] exactly [18:48:39] (03PS1) 10Papaul: DNS: fix duplicate IPs for mw2254-mw2260 Bug:https://phabricator.wikimedia.org/T155180 [dns] - 10https://gerrit.wikimedia.org/r/333043 [18:48:40] I have to run, ttyl [18:48:55] also see modules/mgmt/files/getmgmtips [18:48:57] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2953775 (10Paladox) >>! In T145885#2951958, @Marostegui wrote: > Hi, > > I have been doing some tests with the gerritdb to sum... [18:49:51] godog: https://gerrit.wikimedia.org/r/#/c/333043/1 [18:50:06] mutante: https://gerrit.wikimedia.org/r/#/c/333043/1 [18:50:41] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2953779 (10demon) >>! In T145885#2953775, @Paladox wrote: > Oh, sorry didn't see your reply until now, thanks for also testing th... 
[18:51:04] papaul: in a few minutes, just in the middle of switching a cert [18:51:04] RECOVERY - cassandra-b CQL 10.192.16.159:9042 on restbase-test2003 is OK: TCP OK - 3.037 second response time on 10.192.16.159 port 9042 [18:51:16] mutante: no problem [18:51:24] RECOVERY - cassandra-b SSL 10.192.16.159:7001 on restbase-test2003 is OK: SSL OK - Certificate restbase-test2003-b valid until 2017-09-08 16:33:20 +0000 (expires in 231 days) [18:51:48] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2953780 (10Paladox) Oh, we can ignore converting the tables that were binary as well the connection is already utf8 and we haven'... [18:51:52] (03CR) 10Chad: Rewrite wmf-beta-autoupdate as a scap3 plugin (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325875 (https://phabricator.wikimedia.org/T151519) (owner: 10Chad) [18:51:58] !log dataset1001 - temp disabling puppet, ms1001 - switching to Letsencrypt cert [18:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:24] robh: can you try the ipmi command again on ms-be2010? [18:53:01] robh: https://phabricator.wikimedia.org/T155690 [18:53:32] yeah [18:53:34] checking [18:53:53] damn [18:53:56] same internal system error [18:54:15] !log libvmod-header removed from carbon, varnish-modules provides it [18:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:52] 06Operations, 10ops-codfw: troubleshoot drac on ms-be2010.codfw.wmnet - https://phabricator.wikimedia.org/T155690#2953787 (10RobH) system host has same issue: robh@ms-be2010:~$ sudo ipmi-chassis --get-chassis-status ipmi_cmd_get_chassis_status: internal system error I'd recommend these next steps: * manuall... 
[18:54:53] papaul: updated https://phabricator.wikimedia.org/T155690 [18:55:06] i'd advise rebooting to bios, manually checkign the ipmi setting in the drac bios [18:55:09] (03PS1) 10Ema: varnish: stop ensuring libvmod-header is absent [puppet] - 10https://gerrit.wikimedia.org/r/333044 [18:55:16] and then flashing firmware since thats what dell support will tel you to do anyhow [18:55:20] avoid that step from them, heh [18:57:06] (03PS1) 10Dzahn: dumps: LE, remove breaking whitespace in subjects list [puppet] - 10https://gerrit.wikimedia.org/r/333045 [18:57:33] (03PS2) 10Dzahn: dumps: LE, remove breaking whitespace in subjects list [puppet] - 10https://gerrit.wikimedia.org/r/333045 [18:59:06] papaul: you'll notice that there are quite a few odd IPMI tasks for you [18:59:14] so hopefully figuring out one of them helps for the rest =] [18:59:26] volans found them trying to normalize the cluster [18:59:46] (so its not an emergency to fix them, but it is important for centralized fleet control) [18:59:56] (03PS2) 10Chad: docroots: Swap wikidata for wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/330709 [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170119T1900). Please do the needful. [19:00:04] kart_ and brion: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[19:00:12] \o/ [19:00:19] (03CR) 10Dzahn: [V: 032 C: 032] dumps: LE, remove breaking whitespace in subjects list [puppet] - 10https://gerrit.wikimedia.org/r/333045 (owner: 10Dzahn) [19:01:28] I can do swat [19:02:03] (03CR) 10Chad: [C: 032] Double $wgTranscodeBackgroundTimeLimit to compensate for threading [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333035 (https://phabricator.wikimedia.org/T155750) (owner: 10Brion VIBBER) [19:02:28] * kart_ is here. [19:02:54] ostriches: ^ [19:03:07] Awesome, doing brion's first then yours [19:03:11] robh, godog that's why I've opened T155761 right away ;) [19:03:11] T155761: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761 [19:04:07] 06Operations: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761#2953657 (10RobH) As discussed in IRC, there doesn't seem to be any reason to have duplicate A entries in the forward files. [19:04:44] (03Merged) 10jenkins-bot: Double $wgTranscodeBackgroundTimeLimit to compensate for threading [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333035 (https://phabricator.wikimedia.org/T155750) (owner: 10Brion VIBBER) [19:05:07] tx :D [19:06:16] !log demon@tin Synchronized wmf-config/CommonSettings.php: Double $wgTranscodeBackgroundTimeLimit to compensate for threading (duration: 00m 47s) [19:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:43] brion: I'm guessing you're looking at graphs and such to measure impact of ^? 
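The duplicate check discussed above (T155761, "a script that just checks for duplicate A entries in the forward file") could look roughly like the sketch below. It is only an illustration, not the Jenkins job that was eventually written: the function name, the regular expression, and the simplified zone syntax it accepts (one record per line, explicit owner name, IPv4 only, optional TTL and IN class) are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical pattern for simplified forward-zone lines such as
# "mw2254  1H  IN A  192.0.2.10". Assumes one record per line with an
# explicit owner name; continuation lines and $directives are ignored.
A_RECORD = re.compile(
    r"^(?P<name>\S+)\s+(?:\S+\s+)?(?:IN\s+)?A\s+"
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3})$"
)


def find_duplicate_a_records(zone_text):
    """Return {ip: [names...]} for every IPv4 address used by more than one A record."""
    names_by_ip = defaultdict(list)
    for raw_line in zone_text.splitlines():
        # Drop trailing comments so commented-out records are not counted.
        line = raw_line.split(";", 1)[0].rstrip()
        match = A_RECORD.match(line)
        if match:
            names_by_ip[match.group("ip")].append(match.group("name"))
    # Keep only the addresses that appear on two or more owner names.
    return {ip: names for ip, names in names_by_ip.items() if len(names) > 1}
```

A CI job could run this over each forward file and fail the build whenever the returned dictionary is non-empty, printing the offending names so the patch author can see which hosts share an address.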
[19:06:46] (otherwise, done) [19:07:03] (03PS2) 10Chad: ContentTranslation: Enable publishing article in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332798 (https://phabricator.wikimedia.org/T155641) (owner: 10KartikMistry) [19:07:04] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [19:07:26] kart_: Rebasing & merging yours now [19:07:52] ostriches: cool [19:09:32] (03CR) 10Chad: [C: 032] ContentTranslation: Enable publishing article in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332798 (https://phabricator.wikimedia.org/T155641) (owner: 10KartikMistry) [19:09:46] 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team: Gerrit: Schedule downtime to migrate db to utf8mb4 - https://phabricator.wikimedia.org/T155764#2953844 (10Paladox) [19:10:10] 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team: Gerrit: Schedule downtime to migrate db to utf8mb4 - https://phabricator.wikimedia.org/T155764#2953859 (10Paladox) [19:10:15] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2894688 (10Paladox) [19:10:31] enwiki-labsdb got stuck, will try to CPR [19:10:59] (03CR) 10jenkins-bot: Double $wgTranscodeBackgroundTimeLimit to compensate for threading [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333035 (https://phabricator.wikimedia.org/T155750) (owner: 10Brion VIBBER) [19:11:02] (03Merged) 10jenkins-bot: ContentTranslation: Enable publishing article in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332798 (https://phabricator.wikimedia.org/T155641) (owner: 10KartikMistry) [19:11:12] (03CR) 10jenkins-bot: ContentTranslation: Enable publishing article in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332798 (https://phabricator.wikimedia.org/T155641) (owner: 10KartikMistry) [19:12:29] !log demon@tin 
Synchronized wmf-config/InitialiseSettings.php: ContentTranslation: Enable publishing article in testwiki (1/2) (duration: 00m 39s) [19:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:12] 06Operations, 10Gerrit: Gerrit: Schedule downtime to migrate all users username to lowercase - https://phabricator.wikimedia.org/T155766#2953883 (10Paladox) [19:13:25] !log demon@tin Synchronized wmf-config/CommonSettings.php: ContentTranslation: Enable publishing article in testwiki (2/2) (duration: 00m 39s) [19:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:46] 06Operations, 10Gerrit: Gerrit: Schedule downtime to migrate all users username to lowercase - https://phabricator.wikimedia.org/T155766#2953898 (10Paladox) Im not sure if this is a #dba task? [19:14:43] (03Abandoned) 10Paladox: Update mysql-connector-java to 5.1.40 [debs/gerrit] - 10https://gerrit.wikimedia.org/r/331863 (owner: 10Paladox) [19:15:11] (03Abandoned) 10Paladox: Gerrit: Remove mysql-connection-java apt package [puppet] - 10https://gerrit.wikimedia.org/r/331864 (owner: 10Paladox) [19:15:22] kart_: You're live :) [19:15:29] cool. testing. [19:15:42] (03PS2) 10Andrew Bogott: Designate: Rename the nova_ldap sink handler to wmf_sink [puppet] - 10https://gerrit.wikimedia.org/r/332646 [19:17:13] (03PS7) 10Paladox: Gerrit: Enable config localUsernameToLowerCase [puppet] - 10https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) [19:17:21] (03PS8) 10Paladox: Gerrit: Enable config localUsernameToLowerCase [puppet] - 10https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) [19:17:26] (03CR) 10Andrew Bogott: [C: 032] Designate: Rename the nova_ldap sink handler to wmf_sink [puppet] - 10https://gerrit.wikimedia.org/r/332646 (owner: 10Andrew Bogott) [19:17:46] ostriches: working fine. thanks! 
[19:17:57] yw [19:18:40] 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team: Gerrit: Schedule downtime to migrate db to utf8mb4 - https://phabricator.wikimedia.org/T155764#2953926 (10Paladox) This patch https://gerrit.wikimedia.org/r/#/c/330455/ will need merging before taking gerrit offline. Will need to follow sql migrat... [19:18:41] And thus concludes another exciting adventure in swat. Tune in next time! [19:18:51] (03PS7) 10Paladox: Gerrit: Set useUnicode=true, also change connectionCollation to utf8mb4_unicode_ci [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) [19:19:00] (03PS8) 10Paladox: Gerrit: Set useUnicode=true, also change connectionCollation to utf8mb4_unicode_ci [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) [19:19:17] thanks ostriches :D [19:20:19] !log restarting db1069:3311 due to query being "stuck" on tokudb table [19:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:44] ACKNOWLEDGEMENT - puppet last run on dataset1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[acme-setup-acme-dumps] daniel_zahn LE [19:22:28] jouncebot refresh [19:22:31] I refreshed my knowledge about deployments. 
[19:27:09] (03PS1) 10Dzahn: dumps: LE, remove download.wm.org from subject list [puppet] - 10https://gerrit.wikimedia.org/r/333058 (https://phabricator.wikimedia.org/T154940) [19:27:43] (03CR) 10jerkins-bot: [V: 04-1] dumps: LE, remove download.wm.org from subject list [puppet] - 10https://gerrit.wikimedia.org/r/333058 (https://phabricator.wikimedia.org/T154940) (owner: 10Dzahn) [19:28:06] ostriches: thanks for the merges in wmf-make-branchesss :) [19:28:22] (03PS2) 10Dzahn: dumps: LE, remove download.wm.org from subject list [puppet] - 10https://gerrit.wikimedia.org/r/333058 (https://phabricator.wikimedia.org/T154940) [19:29:45] (03CR) 10Dzahn: [C: 032] dumps: LE, remove download.wm.org from subject list [puppet] - 10https://gerrit.wikimedia.org/r/333058 (https://phabricator.wikimedia.org/T154940) (owner: 10Dzahn) [19:30:30] (03CR) 10Chad: [C: 031] Get rid of mw-deployment-vars.sh [puppet] - 10https://gerrit.wikimedia.org/r/316928 (owner: 10Alex Monk) [19:30:50] addshore: you're welcome [19:32:42] 06Operations, 10ops-eqiad: rack and set up aqs100[7-9] - https://phabricator.wikimedia.org/T155654#2953958 (10Cmjohnson) p:05Normal>03High [19:32:58] (03PS1) 10Cmjohnson: Adding dns entries for aqs1007-9 T155654 [dns] - 10https://gerrit.wikimedia.org/r/333059 [19:37:18] (03PS1) 10Dzahn: dumps: switch cert/key to Letsencrypt, final step [puppet] - 10https://gerrit.wikimedia.org/r/333060 (https://phabricator.wikimedia.org/T154940) [19:37:38] 06Operations, 10Parsoid: Parsoid: fix logrotate - https://phabricator.wikimedia.org/T155768#2953979 (10Volans) [19:38:01] (03CR) 10jerkins-bot: [V: 04-1] dumps: switch cert/key to Letsencrypt, final step [puppet] - 10https://gerrit.wikimedia.org/r/333060 (https://phabricator.wikimedia.org/T154940) (owner: 10Dzahn) [19:38:10] (03PS2) 10Dzahn: dumps: switch cert/key to Letsencrypt, final step [puppet] - 10https://gerrit.wikimedia.org/r/333060 (https://phabricator.wikimedia.org/T154940) [19:39:29] (03CR) 10Dzahn: [C: 032] dumps: switch 
cert/key to Letsencrypt, final step [puppet] - 10https://gerrit.wikimedia.org/r/333060 (https://phabricator.wikimedia.org/T154940) (owner: 10Dzahn) [19:40:23] !log switching dumps.wikimedia.org to Letsencrypt SSL cert [19:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:58] apergos: ^ works for me :) [19:41:34] nice [19:41:44] I see the LetsEncrypt source listed in ff [19:41:48] thanks! [19:41:48] so the status is now: works on dataset1001 [19:41:54] puppet enabled , cert replaced [19:42:00] 06Operations, 10DBA, 10Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#2953992 (10matmarex) [19:42:02] but todo: fix puppet on ms1001 [19:42:08] where it's disabled right now [19:42:20] will follow-up soon [19:44:06] (03PS1) 10Ema: icinga: use RIPE Atlas API v2 in ripe atlas check [puppet] - 10https://gerrit.wikimedia.org/r/333064 [19:44:54] apergos: Random nit with dumps.wm.o: no favicon :p [19:45:05] (just noticed the 404 in my console when checking the new cert) [19:45:06] dumps needs favicon.ico :) [19:45:12] heh [19:45:15] feel free to add one :-P [19:45:47] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:50:45] (03PS1) 10Dereckson: aharashtra 'Edit Wikipedia…' workshops (BAMU) throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333077 (https://phabricator.wikimedia.org/T154312) [19:50:47] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [19:53:41] (03PS1) 10Chad: dumps: Add a favicon (used the generic Wikimedia mark one from meta) [puppet] - 10https://gerrit.wikimedia.org/r/333080 [19:53:49] ask and ye shall receive [19:55:01] ostriches: you're done for SWAT? 
[19:55:08] Swat finished awhile ago [19:55:11] k [19:55:14] But train starts in 5 mins [19:55:20] * Dereckson hurries up so. [19:55:28] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333077 (https://phabricator.wikimedia.org/T154312) (owner: 10Dereckson) [19:55:36] a throttle rule ^ [19:55:57] PROBLEM - Redis status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479 [19:56:57] RECOVERY - Redis status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3033819 keys, up 80 days 11 hours - replication_delay is 0 [19:57:07] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [19:57:49] (03Merged) 10jenkins-bot: aharashtra 'Edit Wikipedia…' workshops (BAMU) throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333077 (https://phabricator.wikimedia.org/T154312) (owner: 10Dereckson) [19:57:49] Zuul... [19:58:01] (03CR) 10jenkins-bot: aharashtra 'Edit Wikipedia…' workshops (BAMU) throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333077 (https://phabricator.wikimedia.org/T154312) (owner: 10Dereckson) [19:58:05] live on mwdebug1002 [19:58:07] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3033737 keys, up 80 days 11 hours - replication_delay is 0 [20:00:09] (03PS1) 10Dereckson: Fix for Maharashtra BAMU throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333083 (https://phabricator.wikimedia.org/T154312) [20:00:23] ostriches: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170119T2000). Please do the needful. 
[20:01:17] (03CR) 10Dereckson: [C: 032] Fix for Maharashtra BAMU throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333083 (https://phabricator.wikimedia.org/T154312) (owner: 10Dereckson) [20:02:36] 06Operations, 10ops-eqiad, 10netops: asw2-d-eqiad.mgmt.eqiad - JNX_ALARMS CRITICAL - 2 red alarms, - https://phabricator.wikimedia.org/T152182#2954062 (10faidon) 05Open>03Resolved Great, thanks :) [20:02:57] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:03:34] (03Merged) 10jenkins-bot: Fix for Maharashtra BAMU throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333083 (https://phabricator.wikimedia.org/T154312) (owner: 10Dereckson) [20:03:48] (03CR) 10jenkins-bot: Fix for Maharashtra BAMU throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333083 (https://phabricator.wikimedia.org/T154312) (owner: 10Dereckson) [20:04:04] live on mwdebug1002 [20:04:22] Are you done? [20:04:46] (03PS1) 10Milimetric: Test mediawiki-Dashiki on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333086 (https://phabricator.wikimedia.org/T125403) [20:05:20] ostriches: nearly, syncing [20:05:37] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2954071 (10Paladox) (i've created this patch against mysql connector, though we won't be using it, just putting it here for other... 
[20:05:53] !log dereckson@tin Synchronized wmf-config/throttle.php: Add throttle rule for BAMU event (T154312) (duration: 00m 39s) [20:05:56] ostriches: yes I'm done [20:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:58] T154312: Request for a temporary lift of account creation cap on IPs (2017-01-04,2017-01-06,2017-01-10,2017-01-12,2017-01-19,2017-01-20 ) - https://phabricator.wikimedia.org/T154312 [20:06:21] (03CR) 10Chad: [C: 032] Group2 to wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333042 (owner: 10Chad) [20:07:11] (03CR) 10Faidon Liambotis: [C: 032] icinga: use RIPE Atlas API v2 in ripe atlas check [puppet] - 10https://gerrit.wikimedia.org/r/333064 (owner: 10Ema) [20:07:19] (03PS2) 10Faidon Liambotis: icinga: use RIPE Atlas API v2 in ripe atlas check [puppet] - 10https://gerrit.wikimedia.org/r/333064 (owner: 10Ema) [20:07:24] (03CR) 10Faidon Liambotis: [V: 032 C: 032] icinga: use RIPE Atlas API v2 in ripe atlas check [puppet] - 10https://gerrit.wikimedia.org/r/333064 (owner: 10Ema) [20:07:51] (03CR) 10Alex Monk: "https://puppet-compiler.wmflabs.org/5170/" [puppet] - 10https://gerrit.wikimedia.org/r/322602 (https://phabricator.wikimedia.org/T1256) (owner: 10Alex Monk) [20:07:53] (03Merged) 10jenkins-bot: Group2 to wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333042 (owner: 10Chad) [20:09:21] (03CR) 10jenkins-bot: Group2 to wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333042 (owner: 10Chad) [20:09:44] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.8 [20:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:46] Dereckson: Notice: Undefined variable: maharashtraEventsWikis in /srv/mediawiki/wmf-config/throttle.php on line 51 [20:11:02] ostriches: you can safely ignore the Undefined variable: maharashtraEventsWikis in /srv/mediawiki/wmf-config/throttle.php on line 51, it only touched mwdebug1002 and 
didn't reached prod (it was for that the fix) [20:11:24] Ah, I see it now [20:11:29] Thx [20:13:25] (03CR) 10Chad: [C: 04-1] "This doesn't actually install the extension, just adds it to the localization cache (that's what ext-list is for) and some config. You sti" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333086 (https://phabricator.wikimedia.org/T125403) (owner: 10Milimetric) [20:15:08] (03CR) 10jerkins-bot: [V: 04-1] dumps: Add a favicon (used the generic Wikimedia mark one from meta) [puppet] - 10https://gerrit.wikimedia.org/r/333080 (owner: 10Chad) [20:16:16] (03PS2) 10Milimetric: Test mediawiki-Dashiki on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333086 (https://phabricator.wikimedia.org/T125403) [20:16:23] (03PS2) 10Chad: dumps: Add a favicon (used the generic Wikimedia mark one from meta) [puppet] - 10https://gerrit.wikimedia.org/r/333080 [20:18:22] (03CR) 10Chad: [C: 04-1] Test mediawiki-Dashiki on the beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333086 (https://phabricator.wikimedia.org/T125403) (owner: 10Milimetric) [20:18:59] (03PS1) 10Ema: icinga: critical on ripe atlas check exceptions [puppet] - 10https://gerrit.wikimedia.org/r/333093 [20:21:29] (03CR) 10Milimetric: Test mediawiki-Dashiki on the beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333086 (https://phabricator.wikimedia.org/T125403) (owner: 10Milimetric) [20:21:31] (03PS3) 10Milimetric: Test mediawiki-Dashiki on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333086 (https://phabricator.wikimedia.org/T125403) [20:26:44] !log upgrading node to v6 on wtp2002 T149331 [20:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:47] T149331: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331 [20:27:04] (03CR) 10Cmjohnson: [C: 032] Adding dns entries for aqs1007-9 T155654 [dns] - 
10https://gerrit.wikimedia.org/r/333059 (owner: 10Cmjohnson) [20:29:04] ema: <3 RIPE Atlas [20:29:52] :) [20:33:00] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [20:33:10] (03PS1) 10Andrew Bogott: wmf_sink: Remove an unused file [puppet] - 10https://gerrit.wikimedia.org/r/333097 [20:33:12] (03PS1) 10Andrew Bogott: wmf_sink: Remove all ldap handling [puppet] - 10https://gerrit.wikimedia.org/r/333098 (https://phabricator.wikimedia.org/T148781) [20:39:01] 06Operations, 10Parsoid: Parsoid: fix logrotate - https://phabricator.wikimedia.org/T155768#2954226 (10mobrovac) This is rather bizarre. Parsoid gets its logrotate definition from [`service::node`'s template](https://github.com/wikimedia/operations-puppet/blob/db794d0b941e46d5da1a7dc1ac0915a887ce0224/modules/s... [20:39:10] (03CR) 10Andrew Bogott: [C: 032] wmf_sink: Remove an unused file [puppet] - 10https://gerrit.wikimedia.org/r/333097 (owner: 10Andrew Bogott) [20:39:12] 06Operations, 10Parsoid, 15User-mobrovac: Parsoid: fix logrotate - https://phabricator.wikimedia.org/T155768#2954227 (10mobrovac) [20:41:40] PROBLEM - High load average on labstore1004 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [24.0] [20:42:06] ^ madhuvishy [20:42:16] huh [20:42:21] chasemp: looking [20:42:41] (03PS2) 10Andrew Bogott: wmf_sink: Remove all ldap handling [puppet] - 10https://gerrit.wikimedia.org/r/333098 (https://phabricator.wikimedia.org/T148781) [20:42:45] def a spike in load but comparatively it's not as bad as it seems [20:43:33] chasemp: yeah it's at ~30 [20:43:40] PROBLEM - High load average on labstore1004 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [24.0] [20:44:27] madhuvishy: one interesting thing is that load threshold was set on the hold system [20:44:32] s/hold/old [20:44:54] so while the bump is interesting, it's not the same afa outcome [20:44:57] !log depooling scb2004 for nodejs install 
[20:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:03] chasemp: right - this box should be able to handle more load [20:45:15] !log upgrading node to v6 on wtp2003 T149331 [20:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:18] T149331: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331 [20:46:18] !log scb2004 - upgrading nodejs, libuv1 [20:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:16] madhuvishy: I would still like to know who the culprit is [20:47:38] yeah looks like spike in activity from one of the nfs clients [20:47:40] looking [20:49:21] (03PS1) 10Dzahn: dumps: fix nginx config for acme-challenge [puppet] - 10https://gerrit.wikimedia.org/r/333103 [20:49:50] (03CR) 10Alex Monk: [C: 031] dumps: fix nginx config for acme-challenge [puppet] - 10https://gerrit.wikimedia.org/r/333103 (owner: 10Dzahn) [20:50:23] (03CR) 10jerkins-bot: [V: 04-1] dumps: fix nginx config for acme-challenge [puppet] - 10https://gerrit.wikimedia.org/r/333103 (owner: 10Dzahn) [20:53:39] (03PS2) 10Dzahn: dumps: fix nginx config for acme-challenge [puppet] - 10https://gerrit.wikimedia.org/r/333103 (https://phabricator.wikimedia.org/T154940) [20:53:47] (03PS3) 10Dzahn: dumps: fix nginx config for acme-challenge [puppet] - 10https://gerrit.wikimedia.org/r/333103 (https://phabricator.wikimedia.org/T154940) [21:03:10] !log upgrading node to v6 on wtp1002 T149331 [21:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:15] T149331: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331 [21:11:01] (03CR) 10Dzahn: [C: 032] dumps: fix nginx config for acme-challenge [puppet] - 10https://gerrit.wikimedia.org/r/333103 (https://phabricator.wikimedia.org/T154940) (owner: 10Dzahn) [21:17:11] !log force non tools on NFS to go ro [21:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[21:21:21] chasemp: madhuvishy am here (checked in before sleeping) if needed [21:21:21] tools-exec-1404.tools.eqiad.wmflabs [21:21:21] i think [21:21:21] madhuvishy: what makes you think that one in particular? [21:21:22] it was dominating iftop for a bit [21:21:22] right now not sure [21:21:22] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [21:21:36] what [21:21:57] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.250 second response time [21:22:04] (03PS1) 10Dzahn: dumps: skip Letsencrypt cert creation on ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/333117 (https://phabricator.wikimedia.org/T154940) [21:22:42] !log restart nfs-kernel-server to see if it clears out some queue [21:22:57] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.005 second response time [21:23:15] (03CR) 10Dzahn: [C: 032] dumps: skip Letsencrypt cert creation on ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/333117 (https://phabricator.wikimedia.org/T154940) (owner: 10Dzahn) [21:23:21] (03PS2) 10Dzahn: dumps: skip Letsencrypt cert creation on ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/333117 (https://phabricator.wikimedia.org/T154940) [21:23:27] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.027 second response time [21:23:38] chasemp: bunch of tools instances pretty high on bandwidth [21:23:55] madhuvishy: specific ones? [21:24:03] tools-exec-1404 [21:24:16] tools-exec-gift.tools.eqiad.wmflabs. 
[21:24:49] bastion-03 seems to be ro for home dir atm possibly transient from restart [21:24:58] on 1404 2171 51279 20 0 91792 39896 4668 S 97.0 0.5 1:02.37 python2.7 [21:25:07] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [21:25:15] few processes hammering cpu at 90+% [21:25:21] on exec-gift? [21:25:26] chasemp: bastion-03 is fine for me (in /home) [21:25:30] on 1404 [21:25:54] madhuvishy: reboot that would you just because [21:26:17] chasemp: depool first? [21:26:27] I would just do it atm [21:26:47] yeah, jdi [21:27:00] done [21:27:41] I don't actually see much activiti from the gift node [21:27:44] (am on there now) [21:28:35] madhuvishy: so atm I'm wondering what happens when we fail over [21:28:48] becaues I'm curious to see if external factors migrate as well [21:29:25] chasemp: it rebooted and came back right on top of iftop [21:29:41] madhuvishy: which? [21:29:41] tools-exec-1415 is high on list too [21:30:05] 1404 [21:30:07] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [21:30:09] some process called ebook-convert [21:30:56] ^^ barium. 
looking [21:30:57] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.341 second response time [21:31:53] 06Operations, 10Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2954436 (10Dzahn) [21:31:55] 06Operations, 10Traffic, 13Patch-For-Review: convert dumps to use Letsencrypt for SSL cert (deadline: 2017-04-26) - https://phabricator.wikimedia.org/T154940#2954434 (10Dzahn) 05Open>03Resolved {F5314313} [21:31:57] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.118 second response time [21:32:23] madhuvishy: andrewbogott yuvipanda I thikn let's try fail over, as something here is just in a bad state [21:32:27] 06Operations, 10Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2240497 (10Dzahn) [21:32:30] or there is an external factor we are not finding [21:32:42] chasemp: ok! I'm standing by [21:32:54] (not sure of anything about new setup yet) [21:35:37] i'm not sure but it just seems like with labs added - the load is really high [21:35:37] sorry, this is happening in too many channels at once… what are we failing over, specifically? [21:35:38] we didn't adjust the per-client limits right? 
so shouldn't be higher than the old box, which had far less hardware [21:35:38] chasemp: i'm not sure why failover [21:35:38] yuvipanda: yeah also load's going up non stop [21:35:38] went up from 24 to 60's in the last 30 minutes [21:35:38] RECOVERY - check_puppetrun on barium is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [21:35:38] !log mobrovac@tin Starting deploy [trending-edits/deploy@0abcf25]: Canary deploy for switching to node 6 T149331 [21:35:38] !log ppchelko@tin Starting deploy [changeprop/deploy@ffd0b8b]: Canary deploy for switching to node 6 T149331 [21:35:40] !log mobrovac@tin Finished deploy [trending-edits/deploy@0abcf25]: Canary deploy for switching to node 6 T149331 (duration: 00m 32s) [21:35:40] !log ppchelko@tin Finished deploy [changeprop/deploy@ffd0b8b]: Canary deploy for switching to node 6 T149331 (duration: 00m 32s) [21:35:47] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [21:35:57] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 51.759 second response time [21:36:03] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 306 bytes in 15.864 second response time [21:36:03] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 272 bytes in 16.258 second response time [21:36:03] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on 
http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 7.673 second response time [21:36:07] PROBLEM - DRBD role on labstore1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:36:07] PROBLEM - puppet last run on labstore1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:36:07] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 268 bytes in 1.367 second response time [21:36:20] !log ppchelko@tin Starting deploy [citoid/deploy@95df861]: Canary deploy for switching to node 6 T149331 [21:36:22] PROBLEM - DRBD role on labstore1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:36:24] is bad [21:36:26] chasemp: are you failing over? [21:36:35] which would explain^ [21:36:37] !log mobrovac@tin Starting deploy [graphoid/deploy@f872f94]: Canary deploy for switching to node 6 T149331 [21:36:40] madhuvishy: yeah, I think we've no option but to failover since I can't ssh into labstore1004 [21:36:50] you can't? [21:36:52] yuvipanda: it was rebooted [21:36:52] !log ppchelko@tin Finished deploy [citoid/deploy@95df861]: Canary deploy for switching to node 6 T149331 (duration: 00m 33s) [21:36:53] yuvipanda: it's been rebooted [21:36:57] PROBLEM - Host labstore1004 is DOWN: PING CRITICAL - Packet loss = 100% [21:36:58] !log mobrovac@tin Finished deploy [graphoid/deploy@f872f94]: Canary deploy for switching to node 6 T149331 (duration: 00m 21s) [21:37:06] ema: oh interesting. 
[21:37:08] !log ppchelko@tin Starting deploy [cxserver/deploy@ff0225e]: Canary deploy for switching to node 6 T149331 [21:37:18] !log mobrovac@tin Starting deploy [mathoid/deploy@ba3217e]: Canary deploy for switching to node 6 T149331 [21:37:23] ok that's too many channels [21:37:24] right - drbd primary is in hiera so then that alert is okay [21:37:35] !log ppchelko@tin Finished deploy [cxserver/deploy@ff0225e]: Canary deploy for switching to node 6 T149331 (duration: 00m 27s) [21:37:47] RECOVERY - Host labstore1004 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [21:37:48] !log ppchelko@tin Starting deploy [eventstreams/deploy@fe77f19]: Canary deploy for switching to node 6 T149331 [21:37:55] !log mobrovac@tin Finished deploy [mathoid/deploy@ba3217e]: Canary deploy for switching to node 6 T149331 (duration: 00m 37s) [21:38:07] !log ppchelko@tin Finished deploy [eventstreams/deploy@fe77f19]: Canary deploy for switching to node 6 T149331 (duration: 00m 18s) [21:38:09] !log mobrovac@tin Starting deploy [mobileapps/deploy@cacb3c9]: Canary deploy for switching to node 6 T149331 [21:38:45] papaul: sorry for delay, done now. 
IPs fixed [21:38:46] !log mobrovac@tin Finished deploy [mobileapps/deploy@cacb3c9]: Canary deploy for switching to node 6 T149331 (duration: 00m 37s) [21:39:07] !log mobrovac@tin Starting deploy [electron-render/deploy@f1df2d3]: Canary deploy for switching to node 6 T149331 [21:39:28] !log mobrovac@tin Finished deploy [electron-render/deploy@f1df2d3]: Canary deploy for switching to node 6 T149331 (duration: 00m 21s) [21:39:57] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is inactive [21:40:37] PROBLEM - drbd service on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit drbd is inactive [21:44:37] RECOVERY - High load average on labstore1004 is OK: OK: Less than 50.00% above the threshold [16.0] [21:45:32] mobrovac, you still working on that? [21:47:17] Krenair: on what? [21:47:17] mutante: thanks [21:47:19] mobrovac, whatever it was you guys are doing... lots of things going on here. 
[21:47:19] (03PS1) 10Jdlrobson: Wikidata description taglines shown on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333124 (https://phabricator.wikimedia.org/T152743) [21:47:27] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.038 second response time [21:47:32] RECOVERY - NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 36.392 second response time [21:47:32] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 36.396 second response time [21:47:32] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 27.829 second response time [21:47:37] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 19.578 second response time [21:47:37] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 20.923 second response time [21:47:41] labs NFS and the service deployments it looks like [21:47:56] Krenair: oh that :) yes, migration to node 6 on scb ongoing, waiting on mutante to continue the process [21:48:02] cool [21:48:07] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.494 second response time [21:49:18] just trying to defer a large MW rename request until it's quieter in here [21:50:57] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: 
HTTP/1.1 200 OK - 166 bytes in 0.017 second response time [21:50:58] Krenair: It's never quiet around these parts :p [21:51:02] (you must be new here) [21:51:05] ostriches, relatively :) [21:51:12] <_joe_> Krenair: yes still ongoing [21:52:47] RECOVERY - drbd service on labstore1004 is OK: OK - drbd is active [21:54:07] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.219 second response time [21:55:07] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.337 second response time [22:01:17] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:05:08] (03PS1) 10Madhuvishy: nfs: Make labstore1005 primary post failover [puppet] - 10https://gerrit.wikimedia.org/r/333131 [22:05:58] ^ yuvipanda should we restart tools-checker or something? [22:06:07] PROBLEM - puppet last run on mw1293 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:06:11] not sure if all recovered seems to be big lag in checks [22:06:40] chasemp: looking [22:07:29] chasemp that's probably all the servers reconnecting to everything [22:09:41] chasemp: ok, only precise check seems to say it's still failing, checking if that's real or flake [22:09:44] precise job submission [22:10:14] (03PS3) 10Krinkle: docroots: Swap wikidata for wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/330709 (owner: 10Chad) [22:10:20] (03CR) 10Krinkle: docroots: Swap wikidata for wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/330709 (owner: 10Chad) [22:10:23] (03CR) 10Krinkle: [C: 031] docroots: Swap wikidata for wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/330709 (owner: 10Chad) [22:10:38] (03CR) 10Rush: [C: 031] nfs: Make labstore1005 primary post failover [puppet] - 10https://gerrit.wikimedia.org/r/333131 (owner: 10Madhuvishy) [22:11:00] (03CR) 10Madhuvishy: [V: 032 C: 032] nfs: Make labstore1005 primary post failover [puppet] - 10https://gerrit.wikimedia.org/r/333131 (owner: 10Madhuvishy) [22:11:53] yuvipanda: thanks [22:12:08] it might be actually failing, so am taking a look [22:12:25] chasemp: okay if i enable puppet on 1004? [22:12:46] madhuvishy: I think yes but I want to touch nothing at all for about an hour if possible [22:13:37] chasemp: okay [22:13:48] chasemp: ok, it's the checker being weird, since the thing it is checking I can perform successfully. so not an outage situation. I'll investigate checker now [22:13:49] i will get lunch then [22:14:01] yuvipanda: it's kinda late [22:14:05] go sleep! 
i can look [22:14:27] damn, it's almost 4 [22:14:28] yuvipanda: ok thank you for making sure it was a false negative [22:14:33] and for staying up :) [22:15:57] PROBLEM - Host labstore1004 is DOWN: PING CRITICAL - Packet loss = 100% [22:16:11] um [22:16:27] RECOVERY - Host labstore1004 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [22:16:29] it's not tho [22:16:31] is that supposed to happen? chasemp madhuvishy [22:16:41] did that just go down by itself? [22:16:42] up for 39 minutes [22:16:53] uhhh [22:16:55] no that has to be icinga lagging [22:16:57] or something [22:17:01] idk [22:17:13] it's up and ok to me [22:17:13] yeah [22:17:20] lol [22:17:28] doesn't look like it rebooted [22:17:37] madhuvishy: chasemp ok, am about to crash then. [22:17:54] yuvipanda: please do, that way you're slightly rested if we have to wake you up later :D [22:17:59] going to leave the toolschecker to you. also mysql credential creation seems to be running on 1005 [22:18:03] but really later on, thanks [22:18:06] haven't checked if it is running fine [22:18:10] someone should do that too [22:18:10] gotcha [22:19:15] seems to be running [22:19:15] root 139077 0.0 0.0 102216 42776 ? SNs 2016 2:45 /usr/bin/python3 /usr/local/sbin/maintain-dbusers maintain [22:20:04] madhuvishy: it's set to basically only do real work if it is on the node w/ the cluster IP iirc [22:20:13] so in theory that has to all check out and work fyi [22:23:10] yup [22:23:41] i'm going to grab something to eat and then look at toolschecker [22:23:49] i'll respond if anything [22:24:17] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:27:47] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:30:07] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [22:33:17] PROBLEM - DRBD Cluster IP assignment on labstore1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:33:38] !log scb2004 - re-pooled [22:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:07] RECOVERY - puppet last run on mw1293 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [22:35:57] !log scb2003 - depool, upgrade nodejs, libuv1 packages [22:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:04] (03Abandoned) 10Krinkle: contint: Re-add dir.php to doc.wm.org DirectoryIndex [puppet] - 10https://gerrit.wikimedia.org/r/331558 (https://phabricator.wikimedia.org/T150727) (owner: 10Krinkle) [22:38:28] !log scb2003 - repool, scb2001,scb2002 - upgrade nodejs, libuv1 packages [22:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:54] (03PS2) 10Addshore: Enable InterwikiSorting on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332917 [22:41:28] !log scb1001-1004 - upgraded nodejs version [22:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:44] thcipriani: I have a question :) are you around? [22:42:53] mobrovac, how's it going? [22:42:57] addshore: I am [22:43:35] Krenair: finishing up now, you should see a stream of logs now [22:43:43] cool [22:43:44] chasemp, is it safe to say the labs nfs thing is resolved? [22:43:55] When I come to deploying https://gerrit.wikimedia.org/r/#/c/332917/2 which touches CommonSettings & InitialiseSettings even though the change is aimed at beta, I should still sync them to the cluster right? and how about the -labs file and the extension-list-labs? do they need a sync to the prod cluster? 
or ignore them as jenkins will sync them to beta [22:43:55] (where they are actually needed) ? [22:43:59] !log ppchelko@tin Starting deploy [changeprop/deploy@ffd0b8b]: Canary deploy for switching to node 6 T149331 [22:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:03] T149331: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331 [22:44:07] !log mobrovac@tin Starting deploy [graphoid/deploy@f872f94]: Switching to node 6 T149331 [22:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:41] addshore, we have a practice of syncing those files even in prod [22:44:58] ^ [22:45:08] !log mobrovac@tin Starting deploy [mathoid/deploy@ba3217e]: (no message) [22:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:13] Krenair: awesome, I expected as much, but couldn't see it written down anywhere specifically! :) [22:45:14] keep all the appservers in sync with what is deployed on tin [22:45:15] live files out of sync with deployment repo is... ugh. even if it's only labs files sitting unused in prod [22:45:18] !log mobrovac@tin Finished deploy [graphoid/deploy@f872f94]: Switching to node 6 T149331 (duration: 01m 10s) [22:45:20] !log mobrovac@tin Starting deploy [graphoid/deploy@f872f94]: Switching to node 6 T149331 [22:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:55] yeah [22:45:58] Krenair: I'm still waiting for the other shoe to drop but yes, a controlled failover off of a node in a downward spiral seems to have been the correct call [22:46:00] it's been discussed a bunch of times [22:46:51] chasemp: hm. 
I think I'm missing some context :) [22:47:03] !log mobrovac@tin Finished deploy [graphoid/deploy@f872f94]: Switching to node 6 T149331 (duration: 01m 43s) [22:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:07] PROBLEM - graphoid endpoints health on scb1003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [22:47:39] mobrovac, ^ [22:47:47] PROBLEM - changeprop endpoints health on scb1003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.153, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [22:47:47] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [22:47:54] me ^ [22:47:57] PROBLEM - changeprop endpoints health on scb1004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.29, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [22:47:58] :) [22:48:24] (03PS2) 10Krinkle: Configure RCFeeds to use EventBus extension to send recentchange events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332807 (https://phabricator.wikimedia.org/T152030) (owner: 10Ottomata) [22:48:32] !log mobrovac@tin Finished deploy [mathoid/deploy@ba3217e]: (no message) (duration: 03m 24s) [22:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:18] PROBLEM - mathoid endpoints health on scb1003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.153, port=10042): Max retries exceeded with url: /?spec (Caused by 
ProtocolError(Connection aborted., error(111, Connection refused))) [22:49:47] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [22:50:17] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [22:51:17] RECOVERY - mathoid endpoints health on scb1003 is OK: All endpoints are healthy [22:51:18] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [22:51:35] !log ppchelko@tin Finished deploy [changeprop/deploy@ffd0b8b]: Canary deploy for switching to node 6 T149331 (duration: 07m 36s) [22:51:38] !log scb1003,scb1004 - upgrade nodejs [22:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:39] T149331: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331 [22:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:46] !log ppchelko@tin Starting deploy [changeprop/deploy@ffd0b8b]: Deploy for switching to node 6 T149331 [22:51:46] !log mobrovac@tin Starting deploy [graphoid/deploy@f872f94]: Switching to node 6 T149331 [22:51:47] RECOVERY - changeprop endpoints health on scb1003 is OK: All endpoints are healthy [22:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:57] RECOVERY - changeprop endpoints health on scb1004 is OK: All endpoints are healthy [22:52:15] !log mobrovac@tin Starting deploy [mathoid/deploy@ba3217e]: (no message) [22:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:45] !log ppchelko@tin Finished deploy [changeprop/deploy@ffd0b8b]: Deploy for switching to node 6 T149331 (duration: 00m 58s) [22:52:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:07] RECOVERY - graphoid endpoints health on scb1003 is OK: All endpoints are healthy [22:53:09] sorry, i should have done scb1003/1004 earlier too [22:53:14] !log ppchelko@tin Starting deploy [citoid/deploy@95df861]: Deploy for switching to node 6 T149331 [22:53:14] re: recoveries [22:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:17] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [22:53:26] !log mobrovac@tin Finished deploy [graphoid/deploy@f872f94]: Switching to node 6 T149331 (duration: 01m 39s) [22:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:40] !log mobrovac@tin Starting deploy [mobileapps/deploy@cacb3c9]: Switching to node 6 T149331 [22:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:18] !log mobrovac@tin Finished deploy [mathoid/deploy@ba3217e]: (no message) (duration: 02m 02s) [22:54:18] (03PS2) 10Dzahn: nagios: fix check_ssl_http_on_port misnomer [puppet] - 10https://gerrit.wikimedia.org/r/333010 (owner: 10Faidon Liambotis) [22:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:53] !log mobrovac@tin Starting deploy [electron-render/deploy@f1df2d3]: Switching to node 6 T149331 [22:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:32] !log ppchelko@tin Finished deploy [citoid/deploy@95df861]: Deploy for switching to node 6 T149331 (duration: 02m 18s) [22:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:49] !log ppchelko@tin Starting deploy [cxserver/deploy@ff0225e]: Deploy for switching to node 6 T149331 [22:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:22] !log mobrovac@tin Finished deploy [mobileapps/deploy@cacb3c9]: Switching to node 6 
T149331 (duration: 02m 41s) [22:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:31] !log mobrovac@tin Starting deploy [trending-edits/deploy@0abcf25]: Switching to node 6 T149331 [22:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:47] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [22:56:52] !log mobrovac@tin Finished deploy [electron-render/deploy@f1df2d3]: Switching to node 6 T149331 (duration: 01m 58s) [22:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:55] T149331: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331 [22:58:02] !log ppchelko@tin Finished deploy [cxserver/deploy@ff0225e]: Deploy for switching to node 6 T149331 (duration: 02m 12s) [22:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:21] !log ppchelko@tin Starting deploy [eventstreams/deploy@fe77f19]: Deploy for switching to node 6 T149331 [22:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:31] !log mobrovac@tin Finished deploy [trending-edits/deploy@0abcf25]: Switching to node 6 T149331 (duration: 01m 59s) [22:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:52] !log ppchelko@tin Finished deploy [eventstreams/deploy@fe77f19]: Deploy for switching to node 6 T149331 (duration: 01m 30s) [22:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:03] Krenair: {{done}} [23:00:20] thanks [23:08:50] !log mobrovac@tin Starting deploy [graphoid/deploy@da37386]: Bump preq to 0.5.2 for Node v6 [23:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:11] !log mobrovac@tin Finished deploy [graphoid/deploy@da37386]: Bump preq to 0.5.2 for Node v6 (duration: 02m 21s) [23:11:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:10] !log mobrovac@tin Starting deploy [cxserver/deploy@5ae4f8b]: Bump preq to 0.5.2 for Node v6 [23:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:33] (03PS4) 10Zppix: docroots: Swap wikidata for wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/330709 (owner: 10Chad) [23:15:17] PROBLEM - DRBD Cluster IP assignment on labstore1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:16:06] !log mobrovac@tin Finished deploy [cxserver/deploy@5ae4f8b]: Bump preq to 0.5.2 for Node v6 (duration: 01m 56s) [23:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:42] (03CR) 10Dzahn: [C: 032] nagios: fix check_ssl_http_on_port misnomer [puppet] - 10https://gerrit.wikimedia.org/r/333010 (owner: 10Faidon Liambotis) [23:24:09] !log mobrovac@tin Starting deploy [eventstreams/deploy@0d1d9c6]: Bump preq to 0.5.2 for Node v6 [23:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:20] !log mobrovac@tin Finished deploy [eventstreams/deploy@0d1d9c6]: Bump preq to 0.5.2 for Node v6 (duration: 02m 11s) [23:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:32] !log icinga - replace check command names in puppet_services.cfg for change 333010 [23:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:38] !log force puppet run on restbase2* [23:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:01] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.442 second response time [23:32:18] !log upgrading node to v6 on wtp1003 T149331 [23:32:21] PROBLEM - Check correctness of the icinga configuration on einsteinium is 
CRITICAL: Icinga configuration contains errors [23:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:22] T149331: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331 [23:32:43] looks like db2034 and db2042 aren't so happy about this rename [23:32:52] mutante: ^^ [23:32:58] ACKNOWLEDGEMENT - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors daniel_zahn gerrit 333010 - in progress [23:32:59] they appear to be the only two based on dbtree [23:33:01] PROBLEM - DRBD role on labstore1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:33:01] PROBLEM - puppet last run on labstore1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:33:01] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.023 second response time [23:33:20] volans: thanks, yes, i got the icinga thing [23:33:21] PROBLEM - DRBD Cluster IP assignment on labstore1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:33:31] they appear to be the only two, and both are in codfw [23:33:42] madhuvishy: labstore1004 seems unresponsive? 
[23:34:12] chasemp: looking [23:34:24] huh, could be me but icinga keeps flagging it too [23:34:33] !log force puppet run on restbase1* [23:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:29] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, 10Elasticsearch: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#2954751 (10RobH) [23:35:32] chasemp: it might be confusion because 1005 puppet run would have made it primary for puppet [23:35:56] 06Operations, 06Discovery, 10Elasticsearch, 10hardware-requests, 06Discovery-Search (Current work): elasticsearch new servers (5x eqiad / 12x codfw) - https://phabricator.wikimedia.org/T149089#2954767 (10RobH) 05stalled>03Resolved All systems for this have been ordered. The codfw systems are in plac... [23:35:58] we can run puppet on 1004 to verify [23:36:20] ah ok, right disabled there still [23:36:28] yup [23:36:29] I will run puppet there [23:36:32] okay [23:36:44] actually tendril shows them as OK, dbtree is stuck with db2034 at lag 72 and db2042 at lag 76 [23:36:51] RECOVERY - puppet last run on labstore1004 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [23:37:11] RECOVERY - DRBD Cluster IP assignment on labstore1004 is OK: Cluster IP assignment OK [23:37:15] yeah [23:37:19] nice [23:37:27] nothing shocking in output [23:37:35] volans: that change only works after puppet ran on icinga server and also all the monitored hosts, but that is all, it will be back in a moment [23:37:52] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active [23:37:52] RECOVERY - DRBD role on labstore1004 is OK: DRBD role OK [23:38:23] by "back" i also just mean that restarting it works, it's running fine [23:38:41] ok [23:52:02] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: Migrate misc to secondary labstore HA cluster - 
https://phabricator.wikimedia.org/T154336#2907785 (10Ocaasi_WMF) Having trouble with Wikipedia Library Card platform suddenly. It wasn't on the planned list, but we're getting server errors: Failure: htt... [23:57:38] * James_F waves in readiness. [23:58:44] early [23:58:49] That's me.