[00:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180306T0000).
[00:00:04] tgr: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:03:38] (PS3) Andrew Bogott: wikitech: use files from swift rather than local uploads. [mediawiki-config] - https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915)
[00:03:40] (PS5) Andrew Bogott: multiversion: add a transitional mapping for newwikitech.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/415914 (https://phabricator.wikimedia.org/T168470)
[00:04:09] (CR) Andrew Bogott: [C: -1] "We might be able to work around this" [puppet] - https://gerrit.wikimedia.org/r/416598 (https://phabricator.wikimedia.org/T188915) (owner: Andrew Bogott)
[00:04:52] (CR) jerkins-bot: [V: -1] wikitech: use files from swift rather than local uploads. [mediawiki-config] - https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915) (owner: Andrew Bogott)
[00:07:58] (PS4) Andrew Bogott: wikitech: use files from swift rather than local uploads. [mediawiki-config] - https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915)
[00:08:00] (PS6) Andrew Bogott: multiversion: add a transitional mapping for newwikitech.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/415914 (https://phabricator.wikimedia.org/T168470)
[00:16:42] I can do a self-service SWAT, I have a deploy window afterwards anyway
[00:33:08] tgr: I'd like to roll out beta-only https://gerrit.wikimedia.org/r/#/c/416500/ and https://gerrit.wikimedia.org/r/#/c/416584/ today. I can wait though, no rush. What's your current status?
[00:34:04] Krinkle: got distracted, was just about to start
[00:34:13] k, go ahead :)
[00:34:20] feel free to go ahead, I have a bigger deployment to make afterwards anyway
[00:34:41] OK
[00:34:43] Thanks :)
[00:34:51] (CR) Krinkle: [C: 2] beta: Remove redundant CentralNotice overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/416500 (owner: Krinkle)
[00:34:53] (CR) Krinkle: [C: 2] beta: Remove wgCentralBannerRecorder override for old special page [mediawiki-config] - https://gerrit.wikimedia.org/r/416584 (owner: Krinkle)
[00:36:05] (Merged) jenkins-bot: beta: Remove redundant CentralNotice overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/416500 (owner: Krinkle)
[00:36:50] (Merged) jenkins-bot: beta: Remove wgCentralBannerRecorder override for old special page [mediawiki-config] - https://gerrit.wikimedia.org/r/416584 (owner: Krinkle)
[00:37:17] (CR) Dzahn: icinga: script to send custom SMS to Icinga contacts (5 comments) [puppet] - https://gerrit.wikimedia.org/r/400615 (https://phabricator.wikimedia.org/T82937) (owner: Dzahn)
[00:37:40] (PS9) Dzahn: icinga: script to send custom SMS to Icinga contacts [puppet] - https://gerrit.wikimedia.org/r/400615 (https://phabricator.wikimedia.org/T82937)
[00:37:58] (CR) jenkins-bot: beta: Remove redundant CentralNotice overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/416500 (owner: Krinkle)
[00:38:15] (CR) jerkins-bot: [V: -1] icinga: script to send custom SMS to Icinga contacts [puppet] - https://gerrit.wikimedia.org/r/400615 (https://phabricator.wikimedia.org/T82937) (owner: Dzahn)
[00:38:32] !log krinkle@tin Synchronized wmf-config/CommonSettings-labs.php: beta-only: I02a4d41f2386c5e (duration: 00m 57s)
[00:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:29] !log krinkle@tin Synchronized wmf-config/CommonSettings.php: no-op I33f09b164e7 (duration: 00m 58s)
[00:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:50] tgr: done :)
[00:48:03] thanks
[00:48:55] (PS2) Gergő Tisza: Enable loginOnly mode for local auth provider on group 0 [mediawiki-config] - https://gerrit.wikimedia.org/r/416331 (https://phabricator.wikimedia.org/T57420)
[00:50:42] (CR) Gergő Tisza: "If someone got unSULed somehow, loginOnly will prevent them from changing passwords but won't lock them out, so there isn't much risk in e" [mediawiki-config] - https://gerrit.wikimedia.org/r/416331 (https://phabricator.wikimedia.org/T57420) (owner: Gergő Tisza)
[00:50:48] (CR) Gergő Tisza: [C: 2] Enable loginOnly mode for local auth provider on group 0 [mediawiki-config] - https://gerrit.wikimedia.org/r/416331 (https://phabricator.wikimedia.org/T57420) (owner: Gergő Tisza)
[00:52:00] (Merged) jenkins-bot: Enable loginOnly mode for local auth provider on group 0 [mediawiki-config] - https://gerrit.wikimedia.org/r/416331 (https://phabricator.wikimedia.org/T57420) (owner: Gergő Tisza)
[00:53:36] (CR) Legoktm: wmcs: Notify legoktm for codesearch alerts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/415178 (owner: Legoktm)
[00:54:16] !log tgr@tin Synchronized wmf-config: T57420 Enable loginOnly mode for local auth provider on group 0 (duration: 01m 00s)
[00:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:31] T57420: Remove local wiki password hash when CentralAuth has attached account - https://phabricator.wikimedia.org/T57420
[01:00:04] tgr: (Dis)respected human, time to deploy Reading Infrastructure (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180306T0100). Please do the needful.
[01:00:04] No GERRIT patches in the queue for this window AFAICS.
[01:00:14] !log tgr@tin Synchronized wmf-config/InitialiseSettings.php: refresh wmf-config/InitialiseSettings, seems to have stuck in old state on some servers after doing the initial sync in the wrong order (duration: 00m 57s)
[01:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:53] (PS1) AndyRussG: Remove obsoloete $wgCentralPagePath CentralNotice global [mediawiki-config] - https://gerrit.wikimedia.org/r/416618
[01:03:16] (PS2) AndyRussG: Remove obsoloete $wgCentralPagePath CentralNotice global [mediawiki-config] - https://gerrit.wikimedia.org/r/416618
[01:05:35] (PS5) Andrew Bogott: wikitech: use files from swift rather than local uploads. [mediawiki-config] - https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915)
[01:05:37] (PS7) Andrew Bogott: multiversion: add a transitional mapping for newwikitech.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/415914 (https://phabricator.wikimedia.org/T168470)
[01:08:19] (PS3) Krinkle: Remove obsolete $wgCentralPagePath CentralNotice global [mediawiki-config] - https://gerrit.wikimedia.org/r/416618 (owner: AndyRussG)
[01:09:20] Krinkle: oooops 8p
[01:10:36] (CR) Krinkle: [C: 1] Remove obsolete $wgCentralPagePath CentralNotice global (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/416618 (owner: AndyRussG)
[01:10:56] AndyRussG: LGTM. Should be fine to SWAT some time.
[01:14:14] Krinkle: cool thx!
[01:19:01] (CR) Krinkle: [C: -1] "According to that diff, [Installer] is removed from File[/lib/systemd/system/coal.service]. That doesn't seem right?" [puppet] - https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: Imarlier)
[01:28:00] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[01:28:00] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Zotero alive) timed out before a response was received
[01:28:10] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[01:29:28] (PS1) Andrew Bogott: labweb wikitech: update vhost [puppet] - https://gerrit.wikimedia.org/r/416625 (https://phabricator.wikimedia.org/T188915)
[01:29:50] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[01:29:51] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[01:30:10] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[01:30:49] (CR) Andrew Bogott: [C: 2] labweb wikitech: update vhost [puppet] - https://gerrit.wikimedia.org/r/416625 (https://phabricator.wikimedia.org/T188915) (owner: Andrew Bogott)
[01:39:02] (PS6) Andrew Bogott: wikitech: use files from swift rather than local uploads. [mediawiki-config] - https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915)
[01:39:04] (PS8) Andrew Bogott: multiversion: add a transitional mapping for newwikitech.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/415914 (https://phabricator.wikimedia.org/T168470)
[01:40:11] (CR) jerkins-bot: [V: -1] wikitech: use files from swift rather than local uploads. [mediawiki-config] - https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915) (owner: Andrew Bogott)
[01:40:29] (CR) jerkins-bot: [V: -1] multiversion: add a transitional mapping for newwikitech.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/415914 (https://phabricator.wikimedia.org/T168470) (owner: Andrew Bogott)
[01:48:24] !log tgr@tin Started scap: T187226#4025352 update ReadingLists
[01:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:40] T187226: Do not use count(*) for reading list size limits - https://phabricator.wikimedia.org/T187226
[01:54:57] (PS7) BryanDavis: wikitech: use files from swift rather than local uploads. [mediawiki-config] - https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915) (owner: Andrew Bogott)
[01:57:07] (CR) BryanDavis: "LGTM, but would be nice to get a double check from Reedy or Chad" [mediawiki-config] - https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915) (owner: Andrew Bogott)
[02:07:14] !log tgr@tin Finished scap: T187226#4025352 update ReadingLists (duration: 18m 49s)
[02:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:29] T187226: Do not use count(*) for reading list size limits - https://phabricator.wikimedia.org/T187226
[02:17:21] (CR) Krinkle: [C: -1] NavigtationTiming: Enable oversampling for Singapore (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) (owner: Imarlier)
[02:18:30] (PS3) Gergő Tisza: Increase ReadingLists item limit to 5k [mediawiki-config] - https://gerrit.wikimedia.org/r/409712 (https://phabricator.wikimedia.org/T186296)
[02:18:43] (PS4) Gergő Tisza: Increase ReadingLists item limit to 5k [mediawiki-config] - https://gerrit.wikimedia.org/r/409712 (https://phabricator.wikimedia.org/T186296)
[02:18:57] (CR) jerkins-bot: [V: -1] Increase ReadingLists item limit to 5k [mediawiki-config] - https://gerrit.wikimedia.org/r/409712 (https://phabricator.wikimedia.org/T186296) (owner: Gergő Tisza)
[02:19:55] (CR) Gergő Tisza: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/409712 (https://phabricator.wikimedia.org/T186296) (owner: Gergő Tisza)
[02:20:10] (CR) Gergő Tisza: "(CI git failure)" [mediawiki-config] - https://gerrit.wikimedia.org/r/409712 (https://phabricator.wikimedia.org/T186296) (owner: Gergő Tisza)
[02:22:35] (CR) Gergő Tisza: [C: 2] Increase ReadingLists item limit to 5k [mediawiki-config] - https://gerrit.wikimedia.org/r/409712 (https://phabricator.wikimedia.org/T186296) (owner: Gergő Tisza)
[02:23:49] (Merged) jenkins-bot: Increase ReadingLists item limit to 5k [mediawiki-config] - https://gerrit.wikimedia.org/r/409712 (https://phabricator.wikimedia.org/T186296) (owner: Gergő Tisza)
[02:24:47] (CR) Krinkle: [C: 1] "This is imho good to go anytime. Not blocked on 1.31wmf23. If anything, it'd be preferred to roll out in a controlled way first. Instead o" [mediawiki-config] - https://gerrit.wikimedia.org/r/416330 (https://phabricator.wikimedia.org/T188472) (owner: EddieGP)
[02:26:33] !log tgr@tin Synchronized wmf-config/CommonSettings.php: T186296 Increase ReadingLists list size limit to 5k (duration: 01m 06s)
[02:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:26:48] T186296: Increase item limit to 5k - and return limit information in API - https://phabricator.wikimedia.org/T186296
[03:00:18] (PS1) Gergő Tisza: Enable loginOnly mode for local auth provider on group 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/416630 (https://phabricator.wikimedia.org/T57420)
[03:00:20] (PS1) Gergő Tisza: Enable loginOnly mode for local auth provider on group 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/416631 (https://phabricator.wikimedia.org/T57420)
[03:08:08] Operations, ops-eqsin, Traffic, netops: cp5010 - no link on primary ethernet port - https://phabricator.wikimedia.org/T187158#3966190 (Papaul) The DAC needed to be seated on the switch side @BBlack please check and see if you can get not to the server
[03:15:45] Operations, ops-eqsin, Traffic: rack/setup/install cp50(0[1-9]|1[0-2]) - https://phabricator.wikimedia.org/T181557#4025915 (BBlack)
[03:15:48] Operations, ops-eqsin, Traffic, netops: cp5010 - no link on primary ethernet port - https://phabricator.wikimedia.org/T187158#4025912 (BBlack) Open>Resolved a:Papaul DHCP works, so interface is fixed, thanks!
[03:31:00] (PS1) BBlack: eqsin: add dns5002 macaddr [puppet] - https://gerrit.wikimedia.org/r/416633 (https://phabricator.wikimedia.org/T156027)
[03:31:52] (CR) BBlack: [C: 2] eqsin: add dns5002 macaddr [puppet] - https://gerrit.wikimedia.org/r/416633 (https://phabricator.wikimedia.org/T156027) (owner: BBlack)
[03:35:05] (PS4) Krinkle: profiler: Add CPU/MEMORY flags to XHProf in production [mediawiki-config] - https://gerrit.wikimedia.org/r/415209
[03:35:10] (CR) Krinkle: [C: 2] profiler: Add CPU/MEMORY flags to XHProf in production [mediawiki-config] - https://gerrit.wikimedia.org/r/415209 (owner: Krinkle)
[03:35:16] (PS4) Krinkle: profiler: Remove redundant PyBall-XFF check for xwd/profile [mediawiki-config] - https://gerrit.wikimedia.org/r/415210
[03:35:21] (CR) Krinkle: [C: 2] profiler: Remove redundant PyBall-XFF check for xwd/profile [mediawiki-config] - https://gerrit.wikimedia.org/r/415210 (owner: Krinkle)
[03:35:25] (PS4) Krinkle: profiler: Enable xhprof earlier from StartProfile [mediawiki-config] - https://gerrit.wikimedia.org/r/415211 (https://phabricator.wikimedia.org/T180183)
[03:35:28] (CR) Krinkle: [C: 2] profiler: Enable xhprof earlier from StartProfile [mediawiki-config] - https://gerrit.wikimedia.org/r/415211 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[03:36:34] Operations, ops-eqsin, Traffic: rack/setup/install dns500[12] - https://phabricator.wikimedia.org/T181556#4025979 (BBlack)
[03:36:39] Operations, ops-eqsin, Traffic: dns5002 mgmt console unreachable - https://phabricator.wikimedia.org/T186902#4025975 (BBlack) Open>Resolved a:Papaul @Papaul re-seated mgmt console cable, seems to be working now
[03:41:57] (CR) jerkins-bot: [V: -1] profiler: Add CPU/MEMORY flags to XHProf in production [mediawiki-config] - https://gerrit.wikimedia.org/r/415209 (owner: Krinkle)
[03:41:59] (CR) jerkins-bot: [V: -1] profiler: Remove redundant PyBall-XFF check for xwd/profile [mediawiki-config] - https://gerrit.wikimedia.org/r/415210 (owner: Krinkle)
[03:42:18] (CR) Krinkle: profiler: Add CPU/MEMORY flags to XHProf in production [mediawiki-config] - https://gerrit.wikimedia.org/r/415209 (owner: Krinkle)
[03:42:21] (CR) Krinkle: [C: 2] profiler: Add CPU/MEMORY flags to XHProf in production [mediawiki-config] - https://gerrit.wikimedia.org/r/415209 (owner: Krinkle)
[03:43:32] (Merged) jenkins-bot: profiler: Add CPU/MEMORY flags to XHProf in production [mediawiki-config] - https://gerrit.wikimedia.org/r/415209 (owner: Krinkle)
[03:44:12] (CR) jenkins-bot: beta: Remove wgCentralBannerRecorder override for old special page [mediawiki-config] - https://gerrit.wikimedia.org/r/416584 (owner: Krinkle)
[03:44:17] (CR) jenkins-bot: Enable loginOnly mode for local auth provider on group 0 [mediawiki-config] - https://gerrit.wikimedia.org/r/416331 (https://phabricator.wikimedia.org/T57420) (owner: Gergő Tisza)
[03:44:24] (CR) Krinkle: [C: 2] profiler: Remove redundant PyBall-XFF check for xwd/profile [mediawiki-config] - https://gerrit.wikimedia.org/r/415210 (owner: Krinkle)
[03:44:31] (CR) Krinkle: [C: 2] profiler: Enable xhprof earlier from StartProfile [mediawiki-config] - https://gerrit.wikimedia.org/r/415211 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[03:44:53] ^ Staging on mwdebug1002
[03:45:50] (Merged) jenkins-bot: profiler: Remove redundant PyBall-XFF check for xwd/profile [mediawiki-config] - https://gerrit.wikimedia.org/r/415210 (owner: Krinkle)
[03:46:47] (PS1) BBlack: cp5010: uncomment in hieradata [puppet] - https://gerrit.wikimedia.org/r/416634 (https://phabricator.wikimedia.org/T187158)
[03:46:53] (CR) Imarlier: NavigtationTiming: Enable oversampling for Singapore (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) (owner: Imarlier)
[03:47:21] (CR) BBlack: [C: 2] cp5010: uncomment in hieradata [puppet] - https://gerrit.wikimedia.org/r/416634 (https://phabricator.wikimedia.org/T187158) (owner: BBlack)
[03:47:49] (CR) Krinkle: [C: -1] NavigtationTiming: Enable oversampling for Singapore (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) (owner: Imarlier)
[03:52:47] ACKNOWLEDGEMENT - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 54 connecting: cp5010_v4, cp5010_v6 Brandon Black bringing up cp5010...
[03:52:47] ACKNOWLEDGEMENT - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 78 connecting: cp5010_v4, cp5010_v6 Brandon Black bringing up cp5010...
[03:52:47] ACKNOWLEDGEMENT - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 54 connecting: (unnamed) not-conn: cp1055_v4, cp1055_v6 Brandon Black bringing up cp5010...
[03:52:47] ACKNOWLEDGEMENT - IPsec on cp4027 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp1055_v4, cp1055_v6 Brandon Black bringing up cp5010...
[04:01:34] PROBLEM - Freshness of zerofetch successful run file on cp5010 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:01:42] !log krinkle@tin Synchronized wmf-config/profiler.php: T180183 (duration: 01m 33s)
[04:01:43] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp5010 is CRITICAL: connect to address 10.132.0.110 and port 3121: Connection refused
[04:01:43] PROBLEM - Varnish traffic logger - varnishrls on cp5010 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:01:43] PROBLEM - statsv Varnishkafka log producer on cp5010 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:01:45] PROBLEM - Recursive DNS on 103.102.166.9 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[04:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:01:59] T180183: Profiling for X-Wikimedia-Debug seems to start fairly late - https://phabricator.wikimedia.org/T180183
[04:02:26] (PS2) Krinkle: multiversion: Remove support for MW_LANG env override (2) [mediawiki-config] - https://gerrit.wikimedia.org/r/413646
[04:02:34] RECOVERY - Recursive DNS on 103.102.166.9 is OK: DNS OK: 0.454 seconds response time. www.wikipedia.org returns 208.80.154.224
[04:02:45] sorry for the noise
[04:02:57] all the above is me bringing up some eqsin stuff (not in production service)
[04:03:43] RECOVERY - Freshness of zerofetch successful run file on cp5010 is OK: OK
[04:03:43] RECOVERY - Varnish traffic logger - varnishrls on cp5010 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishrls, UID = 0 (root)
[04:03:43] RECOVERY - statsv Varnishkafka log producer on cp5010 is OK: PROCS OK: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf
[04:04:31] (CR) Krinkle: [C: 2] multiversion: Remove support for MW_LANG env override (2) [mediawiki-config] - https://gerrit.wikimedia.org/r/413646 (owner: Krinkle)
[04:04:43] PROBLEM - Host 2001:df2:e500:1:f6e9:d4ff:fed0:7870 is DOWN: PING CRITICAL - Packet loss = 100%
[04:05:44] (Merged) jenkins-bot: multiversion: Remove support for MW_LANG env override (2) [mediawiki-config] - https://gerrit.wikimedia.org/r/413646 (owner: Krinkle)
[04:05:56] ^ staging on mwdebug1002
[04:08:39] !log krinkle@tin Synchronized multiversion/MWMultiVersion.php: Ia2acf57c6 (duration: 00m 57s)
[04:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:15:27] PROBLEM - Host cp5010.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[04:15:37] PROBLEM - Host dns5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[04:15:46] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp5010 is OK: HTTP OK: HTTP/1.1 200 OK - 498 bytes in 0.465 second response time
[04:16:46] ACKNOWLEDGEMENT - Host cp5010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin mgmt network not yet routable --bblack
[04:16:46] ACKNOWLEDGEMENT - Host dns5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin mgmt network not yet routable --bblack
[04:29:08] !log eqsin router maintenance starting soon-ish. all of eqsin will be offline and isn't in production service to begin with. We've tried to downtime all the things, but don't be shocked at spurious alerts! - T187807
[04:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:29:25] T187807: cr1-eqsin faulty interfaces - https://phabricator.wikimedia.org/T187807
[04:33:03] oh of course, ipsec
[04:33:07] trying to get those quickly...
[04:35:13] [done]
[04:42:53] (PS1) Krinkle: profiler: Sync remaining differences with profiler-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/416636 (https://phabricator.wikimedia.org/T180183)
[04:43:59] (CR) jerkins-bot: [V: -1] profiler: Sync remaining differences with profiler-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/416636 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[04:45:01] (PS2) Krinkle: profiler: Sync remaining differences with profiler-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/416636 (https://phabricator.wikimedia.org/T180183)
[04:46:57] (PS1) Krinkle: [WIP] mediawiki: Enable auto_prepend_file on canary_appserver [puppet] - https://gerrit.wikimedia.org/r/416637 (https://phabricator.wikimedia.org/T180183)
[04:47:21] (CR) Krinkle: [C: -1] "Awaiting creation of said wmf-config file first. Currently the labs equiv exists only." [puppet] - https://gerrit.wikimedia.org/r/416637 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[04:47:28] (CR) Krinkle: [C: 2] profiler: Sync remaining differences with profiler-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/416636 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[04:49:48] (PS3) Krinkle: profiler: Sync remaining differences with profiler-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/416636 (https://phabricator.wikimedia.org/T180183)
[04:50:17] (CR) Krinkle: [C: 2] profiler: Sync remaining differences with profiler-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/416636 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[04:51:55] (Merged) jenkins-bot: profiler: Sync remaining differences with profiler-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/416636 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[04:53:16] (PS1) Krinkle: Add PhpAutoPrepend.php [mediawiki-config] - https://gerrit.wikimedia.org/r/416638 (https://phabricator.wikimedia.org/T180183)
[04:53:44] (CR) Krinkle: [C: 2] "no-op (not yet used by puppet), matches beta." [mediawiki-config] - https://gerrit.wikimedia.org/r/416638 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[04:55:08] (Merged) jenkins-bot: Add PhpAutoPrepend.php [mediawiki-config] - https://gerrit.wikimedia.org/r/416638 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[04:57:22] !log krinkle@tin Synchronized wmf-config/PhpAutoPrepend-labs.php: beta: no-op (duration: 00m 57s)
[04:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:58:17] !log krinkle@tin Synchronized wmf-config/profiler-labs.php: beta: no-op (duration: 00m 54s)
[04:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:30] !log krinkle@tin Synchronized wmf-config/profiler.php: T180183 - Ie5a164a9e2b (duration: 00m 57s)
[04:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:44] T180183: Profiling for X-Wikimedia-Debug seems to start fairly late - https://phabricator.wikimedia.org/T180183
[04:59:44] (PS2) Krinkle: mediawiki: Enable auto_prepend_file on canary_appserver [puppet] - https://gerrit.wikimedia.org/r/416637 (https://phabricator.wikimedia.org/T180183)
[05:00:52] !log krinkle@tin Synchronized wmf-config/PhpAutoPrepend.php: T180183: I6d72873b9d3 (duration: 00m 56s)
[05:01:01] * Krinkle is done
[05:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:20] (CR) Dzahn: [C: -1] "i fixed some more comments but also broke it, dont bother with review yet, i'll amend tomorrow anyways" [puppet] - https://gerrit.wikimedia.org/r/400615 (https://phabricator.wikimedia.org/T82937) (owner: Dzahn)
[06:11:07] Operations, ops-eqsin, Traffic, netops: replace eqsin SFP-T/SFP+ - https://phabricator.wikimedia.org/T188923#4026116 (ayounsi)
[06:25:11] Operations, netops: cr1-eqsin faulty interfaces - https://phabricator.wikimedia.org/T187807#4026126 (ayounsi) Open>Resolved Unit replaced by Papaul, all interfaces are up! No alarms.
[06:42:34] (PS1) Marostegui: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/416643 (https://phabricator.wikimedia.org/T187089)
[06:44:26] (CR) jerkins-bot: [V: -1] db-eqiad.php: Depool db1101:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/416643 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:46:18] (PS1) Marostegui: install_server: Allow db2037 to install to stretch [puppet] - https://gerrit.wikimedia.org/r/416644
[06:47:14] (PS2) Marostegui: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/416643 (https://phabricator.wikimedia.org/T187089)
[06:48:56] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1101:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/416643 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:49:16] (CR) Marostegui: [C: 2] install_server: Allow db2037 to install to stretch [puppet] - https://gerrit.wikimedia.org/r/416644 (owner: Marostegui)
[06:50:08] (Merged) jenkins-bot: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/416643 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:51:30] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1101:3317 for alter table (duration: 00m 58s)
[06:51:33] !log Stop mysql on db2037 to upgrade it
[06:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:16] !log Deploy schema change on db1101:3317 - T187089 T185128 T153182
[06:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:33] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[06:56:33] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[06:56:34] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[06:58:11] Operations, ops-eqiad, DBA: Degraded RAID on db1068 - https://phabricator.wikimedia.org/T188187#4026142 (Marostegui) Almost there: ``` root@db1068:~# megacli -PDRbld -ShowProg -PhysDrv [32:9] -aALL Rebuild Progress on Device at Enclosure 32, Slot 9 Completed 97% in 920 Minutes. ```
[07:08:53] Operations, procurement: Buy spare disks - https://phabricator.wikimedia.org/T188978#4026146 (Marostegui)
[07:18:42] !log Stop MySQL on db2093 to get some data from the event scheduler
[07:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:26] RECOVERY - MegaRAID on db1068 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[07:28:45] Operations, ops-eqiad, DBA: Degraded RAID on db1068 - https://phabricator.wikimedia.org/T188187#4026164 (Marostegui) Open>Resolved ``` ˜/icinga-wm 8:26> RECOVERY - MegaRAID on db1068 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy ``` ``` root@db1068:~# megacli -LDPDInfo -aAll A...
[07:34:35] Operations, DBA, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4026166 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db2037.codfw.wmnet'] ``` Th...
[07:36:00] !log rebooting remaining swift backend servers in codfw for kernel security update
[07:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:49] (PS1) Marostegui: db-codfw.php: Depool some hosts: kernel upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/416646
[07:39:31] (PS1) Muehlenhoff: Extended access for siddarth11 [puppet] - https://gerrit.wikimedia.org/r/416647
[07:40:37] (CR) Marostegui: [C: 2] db-codfw.php: Depool some hosts: kernel upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/416646 (owner: Marostegui)
[07:41:24] (CR) Muehlenhoff: [C: 2] Extended access for siddarth11 [puppet] - https://gerrit.wikimedia.org/r/416647 (owner: Muehlenhoff)
[07:41:49] (Merged) jenkins-bot: db-codfw.php: Depool some hosts: kernel upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/416646 (owner: Marostegui)
[07:43:18] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool some codfw hosts for kernel and mariadb upgrade (duration: 00m 58s)
[07:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:58] !log Stop mysql on db2090 db2080 db2076 db2073 db2067 for mariadb and kernel upgrade
[07:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:38] (PS1) Marostegui: Revert "db-codfw.php: Depool some hosts: kernel upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/416648
[07:59:59] !log rebooting tegmen for kernel security update
[08:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:33] !log drain+reboot analytics[61,63,64] for kernel updates
[08:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:59] Operations, DBA, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4026189 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2037.codfw.wmnet'] ``` and were **ALL** successful.
[08:10:32] !log rebooting bast5001 for kernel security update
[08:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:52] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool some hosts: kernel upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/416648 (owner: Marostegui)
[08:13:04] (Merged) jenkins-bot: Revert "db-codfw.php: Depool some hosts: kernel upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/416648 (owner: Marostegui)
[08:14:27] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Revert depool some codfw hosts for kernel and mariadb upgrade (duration: 00m 57s)
[08:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:55] !log rebooting ruthenium for kernel security update
[08:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:49] (PS1) Marostegui: db-eqiad,db-codfw.php: Update db1069 IP [mediawiki-config] - https://gerrit.wikimedia.org/r/416649
[08:23:10] (CR) Giuseppe Lavagetto: [C: 1] db-eqiad,db-codfw.php: Update db1069 IP [mediawiki-config] - https://gerrit.wikimedia.org/r/416649 (owner: Marostegui)
[08:23:19] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Update db1069 IP [mediawiki-config] - https://gerrit.wikimedia.org/r/416649 (owner: Marostegui)
[08:24:30] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Update db1069 IP [mediawiki-config] - https://gerrit.wikimedia.org/r/416649 (owner: Marostegui)
[08:25:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Update db1069 IP (duration: 00m 57s)
[08:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:55] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Update db1069 IP (duration: 00m 57s)
[08:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:26]
!log drain+reboot analytics[1065-1067] for kernel updates [08:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:34] !log rebooting wasat for kernel security update [08:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:26] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [08:38:44] !log rebooting furud [08:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:06] RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [08:39:19] (03PS3) 10Giuseppe Lavagetto: Fetch the last modified index in etcd.php, and expose it via siteinfo. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416470 (https://phabricator.wikimedia.org/T182597) [08:39:21] (03PS2) 10Giuseppe Lavagetto: Fetch data from etcd on every server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416482 (https://phabricator.wikimedia.org/T182597) [08:39:23] (03PS2) 10Giuseppe Lavagetto: Enable use of EtcdConfig everywhere. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416483 (https://phabricator.wikimedia.org/T182597) [08:40:26] (03CR) 10jerkins-bot: [V: 04-1] Fetch data from etcd on every server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416482 (https://phabricator.wikimedia.org/T182597) (owner: 10Giuseppe Lavagetto) [08:40:28] (03CR) 10jerkins-bot: [V: 04-1] Enable use of EtcdConfig everywhere. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416483 (https://phabricator.wikimedia.org/T182597) (owner: 10Giuseppe Lavagetto) [08:40:36] PROBLEM - DPKG on oxygen is CRITICAL: DPKG CRITICAL dpkg reports broken packages [08:42:42] (03CR) 10Ema: [C: 031] "Ship it!" 
[debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/416412 (owner: 10Vgutierrez) [08:43:12] (03CR) 10Vgutierrez: [C: 032] Release PyBal 1.15.0 [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/416412 (owner: 10Vgutierrez) [08:44:00] (03Merged) 10jenkins-bot: Release PyBal 1.15.0 [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/416412 (owner: 10Vgutierrez) [08:44:28] (03PS1) 10Marostegui: site.pp: Make db1073 master [puppet] - 10https://gerrit.wikimedia.org/r/416650 (https://phabricator.wikimedia.org/T183469) [08:44:52] (03CR) 10jenkins-bot: Increase ReadingLists item limit to 5k [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409712 (https://phabricator.wikimedia.org/T186296) (owner: 10Gergő Tisza) [08:45:22] (03CR) 10jenkins-bot: profiler: Add CPU/MEMORY flags to XHProf in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415209 (owner: 10Krinkle) [08:47:52] (03CR) 10jenkins-bot: multiversion: Remove support for MW_LANG env override (2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413646 (owner: 10Krinkle) [08:48:16] (03CR) 10EddieGP: "I probably can't be here for midday swat (the only one before the next train), but in case anyone else wants to do it, please feel free to" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416330 (https://phabricator.wikimedia.org/T188472) (owner: 10EddieGP) [08:49:21] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/10275/db1073.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/416650 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [08:50:28] !log reboot meitnerium (archiva) for kernel updates [08:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:56] (03CR) 10jenkins-bot: profiler: Sync remaining differences with profiler-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416636 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [08:54:10] (03CR) 10jenkins-bot: db-eqiad.php: 
Depool db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416643 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [08:55:35] 10Operations, 10DC-Ops, 10monitoring: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4026233 (10fgiunchedi) @akosiaris suggested also `edac-tools` and that reminded me we're exporting edac metrics from node-exporter: ``` tin:~$ curl -s localhost:9100/metrics | grep -i ^node_... [08:56:39] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416651 [08:56:44] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416651 [09:05:33] !log rolling restart of maps* for kernel upgrade [09:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:15] (03CR) 10Filippo Giunchedi: [C: 031] Add a thirdparty/php71 component for use by Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/415856 (owner: 10Muehlenhoff) [09:08:33] PROBLEM - Host maps-test2001 is DOWN: PING CRITICAL - Packet loss = 100% [09:08:58] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/415857 (owner: 10Muehlenhoff) [09:09:12] (03PS1) 10Ema: varnish: cleanup after upgrade to v5 [puppet] - 10https://gerrit.wikimedia.org/r/416652 (https://phabricator.wikimedia.org/T188545) [09:09:22] (03CR) 10Elukey: [C: 031] Add repository configuration for thirdparty/php71 [puppet] - 10https://gerrit.wikimedia.org/r/415857 (owner: 10Muehlenhoff) [09:10:36] (03CR) 10Filippo Giunchedi: [C: 031] hhvm: remove legacy diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/415828 (owner: 10Giuseppe Lavagetto) [09:10:42] RECOVERY - Host maps-test2001 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms [09:10:58] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416651 (owner: 10Marostegui) [09:11:30] (03CR) 10Giuseppe Lavagetto: "I would honestly prefer to either backport php 7.2 from sid ourselves, or to use any official backport of it. And do that for *all* of our" [puppet] - 10https://gerrit.wikimedia.org/r/415856 (owner: 10Muehlenhoff) [09:12:12] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416651 (owner: 10Marostegui) [09:13:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1101:3317 after alter table (duration: 00m 57s) [09:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:57] (03PS1) 10Marostegui: db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416654 (https://phabricator.wikimedia.org/T187089) [09:15:26] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416654 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [09:16:39] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416654 (https://phabricator.wikimedia.org/T187089) (owner: 
10Marostegui) [09:17:22] !log rebooting wwift backend servers in eqiad for kernel security update [09:17:27] !log rebooting swift backend servers in eqiad for kernel security update [09:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:54] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1086 for alter table (duration: 00m 57s) [09:18:06] !log Stop and reboot db1086 for kernel and mariadb upgrade [09:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:48] !log Deploy schema change on db1086 - T187089 T185128 T153182 [09:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:06] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [09:25:06] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [09:25:06] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [09:27:23] !log reboot kafka2001 (eventbus codfw) for kernel updates [09:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:03] (03CR) 10jenkins-bot: db-codfw.php: Depool some hosts: kernel upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416646 (owner: 10Marostegui) [09:32:22] PROBLEM - proxysql processes on wasat is CRITICAL: PROCS CRITICAL: 0 processes with command name proxysql [09:36:16] <_joe_> marostegui: ^^ [09:36:37] let's see [09:38:48] !log rebooting wezen for kernel security update [09:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:15] !log Start proxysql on wasat [09:40:23] RECOVERY - proxysql processes on wasat 
is OK: PROCS OK: 1 process with command name proxysql [09:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:36] !log pybal_1.15.0_all.deb to apt.wikimedia.org jessie-wikimedia [09:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:33] !log Stop MySQL on db1107 for mariadb and kernel upgrade [09:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:02] RECOVERY - DPKG on oxygen is OK: All packages OK [09:45:22] PROBLEM - haproxy failover on dbproxy1009 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [09:45:23] PROBLEM - haproxy failover on dbproxy1004 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [09:45:58] ^ that is expected because of db1107 mention above [09:46:29] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool some hosts: kernel upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416648 (owner: 10Marostegui) [09:46:34] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Update db1069 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416649 (owner: 10Marostegui) [09:46:38] (03CR) 10jenkins-bot: profiler: Remove redundant PyBall-XFF check for xwd/profile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415210 (owner: 10Krinkle) [09:46:43] (03CR) 10jenkins-bot: profiler: Enable xhprof earlier from StartProfile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415211 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [09:46:49] (03CR) 10jenkins-bot: Add PhpAutoPrepend.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416638 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [09:46:54] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416651 (owner: 10Marostegui) [09:47:01] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416654 
(https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [09:48:59] (03CR) 10Jcrespo: [C: 04-2] "This is the firewall used by all production hosts, this shouldn't be touched. This is not in use by m5." [puppet] - 10https://gerrit.wikimedia.org/r/416598 (https://phabricator.wikimedia.org/T188915) (owner: 10Andrew Bogott) [09:51:16] !log rebooting graphite hosts in codfw for kernel security update [09:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:54] (03PS1) 10Vgutierrez: Release PyBal 1.15.0 [debs/pybal] - 10https://gerrit.wikimedia.org/r/416657 [10:09:24] (03CR) 10Vgutierrez: [C: 032] Release PyBal 1.15.0 [debs/pybal] - 10https://gerrit.wikimedia.org/r/416657 (owner: 10Vgutierrez) [10:09:53] (03Merged) 10jenkins-bot: Release PyBal 1.15.0 [debs/pybal] - 10https://gerrit.wikimedia.org/r/416657 (owner: 10Vgutierrez) [10:10:20] \o/ [10:20:51] !log reboot completed for maps2* and maps-test* [10:21:03] maps1* (eqiad) will be after lunch [10:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:30] (03PS3) 10Gehel: wdqs: enable kafka poller on all production nodes [puppet] - 10https://gerrit.wikimedia.org/r/416475 (https://phabricator.wikimedia.org/T188252) [10:22:11] (03CR) 10Gehel: [C: 032] wdqs: enable kafka poller on all production nodes [puppet] - 10https://gerrit.wikimedia.org/r/416475 (https://phabricator.wikimedia.org/T188252) (owner: 10Gehel) [10:22:32] PROBLEM - puppet last run on ms-be1013 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 15 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdk1],Exec[xfs_label-/dev/sdm3],Exec[xfs_label-/dev/sdm4] [10:22:48] gehel: how dare you preferring lunch to reboots :) [10:23:07] elukey: yeah, I know, strange priorities ... [10:23:37] there are a few other things I want to do before lunch and I prefer not to stop in the middle of the cluster restart... 
[10:24:04] * gehel has heard about a nice tool coming up to automate all that (volans 😉) [10:24:24] (03PS1) 10Marostegui: dbproxy1005: Make db1073 master instead of db1009 [puppet] - 10https://gerrit.wikimedia.org/r/416658 (https://phabricator.wikimedia.org/T183469) [10:24:36] (03CR) 10Marostegui: [C: 04-2] "This is not yet ready" [puppet] - 10https://gerrit.wikimedia.org/r/416658 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [10:24:45] (03CR) 10ArielGlenn: [C: 031] "This looks fine; I'd like to coordinate the merge with you unless you are ok with me "just doing it"." [puppet] - 10https://gerrit.wikimedia.org/r/416502 (https://phabricator.wikimedia.org/T168486) (owner: 10Madhuvishy) [10:24:46] I need to spend time on automating hadoop reboots, the cluster is getting bigger and bigger, manually is not feasible anymore [10:25:31] elukey: btw, I'm activating kafka poller on all wdqs nodes [10:26:32] gehel: super [10:28:30] !log rebooting tin for kernel security update [10:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:10] !log kafka poller active on all production wdqs nodes - T188252 [10:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:27] T188252: Activate kafka-based recent change poller for wikidata query service - https://phabricator.wikimedia.org/T188252 [10:32:44] !log rearming keyholder on tin after reboot [10:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:31] !log rebooting naos for kernel security update [10:35:37] gehel: /me hiding [10:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:20] <_joe_> elukey: if volans ever delivered on his promise of a switchdc spinoff [10:37:23] <_joe_> ... 
[10:38:07] it is always Riccardo's fault I know [10:38:35] (03PS1) 10Alexandros Kosiaris: Enable CAPTCHA in metawiki contactpages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416659 [10:38:55] if you had me include it in any of current goals... [10:39:37] !log reboot ms-be1013 to try fix disk ordering [10:39:41] !log emergency add a captcha in metawiki contact pages like https://meta.wikimedia.org/wiki/Special:Contact/Stewards to stop bot abuse. phab Task to be filed later on [10:39:44] (03CR) 10jerkins-bot: [V: 04-1] Enable CAPTCHA in metawiki contactpages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416659 (owner: 10Alexandros Kosiaris) [10:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:11] (03PS2) 10Alexandros Kosiaris: Enable CAPTCHA in metawiki contactpages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416659 [10:43:15] !log rearming keyholder on naos after reboot [10:43:28] (03CR) 10Alexandros Kosiaris: [C: 032] Enable CAPTCHA in metawiki contactpages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416659 (owner: 10Alexandros Kosiaris) [10:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:42] !log akosiaris@tin Synchronized wmf-config/CommonSettings.php: (no justification provided) (duration: 01m 22s) [10:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:22] !log powercycling ms-be1021, stuck after reboot [10:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:58] (03PS4) 10Giuseppe Lavagetto: Fetch the last modified index in etcd.php, and expose it via siteinfo. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/416470 (https://phabricator.wikimedia.org/T182597) [10:47:00] (03PS3) 10Giuseppe Lavagetto: Fetch data from etcd on every server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416482 (https://phabricator.wikimedia.org/T182597) [10:47:02] (03PS3) 10Giuseppe Lavagetto: Enable use of EtcdConfig everywhere. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416483 (https://phabricator.wikimedia.org/T182597) [10:47:37] RECOVERY - puppet last run on ms-be1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:48:15] (03CR) 10jenkins-bot: Enable CAPTCHA in metawiki contactpages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416659 (owner: 10Alexandros Kosiaris) [10:48:24] (03CR) 10jerkins-bot: [V: 04-1] Fetch data from etcd on every server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416482 (https://phabricator.wikimedia.org/T182597) (owner: 10Giuseppe Lavagetto) [10:48:26] (03CR) 10jerkins-bot: [V: 04-1] Enable use of EtcdConfig everywhere. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416483 (https://phabricator.wikimedia.org/T182597) (owner: 10Giuseppe Lavagetto) [11:05:27] 10Operations, 10OTRS: https://meta.wikimedia.org/wiki/Special:Contact/Stewards is being abused by spammers - https://phabricator.wikimedia.org/T188985#4026480 (10akosiaris) [11:05:49] !log reboot analytics10[28,35,52] for kernel updates (one at the time, hadoop hdfs journal nodes) [11:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:28] 10Operations, 10OTRS: https://meta.wikimedia.org/wiki/Special:Contact/Stewards is being abused by spammers - https://phabricator.wikimedia.org/T188985#4026492 (10akosiaris) p:05Triage>03Normal The above has been done in https://gerrit.wikimedia.org/r/#/c/416659/ and now the contact pages on metawiki have c... 
[11:08:19] (03PS4) 10Giuseppe Lavagetto: Fetch data from etcd on every server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416482 (https://phabricator.wikimedia.org/T182597) [11:08:21] (03PS4) 10Giuseppe Lavagetto: Enable use of EtcdConfig everywhere. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416483 (https://phabricator.wikimedia.org/T182597) [11:27:32] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-m4-master-00 consumer/mysql-eventbus [11:29:12] !log rebooting k8s masters for kernel security update [11:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:06] 10Operations, 10OTRS, 10Stewards-and-global-tools: https://meta.wikimedia.org/wiki/Special:Contact/Stewards is being abused by spammers - https://phabricator.wikimedia.org/T188985#4026543 (10MarcoAurelio) [11:35:22] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [11:35:23] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [11:37:20] peak that should be recovered --^ [11:38:18] <_joe_> elukey: any Idea what happened? 
[11:38:39] nope sorry didn't check, I was busy with el :( [11:40:48] <_joe_> yeah don't worry [11:42:22] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [11:42:32] (03CR) 10Volans: [C: 04-1] "I think that the current proposed structure of dependencies doesn't follow our Puppet coding style/standards, see comments inline." (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/413881 (https://phabricator.wikimedia.org/T187258) (owner: 10Herron) [11:43:32] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [11:57:01] (03CR) 10Volans: [C: 031] "I'm definitely no expert on MW-config but LGTM. It would be nice to have the whole patch series reviewed by someone else more used to MW-c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416470 (https://phabricator.wikimedia.org/T182597) (owner: 10Giuseppe Lavagetto) [11:57:55] (03CR) 10Volans: "I like the approach that would allow us to test it and test the Icinga checks. See one comment inline though." 
(034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416482 (https://phabricator.wikimedia.org/T182597) (owner: 10Giuseppe Lavagetto) [11:58:18] (03PS1) 10Filippo Giunchedi: wmflib: port role and nuyaml to hiera3 [puppet] - 10https://gerrit.wikimedia.org/r/416664 (https://phabricator.wikimedia.org/T188623) [11:58:20] (03PS1) 10Filippo Giunchedi: Use hiera3 role/nuyaml backends on >= stretch [puppet] - 10https://gerrit.wikimedia.org/r/416665 (https://phabricator.wikimedia.org/T188623) [11:58:27] (03CR) 10Volans: [C: 04-1] "I think is missing some changes in CommonSettings.php, see inline." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416483 (https://phabricator.wikimedia.org/T182597) (owner: 10Giuseppe Lavagetto) [11:59:18] (03CR) 10jerkins-bot: [V: 04-1] Use hiera3 role/nuyaml backends on >= stretch [puppet] - 10https://gerrit.wikimedia.org/r/416665 (https://phabricator.wikimedia.org/T188623) (owner: 10Filippo Giunchedi) [12:02:46] (03PS2) 10Filippo Giunchedi: Use hiera3 role/nuyaml backends on >= stretch [puppet] - 10https://gerrit.wikimedia.org/r/416665 (https://phabricator.wikimedia.org/T188623) [12:03:29] (03CR) 10jerkins-bot: [V: 04-1] Use hiera3 role/nuyaml backends on >= stretch [puppet] - 10https://gerrit.wikimedia.org/r/416665 (https://phabricator.wikimedia.org/T188623) (owner: 10Filippo Giunchedi) [12:04:23] !log rebooting graphite hosts in eqiad for kernel security update [12:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:54] (03PS3) 10Filippo Giunchedi: Use hiera3 role/nuyaml backends on >= stretch [puppet] - 10https://gerrit.wikimedia.org/r/416665 (https://phabricator.wikimedia.org/T188623) [12:05:20] 10Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864#4026618 (10MarcoAurelio) I think {T181803} is reason enough not to delay this further. 
Some of our mailing lists are used to discuss sensitive topics.... [12:14:33] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/10281/" [puppet] - 10https://gerrit.wikimedia.org/r/416665 (https://phabricator.wikimedia.org/T188623) (owner: 10Filippo Giunchedi) [12:15:50] (03CR) 10Ladsgroup: [C: 031] "Do you want to deploy it? Let me know :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415861 (https://phabricator.wikimedia.org/T181159) (owner: 10Awight) [12:17:02] (03CR) 10Filippo Giunchedi: "See https://gerrit.wikimedia.org/r/c/416665/ and https://gerrit.wikimedia.org/r/c/416664 for stretch+jessie compat" [puppet] - 10https://gerrit.wikimedia.org/r/402346 (owner: 10Giuseppe Lavagetto) [12:17:07] (03CR) 10Filippo Giunchedi: "See https://gerrit.wikimedia.org/r/c/416665/ and https://gerrit.wikimedia.org/r/c/416664 for stretch+jessie compat" [puppet] - 10https://gerrit.wikimedia.org/r/415896 (https://phabricator.wikimedia.org/T188623) (owner: 10Filippo Giunchedi) [12:33:22] !log rebooting mwlog* for kernel security update [12:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:21] (03PS3) 10MarcoAurelio: Disable abusefilter from collecting private data on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416346 (https://phabricator.wikimedia.org/T188862) [12:37:12] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416671 [12:37:17] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416671 [12:40:03] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416671 (owner: 10Marostegui) [12:41:57] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416671 (owner: 10Marostegui) [12:42:10] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool 
db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416671 (owner: 10Marostegui) [12:43:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1086 after alter table (duration: 00m 58s) [12:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:11] 10Operations, 10Analytics-Kanban: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4026711 (10elukey) p:05Triage>03High [12:50:46] !log mobrovac@tin Started deploy [cpjobqueue/deploy@9b0b947]: refreshLinks: Increase concurrency to 100 - T185052 [12:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:03] T185052: Migrate RefreshLinks job to kafka - https://phabricator.wikimedia.org/T185052 [12:51:20] !log mobrovac@tin Finished deploy [cpjobqueue/deploy@9b0b947]: refreshLinks: Increase concurrency to 100 - T185052 (duration: 00m 34s) [12:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:52] !log rebooting URL downloaders for kernel security update [12:55:58] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4026799 (10Marostegui) I would like to failover m5 to get rid of db1009 and make db1073 the master. This is what I had in mind 1- merge: https://gerri... 
[12:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:07] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add tiller image used to run services [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/416473 (owner: 10Alexandros Kosiaris) [13:01:24] 10Operations, 10Analytics-Kanban: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4026817 (10elukey) [13:05:43] marostegui: any more upcoming depoolings in the next hour or so? [13:05:43] (03PS1) 10Marostegui: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416677 (https://phabricator.wikimedia.org/T187089) [13:05:46] ha [13:05:50] mobrovac: XDDDD [13:06:10] I can wait [13:06:11] no worries [13:06:31] marostegui: you can go ahead with this one if you want, then i can take over? [13:06:36] sure! [13:06:37] thanks [13:06:54] cool [13:06:59] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416677 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [13:08:11] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416677 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [13:08:26] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416677 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [13:09:42] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1094 for alter table (duration: 00m 58s) [13:09:49] mobrovac: I am done [13:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:24] kk thnx marostegui [13:10:52] !log Deploy schema change on db1094 - T187089 T185128 T153182 [13:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:09] T187089: Fix WMF schemas to not 
break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [13:11:09] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [13:11:09] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [13:15:07] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4026866 (10jcrespo) Please note m5-master is not pointing to dbproxy1005. Ideally it should (after the failover happens), but with T188589 ongoing, I d... [13:16:26] marostegui: could i enlist your help real quick? i'd need a chmod -R g+x on /srv/mediawiki-staging/php-1.31.0-wmf.23/.git [13:16:34] on tin [13:16:50] !log powercycling ms-be1038, stuck after reboot [13:16:59] g+x ? [13:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:12] if there is an ownership problem [13:17:25] that should be fixed, not +x ? [13:17:46] mobrovac: do you have problems fetching or rebasing? [13:17:48] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4026868 (10Marostegui) >>! In T183469#4026866, @jcrespo wrote: > Please note m5-master is not pointing to dbproxy1005. Ideally it should (after the fai... [13:18:00] jynus: some dirs there don't have the group write bit set, so git can't unpack objects [13:18:14] fetching [13:18:43] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4026873 (10jcrespo) > Yeah, that is why I thought about completely stop MySQL on db1009 apart from the read_only=ON Depending on the implementation th... 
[13:19:44] Operations, DBA, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4026875 (Marostegui) >>! In T183469#4026873, @jcrespo wrote: >> Yeah, that is why I thought about completely stop MySQL on db1009 apart from the read...
[13:20:05] ah sorry marostegui, jynus, i meant g+w :P
[13:20:13] ah, ok
[13:20:29] * mobrovac hasn't had enough coffee yet
[13:23:03] I can fetch with no problem as a regular user, which error are you seeing?
[13:23:20] sorry to ask, I just want to make sure everything is fixed
[13:23:32] "error: insufficient permission for adding an object to repository database .git/objects"
[13:23:38] ok, let me see
[13:24:15] most dirs have drwxrwsr-x, but some have drwxr-sr-x which creates the problem
[13:25:06] (PS1) Marostegui: wmnet: Promote db1073 to become m5 master [dns] - https://gerrit.wikimedia.org/r/416680 (https://phabricator.wikimedia.org/T183469)
[13:25:17] Operations, Contributors-Analysis, Mail, Surveys: Qualtrics cannot send email to wikimedia.org addresses - https://phabricator.wikimedia.org/T176666#4026906 (Neil_P._Quinn_WMF) p:Triage>Normal
[13:25:19] (CR) Marostegui: [C: -2] "Not ready" [dns] - https://gerrit.wikimedia.org/r/416680 (https://phabricator.wikimedia.org/T183469) (owner: Marostegui)
[13:25:55] Operations, DBA, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4026912 (Marostegui) To reflect the latest comments: merge: https://gerrit.wikimedia.org/r/#/c/416650/ move db2037 under db1073 merge: https://gerri...
[13:26:35] mobrovac: jynus@tin:/srv/mediawiki-staging/.git$ sudo ls -lR . | grep 'drwxr-sr-x' -> no results
[13:27:01] jynus: the dir is /srv/mediawiki-staging/php-1.31.0-wmf.23 :)
[13:27:04] that s is probably an error
[13:27:08] ok
[13:27:22] well, /srv/mediawiki-staging/php-1.31.0-wmf.23/.git that is
[13:27:52] start by there
[13:27:57] that is indeed broken
[13:28:08] sorry about that
[13:29:02] i appreciate the due diligence
[13:29:52] try now
[13:30:14] works now
[13:30:17] thnx jynus!
[13:32:39] Operations, DBA, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4026941 (jcrespo) s/grab binlog position for db1073 just in case/grab it from both in case we want to repoint the original master "just in case"/ Af...
[13:34:20] so why are there a loooot of commits in wmf.23 not pulled on tin? what is going on here?
[13:35:13] Operations, DBA, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4026946 (Marostegui) >>! In T183469#4026941, @jcrespo wrote: > s/grab binlog position for db1073 just in case/grab it from both in case we want to re...
[13:36:37] hashar, zeljkof: ping?
[13:38:42] (CR) Vgutierrez: [C: 1] "FWIW, I've been unable to find orphan references to varnish 4 --> 5 migration, LGTM" [puppet] - https://gerrit.wikimedia.org/r/416652 (https://phabricator.wikimedia.org/T188545) (owner: Ema)
[13:40:54] Operations, ops-codfw, User-Elukey: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4026968 (MoritzMuehlenhoff) >>! In T188301#4023857, @Joe wrote: > My racking recommendation would be to put the new servers in row B in place of the ones we're decommissioning. > > That w...
[13:41:20] mobrovac: hello. What is happening?
[13:41:36] mobrovac: pong
[13:45:12] <_joe_> zeljkof: < mobrovac> so why are there a loooot of commits in wmf.23 not pulled on tin? what is going on here?
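Side note on the earlier permission error (most directories under `.git` at `drwxrwsr-x`, a few at `drwxr-sr-x`): the check and the `g+w` fix can be sketched in Python. This is a minimal sketch, not what was actually run on tin. The helper names `dirs_missing_group_write` and `add_group_write` are made up for illustration, and the demo works on a throwaway temporary tree rather than a real repository.

```python
import os
import stat
import tempfile


def dirs_missing_group_write(root):
    """Return directories under root (root included) that lack the group-write bit."""
    missing = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        if not os.stat(dirpath).st_mode & stat.S_IWGRP:
            missing.append(dirpath)
    return sorted(missing)


def add_group_write(paths):
    """Rough equivalent of `chmod g+w` per path; all other mode bits are kept."""
    for path in paths:
        mode = stat.S_IMODE(os.stat(path).st_mode)
        os.chmod(path, mode | stat.S_IWGRP)


if __name__ == "__main__":
    # Hypothetical demo tree: one directory with g+w, one without.
    with tempfile.TemporaryDirectory() as root:
        os.chmod(root, 0o775)
        ok_dir = os.path.join(root, "ab")
        bad_dir = os.path.join(root, "cd")
        os.mkdir(ok_dir)
        os.mkdir(bad_dir)
        os.chmod(ok_dir, 0o775)   # rwxrwxr-x
        os.chmod(bad_dir, 0o755)  # rwxr-xr-x, group cannot write
        broken = dirs_missing_group_write(root)
        print("missing g+w:", [os.path.relpath(p, root) for p in broken])
        add_group_write(broken)
        print("still missing:", dirs_missing_group_write(root))
```

Listing the offending directories first, as jynus did with `ls -lR | grep`, keeps the fix narrower than a blanket recursive chmod.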
[13:45:39] <_joe_> mobrovac: can I ask how did you check that
[13:46:40] <_joe_> mobrovac: last change I see there is Ie59f62dcb52ea12f
[13:46:59] mobrovac _joe_ don't know, did not visit tin since last week, hashar was in charge of swat yesterday, maybe he knows
[13:48:22] <_joe_> ok so let's ping him on the phone, if he doesn't respond here/hangouts
[13:48:41] know what?
[13:49:11] ah commits
[13:49:17] I have no clue? :D
[13:49:44] <_joe_> ok let's just sort this out, or I hereby decree a complete freeze on all deployments
[13:50:06] <_joe_> mobrovac: can i ask again how did you check that?
[13:50:07] yeah, let's fix this
[13:50:12] so camo on tin
[13:50:19] hmm
[13:50:26] did git fetch and git log ..origin/wmf/bla-23
[13:50:31] <_joe_> mobrovac: no, let's figure out what happened if anything
[13:50:46] <_joe_> can i get the command and outputs pasted somewhere?
[13:51:32] mobrovac: and please describe what is wrong?
[13:51:55] there's something very wrong here, the list given by git log is never-ending
[13:52:00] wait a sec, checking something
[13:53:02] lemme paste it
[13:54:01] here it is _joe_ hashar: https://phabricator.wikimedia.org/P6800
[13:54:20] hashar: and the huge list of commits that you see there is the problem
[13:54:51] <_joe_> from what I see there is some problem somewhere indeed, not sure the issue is tin though
[13:56:28] mobrovac: yeah your git command is wrong?
[13:56:44] euh?
[13:56:48] kidding :D
[13:57:13] * wmf/1.31.0-wmf.23 a28f14cba8 [origin/wmf/1.31.0-wmf.23: ahead 2, behind 92] SECURITY:
[13:57:23] yeah i'm not in the fun mood, honestly, i spent the whole window of 1h fixing other problems on tin
[13:57:47] so somehow it got rebased to a wrong branch
[13:58:02] beautiful
[13:58:04] you can get the history with: git reflog
[13:58:39] <_joe_> hashar: it == what?
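Side note on the `ahead 2, behind 92` status quoted above: the two numbers are set differences over commit ancestry. "Ahead" counts commits reachable from the local tip but not from the remote tip, "behind" counts the reverse, and `git log ..origin/<branch>` lists that second set. A minimal sketch on a toy history follows. The commit names and the `parents` mapping are hypothetical and only illustrate the counting; they are not the real wmf.23 history.

```python
def ancestors(parents, tip):
    """All commits reachable from tip (tip included) by following parent links."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, ()))
    return seen


def ahead_behind(parents, local_tip, upstream_tip):
    """Return (ahead, behind) of local_tip relative to upstream_tip."""
    local = ancestors(parents, local_tip)
    upstream = ancestors(parents, upstream_tip)
    return len(local - upstream), len(upstream - local)


# Hypothetical toy history: both sides share "base" and then diverge.
parents = {
    "base": [],
    "local1": ["base"],
    "local2": ["local1"],
    "up1": ["base"],
    "up2": ["up1"],
    "up3": ["up2"],
}

ahead, behind = ahead_behind(parents, "local2", "up3")
print(f"ahead {ahead}, behind {behind}")
```

A large "behind" count after a single small change was expected upstream is the kind of signal that prompted the investigation here.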
[13:58:42] all of them are wmf.23 though
[13:59:33] <_joe_> so lemme test if production is on the right branch or not
[13:59:38] <_joe_> first of all
[14:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 8 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180306T1400).
[14:00:05] _joe_ and James_F: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:16] !log rebooting oxygen for kernel security update
[14:00:19] <_joe_> yeah swat is suspended
[14:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:38] Hey.
[14:00:43] Sure.
[14:00:45] <_joe_> !log SWAT is suspended for investigation on tin's git status
[14:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:03] <_joe_> James_F: sorry :(
[14:01:05] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:01:15] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:01:22] _joe_: No rush.
[14:01:25] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:01:35] PROBLEM - puppet last run on db1082 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:01:35] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:01:35] PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:01:35] PROBLEM - puppet last run on aqs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:01:35] PROBLEM - puppet last run on elastic1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:01:55] PROBLEM - puppet last run on ms-be1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:02:05] PROBLEM - puppet last run on db1097 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:02:45] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:02:45] PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:02:52] <_joe_> ok so
[14:03:03] <_joe_> what is on tin is what's deployed in production
[14:03:19] <_joe_> looking at the file RELEASE-NOTES-1.31
[14:03:25] PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:03:47] yeah well
[14:04:06] PROBLEM - puppet last run on restbase1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:04:06] PROBLEM - puppet last run on db1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:04:15] Operations, DBA, cloud-services-team: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4027038 (Marostegui) p:Triage>Normal
[14:04:17] 88b947427ce989eacf11888b96ffeec69307b189
[14:04:25] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:04:29] that is a merge of "I have no clue what branch" into wmf/1.31.0-wmf.23
[14:04:39] done at 13:18:48 today
[14:04:45] (utc obviously)
[14:04:58] <_joe_> hashar: ok, any further deployments after that?
[14:05:12] I dont even know whether that has been deployed
[14:05:13] <_joe_> hashar: you mean on tin or on gerrit?
[14:05:17] on tin
[14:05:21] <_joe_> please be specific
[14:05:27] <_joe_> it has, according to that one file
[14:05:35] <_joe_> but should I check another?
[14:05:44] well the origin/wmf/1.31.0-wmf.23 branch tip points at that merge commit 88b947427c
[14:06:09] and that is in Gerrit apparently
[14:06:10] https://gerrit.wikimedia.org/r/#/c/416675/ got merged at 13:18
[14:06:13] <_joe_> ok so that's gerrit
[14:06:19] <_joe_> not tin
[14:06:24] <_joe_> please be specific
[14:06:29] and that's a simple cherry-pick
[14:06:30] <_joe_> so the error is in gerrit, right?
[14:07:07] !log rebooting pybal-test for kernel security update
[14:07:12] <_joe_> mobrovac: maybe gerrit did fuckup?
[14:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:26] so drastically? for a simple cherry-pick?
[14:07:33] !log rebooting maps1* (eqiad) for kernel security update
[14:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:31] jynus marostegui revi https://tools.wmflabs.org/replag/ <--- weird, is something happening ?
[14:09:09] arturo: at a first glance I am not seeing any issues with the labs hosts
[14:09:12] <_joe_> is anyone doing *things* on tin?
[14:09:12] Going to check in deep
[14:09:21] <_joe_> meaning trying to fix issues?
[14:09:21] _joe_: not me
[14:09:28] <_joe_> if so, log it please
[14:09:32] <_joe_> and let's coordinate
[14:09:50] thanks marostegui
[14:09:57] so
[14:09:59] tin is fine
[14:10:04] as far as I can tell
[14:10:10] BUT DO NOT DEPLOY FROM IT!
[14:10:19] arturo: I don't see any lag on any of the hosts
[14:10:38] the mediawiki/core branch "wmf/1.31.0-wmf.23" ended up having a merge commit of the master branch
[14:10:46] <_joe_> hashar: yes
[14:10:56] <_joe_> and it seems it's jenkins-bot who did the merge?
[14:10:58] and tin reflects that accordingly
[14:11:01] marostegui: does it make sense if the replag app is unable to do proper calculation of the data?
[14:11:10] ok, let's revert 88b947427c then hashar _joe_?
[14:11:16] nop
[14:11:23] arturo: I have checked the lag on the hosts :)
[14:11:25] lets figure out what happened
[14:11:34] <_joe_> mobrovac: that's not enough, we need to fix the whole branch
[14:12:19] did I break something, I am around, but didn't touch tin except for the command he asked me to do?
[14:12:25] I think that's my fault - seems like I've done cherry-pick in master and then git review into wmf.23
[14:12:26] AHHHHH
[14:12:27] i got it :)
[14:12:35] <_joe_> so yeah for example, https://gerrit.wikimedia.org/r/#/c/416220/ is merged into wmf.23
[14:12:39] <_joe_> hashar: do tell?
[14:12:41] so here's what happened
[14:13:00] <_joe_> this is very very serious btw
[14:13:04] a patch got proposed for the master branch
[14:13:05] there should be backups of mediawiki-staging since the last time
[14:13:05] https://gerrit.wikimedia.org/r/#/c/416499/
[14:13:12] then sent for review with refs/for/master
[14:13:15] the patch gets merged
[14:13:24] <_joe_> jynus: mediawiki-staging is fine
[14:13:29] ah, ok
[14:13:33] (btw: I realized replag is gone rogue after tools.wmflabs.org/guc/ was down, but other tools with toolforge replica and my script that pulls data from replica works fine)
[14:13:44] !log rebooting sca* for kernel security update
[14:13:45] <_joe_> jynus: the issues are in gerrit
[14:13:54] <_joe_> everyone: an outage is ongoing
[14:13:55] then the same patch got sent for review to refs/for/wmf/1.31.0-wmf.23 with all its ancestry (which are merged in master branch)
[14:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:01] <_joe_> well, an incident
[14:14:14] <_joe_> let's stop talking about other things
[14:14:21] <_joe_> hashar: where is the CR?
[14:14:30] then when one tries to merge the patch, Gerrit is being helpful and merges the whole ancestry leading to the proposed patch
[14:14:37] since the parents got merged in master
[14:14:38] it's https://gerrit.wikimedia.org/r/#/c/416675/
[14:14:41] <_joe_> who merged it? why was this not caught before? how can we prevent this?
[14:14:44] https://gerrit.wikimedia.org/r/#/c/416675/
[14:15:01] i missed the topic
[14:15:28] (PS1) Ottomata: Migrate webrequest text varnishkafka to Kafka jumbo [puppet] - https://gerrit.wikimedia.org/r/416683 (https://phabricator.wikimedia.org/T185136)
[14:15:39] hiii Jeff_Green
[14:16:01] it sounds to me like we need to revert the merge commit
[14:16:08] <_joe_> mobrovac: indeed
[14:16:16] and the parent of that change is a merge change of https://gerrit.wikimedia.org/r/#/c/416450/ (which is only in master branch)
[14:16:17] so yeah
[14:16:20] ok, lemme do that
[14:16:26] the side effect is that the master branch ended up being merged in the wmf one
[14:16:39] <_joe_> hashar: but if that's a cherry-pick, how can that happen?
[14:16:46] that's my question too
[14:16:48] it was not a cherry pick
[14:16:51] <_joe_> it's not a cherry-pick, I guess
[14:16:54] <_joe_> yeah it's not
[14:16:59] so say you have in master the commits:
[14:17:01] <_joe_> chasemp: <3
[14:17:07] A -> B -> C (master)
[14:17:10] you create a new commit D
[14:17:22] get it merged and now you have: A -> B -> C -> D (master)
[14:17:24] ok, it's clear now how it happened
[14:17:25] so far so good
[14:17:45] <_joe_> hashar: how are you sure it's not a cherry-pick in gerrit?
[14:17:46] '-R' was not used in git review, so it automatically pushed all of it
[14:17:51] then wmf branch got forked off A so you get that wmf branch being: A -> X -> Y -> Z
[14:17:55] <_joe_> mobrovac: ahhhh
[14:17:58] <_joe_> ok
[14:18:01] then one proposed D (from master) to the wmf branch
[14:18:04] <_joe_> mystery solved
[14:18:22] <_joe_> it's pretty damning we didn't catch this if not by good practice, though
[14:18:45] <_joe_> say mobrovac only checked the file he was updating for diffs
[14:18:53] * _joe_ shivers
[14:19:07] <_joe_> ok, we got lucky
[14:19:12] and that's my fault.. Sorry for taking everyone's time
[14:19:15] <_joe_> and mobrovac was good :)
[14:19:33] <_joe_> Pchelolo: it's a human error, that's the kind of things tools should maybe protect against?
[14:19:37] -R ??
[14:19:44] <_joe_> tbh, the gerrit UI isn't very clear
[14:20:01] Operations, Analytics-Kanban: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4027078 (Ottomata) > We run a proxy in front of the eventlogging database, called m4-master If we can't write to the failover, then we probab...
[14:20:05] <_joe_> and git-review is bad :P
[14:20:21] _joe_: ye, but it's still an error..
[14:20:22] <_joe_> anyways, let's keep these considerations for later
[14:20:25] PROBLEM - eventlogging_sync processes on db1108 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh
[14:20:29] <_joe_> let's fix the issues
[14:20:33] that is not an issue with git-review
[14:20:41] it is gerrit being friendly
[14:20:44] <_joe_> hashar: do you think reverting that change should do it?
[14:20:57] <_joe_> hashar: that's why ops/puppet is ff-only
[14:21:11] !log rebooting labweb* for kernel security update
[14:21:12] that is slightly different :D
[14:21:19] but yeah reverting 88b947427c - (origin/wmf/1.31.0-wmf.23) Merge "[JobQueueSecondTestQueue] Support read-only mode." into wmf/1.31.0-wmf.23
[14:21:21] that should do it
[14:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:30] <_joe_> ok
[14:21:35] then one has to cherry pick again e0fc520953 - [JobQueueSecondTestQueue] Support read-only mode.
[14:21:36] <_joe_> who's going to manage that?
[14:21:44] sending the revert to gerrit now
[14:21:46] who ever broke it fix it? :D
[14:21:56] heh :)
[14:21:59] <_joe_> hashar: well if I went by that logic
[14:22:08] <_joe_> y'all would have a much more miserable life
[14:22:22] <_joe_> so let's not say such things, thanks a lot.
[14:22:30] but I don't think the commits got pushed to the wmf branch
[14:23:00] the proposed wmf patch had ancestors that got merged in master but were not in wmf branch
[14:23:23] and my understanding is that in such case Gerrit kindly merges all the ancestors commits for you
[14:23:45] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb_80: Servers labweb1001.wikimedia.org are marked down but pooled
[14:23:45] PROBLEM - PyBal IPVS diff check on lvs1003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([labweb1001.wikimedia.org])
[14:23:46] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb_80: Servers labweb1001.wikimedia.org are marked down but pooled
[14:23:56] PROBLEM - LVS HTTP IPv4 on labweb.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.40 and port 80: No route to host
[14:24:02] <_joe_> hashar: because you have a non-ff-only merge strategy
[14:24:05] PROBLEM - PyBal IPVS diff check on lvs1010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([labweb1001.wikimedia.org])
[14:24:15] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb_80: Servers labweb1001.wikimedia.org are marked down but pooled
[14:24:33] andrewbogott: ^
[14:24:35] chasemp: anything we should worry about? ^^^
[14:24:39] hm sending the patch timed out, trying again
[14:24:42] (CR) Revi: [C: 1] "\o/" [mediawiki-config] - https://gerrit.wikimedia.org/r/416489 (https://phabricator.wikimedia.org/T188869) (owner: Subramanya Sastry)
[14:24:45] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[14:24:46] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[14:24:53] volans: I believe moritzm rebooted labweb?
[14:24:55] RECOVERY - LVS HTTP IPv4 on labweb.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 453 bytes in 0.001 second response time
[14:24:56] ah it's recovering
[14:25:04] that needs to be silenced
[14:25:06] yeah, moritz was going to reboot this morning
[14:25:11] ok
[14:25:15] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy
[14:25:29] ack, I didn't see the related !log
[14:26:07] _joe_: mobrovac: the issue is https://gerrit.wikimedia.org/r/Documentation/config-project-config.html#receive.rejectImplicitMerges
[14:26:16] <_joe_> yep
[14:26:23] <_joe_> let's not change it now in a hurry
[14:26:39] (Abandoned) Andrew Bogott: role::mariadb::ferm: Allow db access to labweb [puppet] - https://gerrit.wikimedia.org/r/416598 (https://phabricator.wikimedia.org/T188915) (owner: Andrew Bogott)
[14:26:40] <_joe_> let's fix the issue, start swat
[14:26:52] <_joe_> I've some things to cherry-pick myself
[14:27:16] hashar: _joe_: i can't send the change to gerrit, it times out, any of you want to try git revert -m 1 88b947427ce989eacf11888b96ffeec69307b189 and send it?
[14:27:34] on wmf.23, ofc :)
[14:27:39] <_joe_> mobrovac: ok, lemme clone mediawiki/core on the new computer :P
[14:27:45] looooooool
[14:27:56] <_joe_> installed yesterday
[14:27:57] <_joe_> :P
[14:28:26] RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[14:28:43] (PS2) Ottomata: Parameterize kafka_cluster_name for streams_check job [puppet] - https://gerrit.wikimedia.org/r/415636 (https://phabricator.wikimedia.org/T185136)
[14:28:45] RECOVERY - PyBal IPVS diff check on lvs1003 is OK: OK: no difference between hosts in IPVS/PyBal
[14:29:05] RECOVERY - PyBal IPVS diff check on lvs1010 is OK: OK: no difference between hosts in IPVS/PyBal
[14:29:06] RECOVERY - puppet last run on restbase1011 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[14:29:06] RECOVERY - puppet last run on db1073 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[14:29:25] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:29:32] <_joe_> James_F: we'll start SWAT soon(TM)
[14:29:35] Operations, Analytics-Kanban: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4027191 (Marostegui) >>! In T188991#4027078, @Ottomata wrote: >> We run a proxy in front of the eventlogging database, called m4-master > > I...
[14:29:50] * James_F grins.
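Side note: hashar's A/B/C/D walkthrough above can be modelled on a toy commit graph. This is a minimal sketch, assuming the wmf branch forked off `A` and that submitting `D` unrebased makes Gerrit create a merge commit with parents `Z` and `D`. The names `M` (that merge commit) and `D2` (a cherry-pick of `D` onto `Z`) are made up for illustration. The sketch compares what each option adds to the wmf branch.

```python
def ancestors(parents, tip):
    """All commits reachable from tip (tip included) by following parent links."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, ()))
    return seen


def added_to_branch(parents, old_tip, new_tip):
    """Commits that become part of the branch when its tip moves old_tip -> new_tip."""
    return ancestors(parents, new_tip) - ancestors(parents, old_tip)


parents = {
    # master: A -> B -> C -> D
    "A": [], "B": ["A"], "C": ["B"], "D": ["C"],
    # wmf branch, forked off A: A -> X -> Y -> Z
    "X": ["A"], "Y": ["X"], "Z": ["Y"],
    # hypothetical merge commit created when D is submitted against the wmf branch
    "M": ["Z", "D"],
    # hypothetical cherry-pick of D onto Z: a new commit whose only parent is Z
    "D2": ["Z"],
}

via_merge = added_to_branch(parents, "Z", "M")
via_cherry_pick = added_to_branch(parents, "Z", "D2")
print("merge adds:", sorted(via_merge))
print("cherry-pick adds:", sorted(via_cherry_pick))
```

In this model the merge drags `B` and `C` into the wmf branch along with `D`, while the cherry-pick adds a single commit. That is the implicit merge the `receive.rejectImplicitMerges` setting linked above is meant to reject. It is also why the proposed fix was `git revert -m 1` on the merge commit: `-m 1` names the first parent (the wmf side, `Z` here) as the mainline to return to.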
[14:30:03] <_joe_> we'll start with your patches, I can do mine off-window too
[14:30:47] (CR) Ema: "pcc output looks good https://puppet-compiler.wmflabs.org/compiler02/10282/" [puppet] - https://gerrit.wikimedia.org/r/416652 (https://phabricator.wikimedia.org/T188545) (owner: Ema)
[14:31:05] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:31:07] <_joe_> oh this is interesting
[14:31:09] (CR) Ottomata: [C: 2] Parameterize kafka_cluster_name for streams_check job [puppet] - https://gerrit.wikimedia.org/r/415636 (https://phabricator.wikimedia.org/T185136) (owner: Ottomata)
[14:31:15] <_joe_> git revert fails
[14:31:15] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[14:31:19] <_joe_> how nice
[14:31:25] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:31:35] RECOVERY - puppet last run on db1082 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:31:35] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:31:35] RECOVERY - puppet last run on druid1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:31:35] RECOVERY - puppet last run on elastic1048 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:31:35] RECOVERY - puppet last run on aqs1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:31:45] <_joe_> both modified: includes/api/i18n/pt.json
[14:31:45] <_joe_> both modified: languages/i18n/be-tarask.json
[14:31:45] <_joe_> both modified: languages/i18n/de.json
[14:31:45] <_joe_> both modified: languages/i18n/el.json
[14:31:55] RECOVERY - puppet last run on ms-be1028 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:32:05] RECOVERY - puppet last run on db1097 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[14:32:10] _joe_: are you on the wmf.23 branch? it didn't fail for me
[14:32:15] mobrovac: _joe_: https://gerrit.wikimedia.org/r/416686 Revert "Merge "[JobQueueSecondTestQueue] Support read-only mode." into ...
[14:32:15] <_joe_> mobrovac: I am
[14:32:45] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[14:32:45] RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[14:33:07] <_joe_> oh ok I see what I did miss
[14:33:09] <_joe_> heh
[14:33:14] lgtm hashar
[14:34:03] _joe_: for future ref, what did you miss? (so as to know the pitfall)
[14:34:09] and the other change gotta be sent again
[14:34:20] <_joe_> mobrovac: I have an alias co which does checkout -b
[14:34:20] Operations, Analytics-Kanban: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4027223 (elukey) >>! In T188991#4027191, @Marostegui wrote: >>>! In T188991#4027078, @Ottomata wrote: >>> We run a proxy in front of the event...
[14:34:23] <_joe_> and I used that
[14:34:25] (PS2) Ottomata: Migrate webrequest text varnishkafka to Kafka jumbo [puppet] - https://gerrit.wikimedia.org/r/416683 (https://phabricator.wikimedia.org/T185136)
[14:34:28] <_joe_> out of habit
[14:34:36] ah kk
[14:34:50] (CR) Jcrespo: [C: 2] Consider as busy all queries that are not in Sleep state [software] - https://gerrit.wikimedia.org/r/415888 (https://phabricator.wikimedia.org/T188505) (owner: Jcrespo)
[14:35:06] (CR) Jcrespo: [C: 2] Add Proxysql creation debian package script [software] - https://gerrit.wikimedia.org/r/404153 (owner: Jcrespo)
[14:35:43] are we ok now?
[14:35:53] <_joe_> jynus: no
[14:36:06] <_joe_> jynus: as in, we can't deploy for now
[14:36:18] <_joe_> until https://gerrit.wikimedia.org/r/#/c/416686/ is merged
[14:36:33] !log beginning migration of webrequest text varnishkafka logs from Kafka analytics to Kafka jumbo-eqiad T185136
[14:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:50] T185136: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. - https://phabricator.wikimedia.org/T185136
[14:36:54] <_joe_> jenkins is being lazy or what?
[14:37:04] Pchelolo: mobrovac: and then https://gerrit.wikimedia.org/r/416687 :)
[14:37:09] (Merged) jenkins-bot: Consider as busy all queries that are not in Sleep state [software] - https://gerrit.wikimedia.org/r/415888 (https://phabricator.wikimedia.org/T188505) (owner: Jcrespo)
[14:37:41] Pchelolo: mobrovac the big question I have now is whether that huge merge commit ended up being deployed
[14:37:51] <_joe_> hashar: no.
[14:37:57] heh hashar, indeed, but once the first one is merged, we should let James_F see his patch deployed
[14:37:59] <_joe_> hashar: I confirmed that ~ 30 mins ago
[14:38:16] <_joe_> what's in prod is the sane wmf23 branch
[14:38:21] hashar: no, it didn't even get pulled on tin
[14:38:23] after merging, a diff could be made between mediawiki and mediawiki-staging to be sure?
[14:38:36] <_joe_> jynus: we don't even need that
[14:38:39] ok
[14:38:41] excellent
[14:38:45] <_joe_> what's on mediawiki-staging right now is sane
[14:38:49] <_joe_> and correct
[14:38:58] <_joe_> it's gerrit that's broken
[14:39:05] so the deployment halted when one on tin looked at the diff between HEAD and origin/wmf/xxx
[14:39:07] so that is good
[14:39:16] yes
[14:39:19] yeah, but eventually you want to reconcile that :-)
[14:39:20] _joe_: s/broken/misconfigured/ :D
[14:39:40] <_joe_> hashar: w/e, can we merge that patch?
[14:39:45] (CR) Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/10284/" [puppet] - https://gerrit.wikimedia.org/r/416683 (https://phabricator.wikimedia.org/T185136) (owner: Ottomata)
[14:39:50] (CR) Ottomata: [C: 2] Migrate webrequest text varnishkafka to Kafka jumbo [puppet] - https://gerrit.wikimedia.org/r/416683 (https://phabricator.wikimedia.org/T185136) (owner: Ottomata)
[14:40:04] <_joe_> jenkins seems not to have picked it up
[14:40:52] oh it did
[14:40:57] it is waiting on php codesniffer https://integration.wikimedia.org/ci/job/mediawiki-core-phpcs-docker/6282/console
[14:41:21] eeek almost 4 minutes spent on downloading submodules bah
[14:41:30] <_joe_> sigh
[14:41:43] <_joe_> https://cdn.meme.am/cache/instances/folder46/65289046.jpg never goes out of fashion :D
[14:41:57] hahahaha
[14:41:59] indeed
[14:42:00] ahaha
[14:42:13] <_joe_> I don't use that much anymore, tbh
[14:42:16] <_joe_> we got better
[14:42:27] <_joe_> and now this seems a problem with an external source
[14:42:32] !log rebooting maps1* (eqiad) for kernel security update completed
[14:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:47] bah git submodule processes them serially
[14:43:02] <_joe_> hashar: it's also spending like 5 minutes running codesniffer
[14:43:03] (and php codesniffer is super slow as well)
[14:43:25] <_joe_> and it took like 1 minute on my old underpowered nuc
[14:43:31] <_joe_> :P
[14:43:37] come on
[14:43:41] we have the exact same NUC
[14:43:48] I can still play games on it!!!
[14:44:02] <_joe_> hashar: me, not anymore
[14:44:27] ok I merged the revert commit
[14:44:34] so we can pull it on tin
[14:44:40] then eventually do a diff ?
[14:44:54] <_joe_> hashar: yeah
[14:44:57] <_joe_> lemme do that
[14:45:57] <_joe_> sigh what a shitshow
[14:46:11] Operations, Analytics-Kanban: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4027272 (Marostegui) >>! In T188991#4027223, @elukey wrote: >>>! In T188991#4027191, @Marostegui wrote: >>>>! In T188991#4027078, @Ottomata wr...
[14:46:13] <_joe_> I would go as far as to say we should've forcibly pushed the correct branch
[14:46:14] !log tin: /srv/mediawiki-staging/php-1.31.0-wmf.23 rebased on tip of https://gerrit.wikimedia.org/r/#/c/416686/ (that revert a merge of master branch)
[14:46:28] <_joe_> hashar: I said I was doing that
[14:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:33] doing what?
[14:46:36] force merge? no
[14:46:45] <_joe_> nope
[14:46:56] <_joe_> remote update on tin
[14:47:07] <_joe_> then tin:/srv/mediawiki-staging/php-1.31.0-wmf.23$ git log HEAD~2..origin/wmf/1.31.0-wmf.23
[14:47:17] git log --oneline --first-parent -n 5
[14:47:31] you would get the reference / good commit: f46c818fae Update git submodules
[14:47:37] <_joe_> yes
[14:47:43] which should be a noop diff with the revert merge one a1b503f556 Revert "Merge "[JobQueueSecondTestQueue] Support read-only mode." into wmf/1.31.0-wmf.23"
[14:47:56] (if git diff works properly)
[14:48:04] so anyway, it is rebased
[14:48:10] <_joe_> ok
[14:48:13] <_joe_> so, SWAT
[14:48:18] yay
[14:48:28] <_joe_> James_F is up
[14:48:33] thnx _joe_ hashar, sorry for the mess (really didn't see it coming)
[14:48:37] <_joe_> I can manage my changes w/mobrovac
[14:48:39] Pchelolo: mobrovac: what do we do with https://gerrit.wikimedia.org/r/#/c/416687/ [JobQueueSecondTestQueue] Support read-only mode.
[14:48:47] Sure.
[14:48:50] <_joe_> hashar: let's first free up James
[14:48:52] Pchelolo: mobrovac because that patch never got deployed :)
[14:48:55] Mine is an untestable no-op however.
[14:48:57] <_joe_> marko said so before
[14:49:24] lol James_F
[14:49:31] we've got only quality patches today :D
[14:49:49] <_joe_> well mine are potentially destructive
[14:49:52] (CR) Hashar: [C: 2] Article counts: Change 'comma' method to 'any' [mediawiki-config] - https://gerrit.wikimedia.org/r/416330 (https://phabricator.wikimedia.org/T188472) (owner: EddieGP)
[14:50:08] James_F: note that I have ZERO clue how to refresh the article count :D
[14:50:28] hashar: There's a cron script in puppet that uses these.
[14:50:30] !log update pybal to 1.15.0 on lvs1010
[14:50:37] sorry for the mess everyone..
[14:50:42] Pchelolo: no worries
[14:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:20] back sorry
[14:57:39] Pchelolo: mobrovac: _joe_: I created a placeholder at https://wikitech.wikimedia.org/wiki/Incident_documentation/20180306-20180306-MasterMergedInWMFBranch
[14:58:04] (PS2) Hashar: Article counts: Change 'comma' method to 'any' [mediawiki-config] - https://gerrit.wikimedia.org/r/416330 (https://phabricator.wikimedia.org/T188472) (owner: EddieGP)
[14:58:06] ok great thnx hashar! we'll fill the voids :)
[14:58:08] <_joe_> hashar: thanks
[14:58:12] (CR) Hashar: [C: 2] Article counts: Change 'comma' method to 'any' [mediawiki-config] - https://gerrit.wikimedia.org/r/416330 (https://phabricator.wikimedia.org/T188472) (owner: EddieGP)
[14:58:40] mobrovac: and thanks for the git revert -m 1 command :]
[14:58:52] np
[14:58:54] James_F: had to rebase :d
[14:59:58] (Merged) jenkins-bot: Article counts: Change 'comma' method to 'any' [mediawiki-config] - https://gerrit.wikimedia.org/r/416330 (https://phabricator.wikimedia.org/T188472) (owner: EddieGP)
[15:00:12] (CR) jenkins-bot: Article counts: Change 'comma' method to 'any' [mediawiki-config] - https://gerrit.wikimedia.org/r/416330 (https://phabricator.wikimedia.org/T188472) (owner: EddieGP)
[15:01:23] James_F: syncing
[15:01:49] (PS1) Milimetric: Add pingback reportupdater job [puppet] - https://gerrit.wikimedia.org/r/416698
[15:01:56] Thanks.
[15:02:17] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Article counts: Change 'comma' method to 'any' - T188472 (duration: 01m 00s)
[15:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:34] T188472: Give up on 'comma' article-count method - https://phabricator.wikimedia.org/T188472
[15:02:34] (CR) Ottomata: [C: 2] Add pingback reportupdater job [puppet] - https://gerrit.wikimedia.org/r/416698 (owner: Milimetric)
[15:03:16] <_joe_> ok
[15:03:16] (PS8) ArielGlenn: dumps: Refactor profiles and hierakeys in web/ [puppet] - https://gerrit.wikimedia.org/r/416502 (https://phabricator.wikimedia.org/T168486) (owner: Madhuvishy)
[15:03:23] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/416699
[15:03:27] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/416699
[15:03:53] <_joe_> mobrovac: I think you can re-submit your change if you want to
[15:04:04] <_joe_> hashar: it's you SWATting then?
[15:04:05] <_joe_> :P
[15:04:20] <_joe_> I have three changes of mine in the list
[15:04:28] mine is a no-op
[15:04:33] <_joe_> marostegui: we're still doing merges for deploys, there was a problem
[15:04:38] (CR) ArielGlenn: [C: 2] dumps: Refactor profiles and hierakeys in web/ [puppet] - https://gerrit.wikimedia.org/r/416502 (https://phabricator.wikimedia.org/T168486) (owner: Madhuvishy)
[15:04:38] so
[15:04:39] <_joe_> mobrovac: so definitely do deploy yours
[15:04:43] _joe_: yeah, I was not planning to deploy :)
[15:04:54] k
[15:05:14] hashar: will you merge/sync or shall i?
[15:05:20] https://gerrit.wikimedia.org/r/#/c/416687/
[15:05:23] I did sync James patch
[15:05:38] https://gerrit.wikimedia.org/r/#/c/416687/ yeah you can do it
[15:05:42] kk going
[15:05:43] I have a meeting in 20 minutes
[15:06:10] _joe_: for the rest of your patches, I guess you can deploy them yourself can't you?
[15:06:17] <_joe_> I guess, yes
[15:06:41] good!
[15:06:49] I will still be connected to irc anyway :]
[15:09:29] _joe_: still waiting on jenkins for my patch ...
[15:09:33] !log update to pybal 1.15.0 on lvs5003
[15:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:51] <_joe_> mobrovac: I can see your hair got a tad whiter since you merged the patch
[15:10:00] lol
[15:10:04] i think so too
[15:10:34] i'm usually not hungry at this hour, but now i'm starving, it must have something to do with this morning :)
[15:14:59] <_joe_> fun fact: you can listen to 15 songs by the Minutemen while waiting for a mediawiki/core patch to be merged by jenkins
[15:16:07] <_joe_> mobrovac: it's merged!!!
[15:16:20] yaaaay [15:16:23] (03PS2) 10Gehel: [WIP] wdqs: configure the new internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/415872 (https://phabricator.wikimedia.org/T187766) [15:16:24] ok, going to sync now [15:16:26] <_joe_> quick, deploy before jenkins changes its mind [15:16:33] So I think I understand why the merge confusion, but why was the revert so large?- was it a very old patch [15:16:56] <_joe_> jynus: rather the contrary [15:17:03] (03CR) 10jerkins-bot: [V: 04-1] [WIP] wdqs: configure the new internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/415872 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [15:17:08] it merged lots of new things, then? [15:17:10] <_joe_> the patch was new, and master is 92 commits ahead of wmf23 [15:17:19] ah, ok, gotcha now [15:17:21] <_joe_> well, at that patchset at least [15:17:27] yes, I see [15:17:52] thanks [15:19:20] !log mobrovac@tin Synchronized php-1.31.0-wmf.23/includes/jobqueue/JobQueueSecondTestQueue.php: [JobQueueSecondTestQueue] Support read-only mode - T185052 (duration: 00m 58s) [15:19:29] kk done _joe_ ^ [15:19:30] (03PS1) 10Gehel: wdqs: replace ::base::firewall with the appropriate profile [puppet] - 10https://gerrit.wikimedia.org/r/416701 [15:19:34] <_joe_> ok [15:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:38] T185052: Migrate RefreshLinks job to kafka - https://phabricator.wikimedia.org/T185052 [15:19:54] _joe_: need help with your patches? (i can't help you with the wait on jenkins, though :P) [15:19:55] <_joe_> let's go then [15:20:34] <_joe_> mobrovac: well if you want to double-check what I do, that'd be great [15:20:42] k [15:21:31] <_joe_> mobrovac: I just skipped jenkins, FYI. 
The patch was correctly rebased and passed jenkins already [15:21:44] sounds good [15:21:50] <_joe_> I'm not waiting another 10 minutes for no good reason [15:23:04] yeah and this one is a no-op without the config change mostly anyway [15:23:13] <_joe_> no it's not [15:23:17] <_joe_> but mostly, yes [15:23:24] <_joe_> so lemme test it on mwdebug1001 [15:23:25] <_joe_> anyways [15:27:20] 10Operations, 10ops-eqiad, 10User-Eevans: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T185494#4027411 (10Cmjohnson) the ssd for /dev/sdc has been replaced. the raid needs to be fixed. resolve ticket once you're satisfied. [15:28:06] !log rebooting bromine for kernel security update [15:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:36] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 2 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4027415 (10Cmjohnson) The system board for this is scheduled to be changed out Wednesday 07FEB18 [15:28:39] <_joe_> mobrovac: ok I'm deploying everywhere now [15:28:52] _joe_: mobrovac and I have added a few more details about the Gerrit internals https://wikitech.wikimedia.org/wiki/Incident_documentation/20180306-MasterMergedInWMFBranch :D [15:29:14] k _joe_ [15:29:53] impressive hashar :) [15:29:55] !log oblivian@tin Started scap: Deploying Expose the latest modified index seen by EtcdConfig [15:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:14] <_joe_> I'm doing a scap sync, btw [15:30:33] <_joe_> 15:30:22 Copying to tin.eqiad.wmnet from tin.eqiad.wmnet [15:30:36] <_joe_> wtf [15:30:43] <_joe_> what's that hashar ? 
[15:30:45] yeah [15:30:48] <_joe_> that's very very wrong [15:30:54] that copies from /srv/mediawiki-staging to /srv/mediawiki I guess [15:30:58] <_joe_> nope [15:31:04] <_joe_> that's the sync-master stage IIRC [15:31:36] <_joe_> also, didn't scap tell you when it was syncing to individual hosts [15:31:37] <_joe_> ? [15:32:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Memory test failure on elastic1021 - https://phabricator.wikimedia.org/T188595#4027432 (10Cmjohnson) Swapped DIMM A3 to B2 to see if the error follows the DIMM. Powered on [15:33:05] RECOVERY - DPKG on restbase-dev1006 is OK: All packages OK [15:33:48] the sync-masters stage happens after that copying, but i have no idea where it is being copied to [15:35:35] RECOVERY - puppet last run on restbase-dev1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:35:42] <_joe_> mobrovac: yeah it's being super-slow compared to what I was used to [15:35:45] <_joe_> just that [15:36:29] <_joe_> like more than 4 minutes already [15:37:52] wow [15:38:15] RECOVERY - Host analytics1062 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [15:38:44] <_joe_> we're at 7 minutes and counting [15:39:44] !log oblivian@tin Finished scap: Deploying Expose the latest modified index seen by EtcdConfig (duration: 09m 49s) [15:39:53] 10Operations, 10ops-eqiad, 10Analytics-Kanban: DIMM errors for analytics1062 - https://phabricator.wikimedia.org/T187164#4027495 (10Cmjohnson) I swapped the B side DIIMM to the A side to see if the error returns and follows the DIMM. Powered server on, let's check back in a day or so. [15:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:51] <_joe_> mobrovac: I'm going with https://gerrit.wikimedia.org/r/#/c/416470/ now [15:41:13] k [15:41:19] (03CR) 10Giuseppe Lavagetto: [C: 032] Fetch the last modified index in etcd.php, and expose it via siteinfo. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/416470 (https://phabricator.wikimedia.org/T182597) (owner: 10Giuseppe Lavagetto) [15:41:32] <_joe_> let jenkins do his thing here though [15:41:37] 10Operations, 10ops-eqiad, 10hardware-requests: Decommission host erbium - https://phabricator.wikimedia.org/T185226#4027513 (10Cmjohnson) [15:41:39] 10Operations, 10ops-eqiad, 10hardware-requests: Decommission host erbium - https://phabricator.wikimedia.org/T185226#3910067 (10Cmjohnson) 05Open>03Resolved [15:41:50] should be quicker [15:42:28] 10Operations, 10ops-eqiad: Check serial console of rhenium - https://phabricator.wikimedia.org/T188905#4027519 (10Cmjohnson) @MoritzMuehlenhoff I will need to power the server off for 1-2mins...let me know a good time [15:42:53] (03Merged) 10jenkins-bot: Fetch the last modified index in etcd.php, and expose it via siteinfo. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416470 (https://phabricator.wikimedia.org/T182597) (owner: 10Giuseppe Lavagetto) [15:42:57] _joe_: can I deploy or should I wait? [15:43:13] <_joe_> marostegui: can you hold on for some more time? [15:43:14] (03CR) 10jenkins-bot: Fetch the last modified index in etcd.php, and expose it via siteinfo. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416470 (https://phabricator.wikimedia.org/T182597) (owner: 10Giuseppe Lavagetto) [15:43:19] sure, even till tomorrow! 
[15:43:23] <_joe_> I'm still doing the backlog [15:43:29] <_joe_> not too long hopefully [15:43:46] ok, just ping me when you are done, but keep in mind that I can deploy tomorrow if needed, so no rush from my side [15:44:40] <_joe_> syncing to mwdebug1002 [15:45:00] * volans checking [15:45:13] "wmfEtcdLastModifiedIndex": null [15:45:44] !log rebooting ununpentium for kernel security update [15:45:57] _joe_: i'll rebase https://gerrit.wikimedia.org/r/#/c/416482/ in the meantime [15:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:11] mobrovac: there are some comments on that one [15:46:16] <_joe_> mobrovac: wait [15:46:23] k [15:46:32] <_joe_> volans: ah ofc [15:46:36] volans: yup, your nitpicks :) [15:46:41] <_joe_> you didn't set the header X-Wikimedia-Debug [15:47:01] <_joe_> "wmfEtcdLastModifiedIndex": 156674 [15:47:04] <_joe_> :) [15:47:05] <_joe_> it works [15:47:26] arturo: what was the issue with replag? [15:47:44] yep, confirmed, I just used the same for prod hosts for this too, sorry [15:47:53] <_joe_> ok cool [15:47:57] <_joe_> I'm syncing [15:48:21] and I can confirm the id is correct [15:48:48] (03PS1) 10Muehlenhoff: Add library hint for libvpx [puppet] - 10https://gerrit.wikimedia.org/r/416703 [15:49:22] (03CR) 10Muehlenhoff: [C: 032] Add library hint for libvpx [puppet] - 10https://gerrit.wikimedia.org/r/416703 (owner: 10Muehlenhoff) [15:49:24] marostegui: unrelated to DB [15:49:42] https://phabricator.wikimedia.org/T189018 [15:50:15] Thanks [15:50:41] !log oblivian@tin Synchronized wmf-config: Expose etcd last modified index (duration: 01m 00s) [15:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:22] <_joe_> marostegui: you can go on and deploy your change while I amend volans's nitpicks [15:51:29] _joe_: Thanks! 
[15:51:40] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416699 (owner: 10Marostegui) [15:51:54] !log installing libvpx security updates [15:52:08] 10Operations, 10ops-eqiad, 10Analytics-Kanban: DIMM errors for analytics1062 - https://phabricator.wikimedia.org/T187164#4027595 (10elukey) Thanks! [15:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:21] arturo: one thing you may find is that because we have lately been load-balancing the analytics service [15:52:43] it may flap between 2 values (the 2 backends) [15:53:12] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416699 (owner: 10Marostegui) [15:53:28] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416699 (owner: 10Marostegui) [15:54:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1094 after alter table (duration: 00m 57s) [15:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:50] !log deploying new query killer logic to all wikidata (s8) db replicas T188505 [15:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:05] T188505: Investigate why query killer didn't kill 1-hour long queries - https://phabricator.wikimedia.org/T188505 [15:55:52] (03PS1) 10Marostegui: db-eqiad.php: Depool db1069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416706 (https://phabricator.wikimedia.org/T187089) [15:57:11] _joe_: that would be my last deploy for the day ^ [15:57:20] <_joe_> ok [15:57:29] <_joe_> I'll finish my last deploy of the day then [15:57:41] Cool - will wait for you [15:58:23] <_joe_> marostegui: no please go on [15:58:27] <_joe_> I meant after you [15:58:27] ah, ok :) [15:58:32] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1069 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/416706 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [15:59:00] (03PS1) 10Cmjohnson: Removing mgmt dns wtp1001-1024 [dns] - 10https://gerrit.wikimedia.org/r/416707 (https://phabricator.wikimedia.org/T177374) [15:59:45] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416706 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [15:59:59] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416706 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [16:00:24] (03PS1) 10Zoranzoki21: Deploy TemplateStyles to zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416708 (https://phabricator.wikimedia.org/T189022) [16:00:54] (03PS2) 10Zoranzoki21: Deploy TemplateStyles to zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416708 (https://phabricator.wikimedia.org/T189022) [16:01:06] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1069 for alter table (duration: 00m 57s) [16:01:10] _joe_: I am done [16:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:25] <_joe_> marostegui: cool, thanks [16:01:42] !log Deploy schema change on db1069 - T187089 T185128 T153182 [16:01:49] (03PS5) 10Giuseppe Lavagetto: Fetch data from etcd on every server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416482 (https://phabricator.wikimedia.org/T182597) [16:01:54] (03CR) 10Giuseppe Lavagetto: Fetch data from etcd on every server (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416482 (https://phabricator.wikimedia.org/T182597) (owner: 10Giuseppe Lavagetto) [16:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:58] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [16:01:58] T153182: Perform schema change 
to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [16:01:59] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [16:01:59] <_joe_> volans: ^^ [16:02:03] looking [16:03:45] (03CR) 10Volans: [C: 031] "LGTM at the best of my knowledge of wmf-config ;)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416482 (https://phabricator.wikimedia.org/T182597) (owner: 10Giuseppe Lavagetto) [16:04:03] (03CR) 10Giuseppe Lavagetto: [C: 032] Fetch data from etcd on every server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416482 (https://phabricator.wikimedia.org/T182597) (owner: 10Giuseppe Lavagetto) [16:04:09] <_joe_> let's go volans :P [16:04:20] ok, going to the pub :-P [16:05:28] (03Merged) 10jenkins-bot: Fetch data from etcd on every server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416482 (https://phabricator.wikimedia.org/T182597) (owner: 10Giuseppe Lavagetto) [16:08:24] <_joe_> so, testing on mwdebug1002 [16:08:42] ok [16:08:57] should be a noop there [16:09:13] <_joe_> not exactly [16:09:26] <_joe_> now you get the etcd index correctly even if you don't use x-wm-debug [16:09:52] yes ofc [16:09:55] (03CR) 10jenkins-bot: Fetch data from etcd on every server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416482 (https://phabricator.wikimedia.org/T182597) (owner: 10Giuseppe Lavagetto) [16:10:07] I'm navigating with the extension, so far so good [16:10:36] do you see any error around? 
[16:10:51] <_joe_> nope [16:11:41] !log oblivian@tin Synchronized wmf-config: Fetch data from etcd on all appservers (duration: 01m 01s) [16:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:58] <_joe_> ok synced [16:12:07] yep, seeing the index on random mw hosts [16:12:26] checking grafana dashboard for etcd cluster [16:12:42] on both DCs [16:12:49] <_joe_> me too [16:12:53] <_joe_> I'm checking hosts [16:13:06] <_joe_> so the first good thing is - the randomization seems to work well [16:13:41] the cluster-wide dashboard doesn't have the number of connections :( [16:13:55] <_joe_> network usage is up, but not by as much as I'd fear [16:14:12] yep [16:14:33] and evenly distributed [16:14:37] <_joe_> yes [16:14:41] <_joe_> that's great [16:15:12] <_joe_> ok, I'd say let it stew for the day, and by morning swat tomorrow we might be able to deploy the actual change everywhere [16:15:47] yeah, I need to adapt the monitoring patch a bit and I can deploy it, probably today too I'd say [16:16:14] <_joe_> oh right, or it will error out in codfw [16:16:19] <_joe_> yeah, do that :) [16:16:50] yep the bash one almost done, the python one will be trickier because icinga doesn't know the DC, I might have to pass it to the check as a parameter [16:17:10] <_joe_> I was about to suggest that [16:17:25] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, 10Patch-For-Review: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#4027714 (10Halfak) [16:17:40] network levelling up at a very reasonable value [16:18:12] <_joe_> yup [16:18:50] and no other metric affected on the etcd cluster, all looks good so far [16:22:34] (03CR) 10Filippo Giunchedi: initial commit of 4.4.0-1 (032 comments) [debs/puppetdb] (4.4.0-1) - 10https://gerrit.wikimedia.org/r/415591 (owner: 10Herron) [16:27:35] so it seems etcd work is going well? [16:29:13] so far :) [16:29:47] is there something I can do to help? 
[16:30:29] thanks, but none that I can think of , I'm just keeping an eye on the graphs [16:31:08] as of now MW is only polling the data from etcd, but still using the hardcoded master DC [16:31:14] we plan to deploy the final patch tomorrow [16:32:33] jynus: looking at kibana there was a quick spike of: [16:32:36] Wikimedia\Rdbms\LoadBalancer::getRandomNonLagged: server db1114 is not replicating? [16:32:39] 10Operations, 10ops-eqiad, 10Analytics-Cluster, 10Analytics-Kanban, and 2 others: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4027752 (10Ottomata) Bump, what's the status on these? [16:32:49] volans: checking [16:33:01] around 16:30 UTC [16:33:04] very quick [16:33:08] so right now? [16:33:18] 3 minutes ago :D [16:33:37] lasted few seconds honestly now that I've zoomed in [16:33:44] yeah the server looks fine [16:33:49] db1114? [16:33:59] but generated like 4k error lines in the logs [16:34:08] yeah, that is an api server in s1 [16:35:28] I was checking if I touch it, but no, I only did s8 [16:35:32] *touched [16:36:00] (03CR) 10BBlack: [C: 031] varnish: cleanup after upgrade to v5 [puppet] - 10https://gerrit.wikimedia.org/r/416652 (https://phabricator.wikimedia.org/T188545) (owner: 10Ema) [16:36:06] is it the new pooled api server? [16:36:10] yep [16:36:15] how many server are there now there? [16:36:18] *servers [16:36:19] 3 apis [16:36:25] all large? [16:36:30] 2 large + 1 160G [16:36:37] 10Operations, 10DBA, 10cloud-services-team: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4027759 (10chasemp) a:03Andrew @andrew volunteered to help steer this change :) [16:36:39] We have basically exchange 1 60GB for one large [16:37:41] (03CR) 10Giuseppe Lavagetto: [C: 031] "I checked both files and the port seems correct." 
[puppet] - 10https://gerrit.wikimedia.org/r/416664 (https://phabricator.wikimedia.org/T188623) (owner: 10Filippo Giunchedi) [16:37:43] it doesn't have to be new, but maybe we should give more weight to the large ones? [16:37:46] *now [16:38:03] jynus: they both have 300 of main traffic + api [16:38:04] although that would create "more errors" not less [16:38:09] oh [16:38:27] I see what you mean [16:38:51] !log sbisson@tin Started deploy [kartotherian/deploy@255401a]: Testing update-deps2 branch [16:38:53] I would reduce then normal traffic on all except one [16:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:15] something like 100+api maybe? [16:39:26] e.g. maybe 1066 move it mainly to main [16:39:42] (03CR) 10Giuseppe Lavagetto: "as long as we remove that awful copy/paste from the frontends once we're done with the production migration, this is ok for me." [puppet] - 10https://gerrit.wikimedia.org/r/416665 (https://phabricator.wikimedia.org/T188623) (owner: 10Filippo Giunchedi) [16:39:43] and the others with lower main [16:39:43] db1066 will be gone soon hopefully from s1 anyways [16:39:56] we can discuss it at other point, I am just thinking aloud [16:40:02] I know :) [16:40:18] I don't think there is any imediate actionable [16:40:32] but I would have waited to pool the new host a bit more- specially for api [16:40:42] (03PS1) 10Vgutierrez: Fix MPReachNLRIAttribute generation [debs/pybal] - 10https://gerrit.wikimedia.org/r/416711 (https://phabricator.wikimedia.org/T165764) [16:41:59] marostegui: one last question [16:42:14] how did you build the new s5 host? [16:42:20] (03PS2) 10Filippo Giunchedi: wmflib: port role and nuyaml to hiera3 [puppet] - 10https://gerrit.wikimedia.org/r/416664 (https://phabricator.wikimedia.org/T188623) [16:42:22] which s5 host? [16:42:23] from where: backups, replica, etc.? 
[16:42:25] sorry [16:42:27] m5 [16:42:33] Ah, from the replica [16:43:07] I am worried only about 1 thing- the testreduce databases [16:43:20] I loaded them manually without much regard for consistency [16:43:39] so just FYI [16:44:09] at least the new master and the new replica will have the same data :) [16:44:19] well, the "same" [16:44:29] (03CR) 10Filippo Giunchedi: [C: 032] wmflib: port role and nuyaml to hiera3 [puppet] - 10https://gerrit.wikimedia.org/r/416664 (https://phabricator.wikimedia.org/T188623) (owner: 10Filippo Giunchedi) [16:44:38] !log sbisson@tin Finished deploy [kartotherian/deploy@255401a]: Testing update-deps2 branch (duration: 05m 47s) [16:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:10] (03CR) 10Mark Bergsma: [C: 031] Fix MPReachNLRIAttribute generation [debs/pybal] - 10https://gerrit.wikimedia.org/r/416711 (https://phabricator.wikimedia.org/T165764) (owner: 10Vgutierrez) [16:45:32] (03Abandoned) 10Andrew Bogott: Change the labvirt kvm test to allow for many more processes [puppet] - 10https://gerrit.wikimedia.org/r/416380 (owner: 10Andrew Bogott) [16:46:14] (03CR) 10Vgutierrez: [C: 032] Fix MPReachNLRIAttribute generation [debs/pybal] - 10https://gerrit.wikimedia.org/r/416711 (https://phabricator.wikimedia.org/T165764) (owner: 10Vgutierrez) [16:46:39] (03CR) 10Filippo Giunchedi: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/416665 (https://phabricator.wikimedia.org/T188623) (owner: 10Filippo Giunchedi) [16:46:45] (03Merged) 10jenkins-bot: Fix MPReachNLRIAttribute generation [debs/pybal] - 10https://gerrit.wikimedia.org/r/416711 (https://phabricator.wikimedia.org/T165764) (owner: 10Vgutierrez) [16:46:47] (03PS4) 10Filippo Giunchedi: Use hiera3 role/nuyaml backends on >= stretch [puppet] - 10https://gerrit.wikimedia.org/r/416665 (https://phabricator.wikimedia.org/T188623) [16:48:13] (03CR) 10Filippo Giunchedi: [C: 032] Use hiera3 role/nuyaml backends on >= stretch [puppet] - 
10https://gerrit.wikimedia.org/r/416665 (https://phabricator.wikimedia.org/T188623) (owner: 10Filippo Giunchedi) [16:48:36] 10Operations, 10ops-eqiad, 10Analytics-Cluster, 10Analytics-Kanban, and 2 others: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4027808 (10Cmjohnson) @ottomata: these needs installs if you have the spare cycles feel free. On-site work is done [16:51:59] (03PS1) 10Vgutierrez: Fix MPReachNLRIAttribute generation [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/416712 (https://phabricator.wikimedia.org/T165764) [16:52:05] (03PS3) 10Herron: initial commit of 4.4.0-1 [debs/puppetdb] (4.4.0-1) - 10https://gerrit.wikimedia.org/r/415591 [16:54:41] (03CR) 10Vgutierrez: [C: 032] Fix MPReachNLRIAttribute generation [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/416712 (https://phabricator.wikimedia.org/T165764) (owner: 10Vgutierrez) [16:55:19] (03Merged) 10jenkins-bot: Fix MPReachNLRIAttribute generation [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/416712 (https://phabricator.wikimedia.org/T165764) (owner: 10Vgutierrez) [16:55:23] (03CR) 10Herron: initial commit of 4.4.0-1 (032 comments) [debs/puppetdb] (4.4.0-1) - 10https://gerrit.wikimedia.org/r/415591 (owner: 10Herron) [16:55:26] (03PS7) 10Volans: Icinga: add sync check for MW config on etcd [puppet] - 10https://gerrit.wikimedia.org/r/413355 (https://phabricator.wikimedia.org/T182597) [16:55:28] (03PS7) 10Volans: Icinga: add EtcdConfig sync check on MW hosts [puppet] - 10https://gerrit.wikimedia.org/r/413356 (https://phabricator.wikimedia.org/T182597) [16:56:10] (03CR) 10jerkins-bot: [V: 04-1] Icinga: add EtcdConfig sync check on MW hosts [puppet] - 10https://gerrit.wikimedia.org/r/413356 (https://phabricator.wikimedia.org/T182597) (owner: 10Volans) [16:57:02] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. 
- https://phabricator.wikimedia.org/T186596#4027828 (10jcrespo) a:05jcrespo>03None [16:57:53] (03PS1) 10Ottomata: Fix spark2 oozie sharelib install command [puppet] - 10https://gerrit.wikimedia.org/r/416713 (https://phabricator.wikimedia.org/T159962) [16:58:22] (03PS2) 10Ottomata: Fix spark2 oozie sharelib install command [puppet] - 10https://gerrit.wikimedia.org/r/416713 (https://phabricator.wikimedia.org/T159962) [16:58:41] !log powering off rhenium to reset the idrac [16:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] godog, moritzm, and _joe_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180306T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:16] (03PS8) 10Volans: Icinga: add EtcdConfig sync check on MW hosts [puppet] - 10https://gerrit.wikimedia.org/r/413356 (https://phabricator.wikimedia.org/T182597) [17:00:35] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#4027837 (10jcrespo) p:05High>03Normal We have copied away everything we needed to keep- we are blocked on DC ops to do the ful... [17:01:30] (03PS1) 10Vgutierrez: Release 1.15.1: Fix MPReachNLRIAttribute generation [debs/pybal] - 10https://gerrit.wikimedia.org/r/416714 [17:01:38] (03CR) 10Ottomata: role::analytics_cluster::client: force remount of HDFS mountpoint (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/416442 (https://phabricator.wikimedia.org/T187073) (owner: 10Elukey) [17:01:45] (03CR) 10Ottomata: [C: 032] Fix spark2 oozie sharelib install command [puppet] - 10https://gerrit.wikimedia.org/r/416713 (https://phabricator.wikimedia.org/T159962) (owner: 10Ottomata) [17:02:02] (03CR) 10Volans: "Patch not anymore blocked, the MW-config ones were merged. 
I've changed slightly the bash script to loop over the etcd clusters (in each D" [puppet] - 10https://gerrit.wikimedia.org/r/413355 (https://phabricator.wikimedia.org/T182597) (owner: 10Volans) [17:02:05] PROBLEM - Host rhenium is DOWN: PING CRITICAL - Packet loss = 100% [17:04:08] mmmh, was it scheduled? ^^^ [17:04:45] RECOVERY - Host rhenium is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms [17:04:51] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1069" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416715 [17:05:06] (03PS1) 10Filippo Giunchedi: wmflib: fix role3_backend for no answer found [puppet] - 10https://gerrit.wikimedia.org/r/416716 (https://phabricator.wikimedia.org/T188623) [17:05:08] (03PS1) 10Filippo Giunchedi: puppetmaster: fix hiera3 config [puppet] - 10https://gerrit.wikimedia.org/r/416717 (https://phabricator.wikimedia.org/T188623) [17:06:21] (03CR) 10Volans: "Patch not anymore blocked, the MW-config related one was merged." [puppet] - 10https://gerrit.wikimedia.org/r/413356 (https://phabricator.wikimedia.org/T182597) (owner: 10Volans) [17:07:09] (03CR) 10Filippo Giunchedi: [C: 032] puppetmaster: fix hiera3 config [puppet] - 10https://gerrit.wikimedia.org/r/416717 (https://phabricator.wikimedia.org/T188623) (owner: 10Filippo Giunchedi) [17:07:17] (03CR) 10Filippo Giunchedi: [C: 032] wmflib: fix role3_backend for no answer found [puppet] - 10https://gerrit.wikimedia.org/r/416716 (https://phabricator.wikimedia.org/T188623) (owner: 10Filippo Giunchedi) [17:07:26] PROBLEM - Host rhenium is DOWN: PING CRITICAL - Packet loss = 100% [17:07:34] (03PS2) 10Filippo Giunchedi: puppetmaster: fix hiera3 config [puppet] - 10https://gerrit.wikimedia.org/r/416717 (https://phabricator.wikimedia.org/T188623) [17:07:48] (03PS2) 10Vgutierrez: Release 1.15.1: Fix MPReachNLRIAttribute generation [debs/pybal] - 10https://gerrit.wikimedia.org/r/416714 [17:08:03] what is rhenium? [17:08:41] netinsights? 
[17:08:49] network monitoring related [17:08:58] is it a vm? [17:09:00] (03PS2) 10Filippo Giunchedi: wmflib: fix role3_backend for no answer found [puppet] - 10https://gerrit.wikimedia.org/r/416716 (https://phabricator.wikimedia.org/T188623) [17:09:03] (so, not something directly in the flow of production living or dying) [17:09:03] or real host [17:09:17] real, I believe [17:09:17] (03CR) 10Ema: [C: 031] Release 1.15.1: Fix MPReachNLRIAttribute generation [debs/pybal] - 10https://gerrit.wikimedia.org/r/416714 (owner: 10Vgutierrez) [17:09:43] oh sorry [17:09:47] I just saw the log [17:09:55] RECOVERY - Host rhenium is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [17:10:14] (03CR) 10Vgutierrez: [C: 032] Release 1.15.1: Fix MPReachNLRIAttribute generation [debs/pybal] - 10https://gerrit.wikimedia.org/r/416714 (owner: 10Vgutierrez) [17:10:54] (03Merged) 10jenkins-bot: Release 1.15.1: Fix MPReachNLRIAttribute generation [debs/pybal] - 10https://gerrit.wikimedia.org/r/416714 (owner: 10Vgutierrez) [17:11:07] 10Operations, 10ops-eqiad: Check serial console of rhenium - https://phabricator.wikimedia.org/T188905#4027861 (10Cmjohnson) I rebooted the server checked the serial console, it was set to re-direction enabled...not sure if that was it but I can confirm it is fixed. Please consider replacing this server soon... 
[17:11:11] 10Operations, 10ops-eqiad: Check serial console of rhenium - https://phabricator.wikimedia.org/T188905#4027862 (10Cmjohnson) 05Open>03Resolved [17:12:19] (03PS1) 10Vgutierrez: Release 1.15.1: Fix MPReachNLRIAttribute generation [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/416719 [17:13:30] (03CR) 10Vgutierrez: [C: 032] Release 1.15.1: Fix MPReachNLRIAttribute generation [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/416719 (owner: 10Vgutierrez) [17:13:57] (03Merged) 10jenkins-bot: Release 1.15.1: Fix MPReachNLRIAttribute generation [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/416719 (owner: 10Vgutierrez) [17:15:17] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1069" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416715 (owner: 10Marostegui) [17:16:30] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1069" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416715 (owner: 10Marostegui) [17:17:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1069 after alter table (duration: 00m 58s) [17:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:15] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1069" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416715 (owner: 10Marostegui) [17:25:03] 10Operations, 10Ops-Access-Requests, 10Research, 10Research-collaborations, 10Research-management: Request access to data for Wikimedia Donation Patterns research - https://phabricator.wikimedia.org/T188945#4027934 (10Capt_Swing) a:05Capt_Swing>03None [17:25:26] (03PS9) 10Andrew Bogott: multiversion: add a transitional mapping for newwikitech.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415914 (https://phabricator.wikimedia.org/T168470) [17:27:56] 10Operations, 10Ops-Access-Requests, 10Research, 10Research-collaborations, 10Research-management: Request access to data for Wikimedia Donation Patterns research - 
https://phabricator.wikimedia.org/T188945#4027966 (10Capt_Swing) Added #ops-access-requests. Removed myself as assignee now that next steps... [17:28:39] !log uploaded pybal_1.15.1_all.deb to apt.wikimedia.org jessie-wikimedia [17:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:54] anyone touching db2057 ? [17:31:21] uptime 50, so probably not [17:32:24] !log update pybal to 1.15.1 on lvs1010 [17:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:17] false alarm, apparently [17:34:31] !log update pybal to 1.15.1 on lvs5003 [17:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:10] !log rebooting hassaleh for kernel security update [17:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:57] (03PS1) 10Matthias Mullie: Add 3d-patents page to wgForceUIMsgAsContentMsg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416730 [17:42:00] !log rebooting dbmonitor2001 for kernel security update [17:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:08] !log rebooting dbmonitor1001 for kernel security update [17:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:49] 10Operations, 10DBA, 10Patch-For-Review: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3174710 (10jcrespo) a:03jcrespo This is happening at the same time than T184696 [17:53:14] !log disabling puppet and apache on labpuppetmatser1001 and 1002 [17:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:58] !log starting branch cut for 1.31.0-wmf.24 [17:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:15] PROBLEM - puppetmaster https on labpuppetmaster1001 is CRITICAL: connect to address 208.80.154.158 and port 8140: Connection refused [17:55:45] 
PROBLEM - puppetmaster backend https on labpuppetmaster1001 is CRITICAL: connect to address 208.80.154.158 and port 8141: Connection refused [17:55:55] PROBLEM - puppetmaster backend https on labpuppetmaster1002 is CRITICAL: connect to address 208.80.155.120 and port 8141: Connection refused [17:57:26] PROBLEM - HHVM rendering on mw2131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:12] !log Reload haproxy on dbproxy1004 and dbproxy1009 [17:58:16] RECOVERY - HHVM rendering on mw2131 is OK: HTTP OK: HTTP/1.1 200 OK - 73667 bytes in 0.315 second response time [17:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:56] RECOVERY - haproxy failover on dbproxy1004 is OK: OK check_failover servers up 2 down 0 [17:59:25] RECOVERY - haproxy failover on dbproxy1009 is OK: OK check_failover servers up 2 down 0 [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180306T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:01:49] bearND: should we deploy the latest? [18:08:23] (03PS1) 10Volans: Revert "Use hiera3 role/nuyaml backends on >= stretch" [puppet] - 10https://gerrit.wikimedia.org/r/416734 [18:08:38] (03CR) 10jerkins-bot: [V: 04-1] Revert "Use hiera3 role/nuyaml backends on >= stretch" [puppet] - 10https://gerrit.wikimedia.org/r/416734 (owner: 10Volans) [18:08:52] (03CR) 10Filippo Giunchedi: [C: 031] Revert "Use hiera3 role/nuyaml backends on >= stretch" [puppet] - 10https://gerrit.wikimedia.org/r/416734 (owner: 10Volans) [18:09:23] no parsoid deploy today [18:10:16] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 2 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4028159 (10Smalyshev) @Cmjohnson ITYM March 7? 
[18:11:24] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 2 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4028160 (10Cmjohnson) yes march! I don't even know what month it is anymore..thx [18:11:42] bearND: looks like no reason not to. summary is already deployed at 1.3.4. [18:12:14] mdholloway: that's fine if you'd like to do it [18:12:22] 10Operations, 10ops-eqiad, 10Analytics-Cluster, 10Analytics-Kanban, and 2 others: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4028162 (10RobH) I'm working on getting these installed today. [18:12:29] bearND: sure, i'll do it now [18:12:47] (03PS1) 10RobH: new analytics hosts dns entries [dns] - 10https://gerrit.wikimedia.org/r/416736 (https://phabricator.wikimedia.org/T188294) [18:13:23] (03PS1) 10Filippo Giunchedi: Revert: Use hiera3 role/nuyaml backends on >= stretch [puppet] - 10https://gerrit.wikimedia.org/r/416737 [18:14:39] (03CR) 10RobH: [C: 032] new analytics hosts dns entries [dns] - 10https://gerrit.wikimedia.org/r/416736 (https://phabricator.wikimedia.org/T188294) (owner: 10RobH) [18:16:20] (03CR) 10Dzahn: [C: 031] Enable reusable TC on HHVM on canary appservers [puppet] - 10https://gerrit.wikimedia.org/r/414876 (https://phabricator.wikimedia.org/T103886) (owner: 10Chad) [18:19:11] (03PS5) 10Ppchelko: Remove special jobrunners for refreshLinks and htmlCacheUpdate. 
[puppet] - 10https://gerrit.wikimedia.org/r/416481 (https://phabricator.wikimedia.org/T185052) [18:20:05] (03CR) 10Volans: [C: 031] "compiler: https://puppet-compiler.wmflabs.org/compiler02/10288/labpuppetmaster1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/416737 (owner: 10Filippo Giunchedi) [18:20:23] (03CR) 10Filippo Giunchedi: [C: 032] Revert: Use hiera3 role/nuyaml backends on >= stretch [puppet] - 10https://gerrit.wikimedia.org/r/416737 (owner: 10Filippo Giunchedi) [18:22:54] !log puppet-merge Revert: Use hiera3 role/nuyaml backends on >= stretch [18:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:58] where is mediawiki_singlenode used i wonder [18:25:14] nowhere in prod, probably in WMCS [18:26:16] RECOVERY - puppetmaster https on labpuppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 399 bytes in 0.036 second response time [18:26:55] RECOVERY - puppetmaster backend https on labpuppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 398 bytes in 0.020 second response time [18:27:05] RECOVERY - puppetmaster backend https on labpuppetmaster1002 is OK: HTTP OK: Status line output matched 400 - 398 bytes in 0.032 second response time [18:27:26] (03Abandoned) 10Volans: Revert "Use hiera3 role/nuyaml backends on >= stretch" [puppet] - 10https://gerrit.wikimedia.org/r/416734 (owner: 10Volans) [18:30:17] (03PS1) 10Lucas Werkmeister (WMDE): Load Wikibase Quality extensions using extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416739 (https://phabricator.wikimedia.org/T106104) [18:32:39] !log mholloway-shell@tin Started deploy [mobileapps/deploy@5986ab7]: Update mobileapps to afbe9af [18:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:51] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4028238 
(10phuedx) I've reached out to SRE and Services to clarify where we are with deploying the mediawiki-services-chromium-render and wha... [18:34:03] (03PS1) 10Ppchelko: Enable grafana alerts for jobqueue-eventbus dashboard. [puppet] - 10https://gerrit.wikimedia.org/r/416740 (https://phabricator.wikimedia.org/T189038) [18:37:30] (03PS1) 10Jdlrobson: Re-enable Wikidata descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416741 (https://phabricator.wikimedia.org/T188182) [18:37:33] (03PS5) 10Imarlier: NavigtationTiming: Enable oversampling for Singapore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) [18:38:07] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@5986ab7]: Update mobileapps to afbe9af (duration: 05m 28s) [18:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:36] (03CR) 10Imarlier: NavigtationTiming: Enable oversampling for Singapore (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) (owner: 10Imarlier) [18:39:53] (03PS1) 10Dzahn: superset: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/416742 [18:40:17] (03CR) 10jerkins-bot: [V: 04-1] superset: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/416742 (owner: 10Dzahn) [18:43:27] (03PS2) 10Dzahn: superset: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/416742 [18:45:12] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4028292 (10ggellerman) [18:48:43] (03PS13) 10Imarlier: coal: Process from Kafka instead of from ZMQ [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) [18:49:02] (03CR) 10Mobrovac: [C: 031] Enable grafana alerts for jobqueue-eventbus dashboard. 
[puppet] - 10https://gerrit.wikimedia.org/r/416740 (https://phabricator.wikimedia.org/T189038) (owner: 10Ppchelko) [18:51:48] 10Operations, 10Packaging, 10Scap: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4028393 (10mmodell) [18:53:10] (03CR) 10Imarlier: "> According to that diff, [Installer] is removed from" [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [18:55:05] (03PS1) 10Lucas Werkmeister (WMDE): Enable caching of constraint check results [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416748 (https://phabricator.wikimedia.org/T184812) [18:56:31] (03CR) 10Chad: [C: 031] "Lgtm." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915) (owner: 10Andrew Bogott) [18:57:42] (03CR) 10Chad: [C: 04-1] wikitech: use files from swift rather than local uploads. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915) (owner: 10Andrew Bogott) [18:58:24] (03CR) 10Chad: [C: 04-1] wikitech: use files from swift rather than local uploads. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915) (owner: 10Andrew Bogott) [18:58:58] 10Operations, 10Packaging, 10Scap: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4028541 (10mmodell) @akosiaris: What would it take to get the git-lfs package back-ported to stretch? It's written in go, however, I am unsure if it will work with the ve... [19:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180306T1900) [19:00:05] No GERRIT patches in the queue for this window AFAICS. 
[19:00:28] (03PS1) 10Dzahn: bugzilla_static: replace comment about ::apache class [puppet] - 10https://gerrit.wikimedia.org/r/416749 [19:00:52] (03CR) 10jerkins-bot: [V: 04-1] bugzilla_static: replace comment about ::apache class [puppet] - 10https://gerrit.wikimedia.org/r/416749 (owner: 10Dzahn) [19:03:52] (03PS1) 10Dzahn: noc: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/416751 [19:04:40] (03CR) 10Dzahn: [C: 04-1] "apache::def does not have corresponding httpd:def, needs to be replaced by httpd::conf instead" [puppet] - 10https://gerrit.wikimedia.org/r/416751 (owner: 10Dzahn) [19:05:34] (03PS2) 10Dzahn: bugzilla_static: replace comment about apache class [puppet] - 10https://gerrit.wikimedia.org/r/416749 [19:06:03] (03CR) 10jerkins-bot: [V: 04-1] bugzilla_static: replace comment about apache class [puppet] - 10https://gerrit.wikimedia.org/r/416749 (owner: 10Dzahn) [19:06:06] (03CR) 10Dzahn: [C: 032] "README-only" [puppet] - 10https://gerrit.wikimedia.org/r/416749 (owner: 10Dzahn) [19:06:53] (03CR) 10Andrew Bogott: wikitech: use files from swift rather than local uploads. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915) (owner: 10Andrew Bogott) [19:07:14] (03CR) 10Dzahn: [C: 032] "lol, jenkins "Line 1: Do not define bug in the header" this is about Bugzilla :p" [puppet] - 10https://gerrit.wikimedia.org/r/416749 (owner: 10Dzahn) [19:08:53] (03CR) 10Dzahn: [V: 032 C: 032] bugzilla_static: replace comment about apache class [puppet] - 10https://gerrit.wikimedia.org/r/416749 (owner: 10Dzahn) [19:22:35] (03CR) 1020after4: "I think Giuseppe's right. 7.2 is a clear win. Wouldn't it just be a matter of importing the packages from https://packages.sury.org/php/ " [puppet] - 10https://gerrit.wikimedia.org/r/415856 (owner: 10Muehlenhoff) [19:23:15] (03CR) 1020after4: "sury.org is the closest you can get to an official backport, I believe." 
[puppet] - 10https://gerrit.wikimedia.org/r/415856 (owner: 10Muehlenhoff) [19:25:45] RECOVERY - eventlogging_sync processes on db1108 is OK: PROCS OK: 1 process with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [19:32:25] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning. [19:37:37] (03PS1) 10RobH: set new analytics hosts to role spare [puppet] - 10https://gerrit.wikimedia.org/r/416757 (https://phabricator.wikimedia.org/T188294) [19:37:53] (03PS2) 10RobH: set new analytics hosts to role spare [puppet] - 10https://gerrit.wikimedia.org/r/416757 (https://phabricator.wikimedia.org/T188294) [19:38:04] (03CR) 10RobH: [C: 032] set new analytics hosts to role spare [puppet] - 10https://gerrit.wikimedia.org/r/416757 (https://phabricator.wikimedia.org/T188294) (owner: 10RobH) [19:38:22] (03CR) 10jerkins-bot: [V: 04-1] set new analytics hosts to role spare [puppet] - 10https://gerrit.wikimedia.org/r/416757 (https://phabricator.wikimedia.org/T188294) (owner: 10RobH) [19:39:08] !log thcipriani@tin Started scap: testwiki to php-1.31.0-wmf.24 and rebuild l10n cache [19:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:41] (03PS3) 10RobH: set new analytics hosts to role spare [puppet] - 10https://gerrit.wikimedia.org/r/416757 (https://phabricator.wikimedia.org/T188294) [19:40:08] 10Operations, 10Analytics-Kanban: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4028840 (10elukey) It seems that we were (kind of) lucky. For some reason that we don't know (it predates most of us), the tables on the slave d... 
[19:40:12] (03CR) 10jerkins-bot: [V: 04-1] set new analytics hosts to role spare [puppet] - 10https://gerrit.wikimedia.org/r/416757 (https://phabricator.wikimedia.org/T188294) (owner: 10RobH) [19:40:47] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3331578 (10Tgr) [19:40:49] (03CR) 10Gergő Tisza: [C: 04-2] "See task." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416708 (https://phabricator.wikimedia.org/T189022) (owner: 10Zoranzoki21) [19:41:27] 10Operations, 10Analytics-Kanban: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4028849 (10Ottomata) For posterity, here's the script I used: https://gist.github.com/ottomata/d69ba72313c44e8e45e6453f4ea97074 [19:41:40] 10Operations, 10Analytics-Kanban: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4028858 (10elukey) [19:42:24] (03PS4) 10RobH: set new analytics hosts to role spare [puppet] - 10https://gerrit.wikimedia.org/r/416757 (https://phabricator.wikimedia.org/T188294) [19:42:28] 10Operations, 10Analytics-Kanban: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4026711 (10elukey) Before closing this task: 1) review the m4-master failover policy. 
2) document this procedure on wikitech [19:43:26] (03CR) 1020after4: [C: 032] scap sync-canary plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413640 (owner: 1020after4) [19:43:28] (03CR) 10RobH: [C: 032] set new analytics hosts to role spare [puppet] - 10https://gerrit.wikimedia.org/r/416757 (https://phabricator.wikimedia.org/T188294) (owner: 10RobH) [19:44:42] (03Merged) 10jenkins-bot: scap sync-canary plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413640 (owner: 1020after4) [19:48:02] (03CR) 10jenkins-bot: scap sync-canary plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413640 (owner: 1020after4) [19:57:41] (03CR) 10BryanDavis: wikitech: use files from swift rather than local uploads. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915) (owner: 10Andrew Bogott) [20:00:04] thcipriani: #bothumor I � Unicode. All rise for MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180306T2000). [20:00:04] No GERRIT patches in the queue for this window AFAICS. [20:00:19] * thcipriani working on it [20:05:32] no_justification: Question about gerrit… If we rewrite history with a git-lfs conversion, will pushing that new history to gerrit have the desired effect—is it going to drop the old history? [20:06:25] PROBLEM - HHVM rendering on mw2229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:06:39] awight: I think for the most part, yes. no_justification is sick today, not sure if he'll respond. [20:07:14] twentyafterfour: oops, thanks for the note. [20:07:15] RECOVERY - HHVM rendering on mw2229 is OK: HTTP OK: HTTP/1.1 200 OK - 73724 bytes in 0.295 second response time [20:07:29] Cool, I’ll just tag this question onto our request. 
[20:08:21] !log thcipriani@tin Finished scap: testwiki to php-1.31.0-wmf.24 and rebuild l10n cache (duration: 29m 13s) [20:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:29] awight: yes. Existing clones won't be able to fetch (have to reclone) and existing code reviews would point to wrong sha1s. [20:11:03] no_justification: ooh funky. OK thanks. Feel better! halfak ^ [20:11:03] The latter problem can be fixed but it's very laborious [20:13:15] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, 10Patch-For-Review: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#4029020 (10awight) We're currently thinking that we want to normalize our repo locations in gerrit, and introduce git-lfs in the new locations.... [20:13:43] 10Operations, 10ops-eqiad: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#3784283 (10chasemp) > The following reproduces successfully > > * Heavy IO on a DRBD backed device in a VM. Is this on 4.4 or only 4.9? [20:15:19] (03CR) 10Andrew Bogott: wikitech: use files from swift rather than local uploads. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915) (owner: 10Andrew Bogott) [20:18:26] 10Operations, 10cloud-services-team, 10Upstream: New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS - https://phabricator.wikimedia.org/T169290#4029040 (10chasemp) I wonder if related {T181121} [20:21:43] (03PS1) 10Ottomata: Remove Kafka analytics-eqiad webrequest camus and kafkatee instances [puppet] - 10https://gerrit.wikimedia.org/r/416761 (https://phabricator.wikimedia.org/T185136) [20:22:05] PROBLEM - cassandra-a SSL 10.64.48.168:7001 on restbase-dev1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [20:22:28] that's me ^^^ [20:22:45] PROBLEM - cassandra-a service on restbase-dev1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [20:22:55] PROBLEM - cassandra-b SSL 10.64.48.169:7001 on restbase-dev1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [20:23:03] (03PS2) 10Ottomata: Remove Kafka analytics-eqiad webrequest camus and kafkatee instances [puppet] - 10https://gerrit.wikimedia.org/r/416761 (https://phabricator.wikimedia.org/T185136) [20:24:07] 10Operations, 10Ops-Access-Requests, 10Research, 10Research-collaborations, 10Research-management: Request access to data for Wikimedia Donation Patterns research - https://phabricator.wikimedia.org/T188945#4029062 (10DYNKM) That'd be: > ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC28nrvknbeIAlF31jJCw1ucjaT... 
[20:26:20] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/10290/" [puppet] - 10https://gerrit.wikimedia.org/r/416761 (https://phabricator.wikimedia.org/T185136) (owner: 10Ottomata) [20:26:22] (03CR) 10Ottomata: [C: 032] Remove Kafka analytics-eqiad webrequest camus and kafkatee instances [puppet] - 10https://gerrit.wikimedia.org/r/416761 (https://phabricator.wikimedia.org/T185136) (owner: 10Ottomata) [20:26:56] 10Operations, 10Gerrit, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#4029065 (10awight) [20:33:10] (03PS1) 10Andrew Bogott: labs common.yaml: reformat k8s_infrastructure_users data [labs/private] - 10https://gerrit.wikimedia.org/r/416763 [20:33:17] (03PS4) 10Ottomata: Point Mediawiki Monolog at Kafka jumbo in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413796 (https://phabricator.wikimedia.org/T188136) [20:33:35] (03CR) 10Andrew Bogott: [V: 032 C: 032] labs common.yaml: reformat k8s_infrastructure_users data [labs/private] - 10https://gerrit.wikimedia.org/r/416763 (owner: 10Andrew Bogott) [20:33:45] PROBLEM - Host restbase-dev1006 is DOWN: PING CRITICAL - Packet loss = 100% [20:34:54] phabricator reboot is coming. 
very short downtime [20:35:15] !log pointing mediawiki monolog kafka producers at kafka jumbo-eqiad cluster: T188136 [20:35:24] (03CR) 10Ottomata: [C: 032] Point Mediawiki Monolog at Kafka jumbo in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413796 (https://phabricator.wikimedia.org/T188136) (owner: 10Ottomata) [20:35:25] ACKNOWLEDGEMENT - Host restbase-dev1006 is DOWN: PING CRITICAL - Packet loss = 100% eevans Rebooting [20:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:30] T188136: Migrate Mediawiki Monolog Kafka producer to Kafka Jumbo - https://phabricator.wikimedia.org/T188136 [20:36:19] !log phab1001 (phabricator) - rebooting for maintenance [20:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:42] ottomata: are you merging things on mediawiki-config right now? [20:37:07] thcipriani: i was ya, i can revert, was following https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment#Change_wiki_configuration [20:37:15] noticed a scap change too [20:37:22] was looking at git logs to see who to poke [20:38:00] thcipriani: should I revert? [20:38:04] yeah, if it's not urgent I'd rather not push it out at the moment, deploying a new train version at the moment. Change freaked me out a bit :) [20:38:09] sure [20:38:13] reverting, not urgent [20:38:20] cool, thank you :) [20:38:29] (03PS1) 10Ottomata: Revert "Point Mediawiki Monolog at Kafka jumbo in production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416764 [20:38:45] (03CR) 10Ottomata: [V: 032 C: 032] Revert "Point Mediawiki Monolog at Kafka jumbo in production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416764 (owner: 10Ottomata) [20:38:46] I'll poke you when I'm done here. 
(hopefully not too much longer) [20:38:55] RECOVERY - Host restbase-dev1006 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [20:39:15] (03CR) 10jenkins-bot: Point Mediawiki Monolog at Kafka jumbo in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413796 (https://phabricator.wikimedia.org/T188136) (owner: 10Ottomata) [20:39:28] k [20:40:23] thcipriani: added https://wikitech.wikimedia.org/w/index.php?title=Heterogeneous_deployment&type=revision&diff=1784661&oldid=1781051 :) [20:41:10] heh, thanks :) [20:41:57] phab is back [20:44:29] !log reverted change to point mediawiki monolog kafka producers at kafka jumbo-eqiad until deployment train is done T188136 [20:44:41] (03PS1) 10Thcipriani: Group0 to 1.31.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416767 [20:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:46] T188136: Migrate Mediawiki Monolog Kafka producer to Kafka Jumbo - https://phabricator.wikimedia.org/T188136 [20:45:38] (03CR) 10Thcipriani: [C: 032] Group0 to 1.31.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416767 (owner: 10Thcipriani) [20:47:11] (03Merged) 10jenkins-bot: Group0 to 1.31.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416767 (owner: 10Thcipriani) [20:48:46] mutante: if you find yourself Not Busy/bored, let me know :) [20:49:16] !log thcipriani@tin rebuilt and synchronized wikiversions files: Group0 to 1.31.0-wmf.24 [20:49:18] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint, 10Patch-For-Review, 10User-Smalyshev: Activate kafka-based recent change poller for wikidata query service - https://phabricator.wikimedia.org/T188252#4029107 (10Smalyshev) 05Open>03Resolved [20:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:53] 10Operations, 10ops-eqiad, 10User-Eevans: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T185494#3917526 (10Eevans) I removed sdc1, sdc2, and sdc3 from 
md0, md1, and md2 respectively, and rebooted believing that might be the easiest way to correct the device ordering (the new drive sho... [20:51:02] mutante: basically, that ^^^ [20:52:09] 10Operations, 10ops-eqiad, 10User-Eevans: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T185494#4029114 (10Eevans) >>! In T185494#4029109, @Eevans wrote: > I removed sdc1, sdc2, and sdc3 from md0, md1, and md2 respectively, and rebooted believing that might be the easiest way to corre... [20:54:10] ottomata: I'm done on tin, FYI, if you want to deploy your config change [20:55:58] 10Operations, 10Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4029124 (10Niedzielski) BC continues to be very slow. 503s appear to always emit from deployment-cache-text04 deployment-cache-text04: ``` Request from 73.252.38.252 via dep... [21:10:02] (03CR) 10jenkins-bot: Revert "Point Mediawiki Monolog at Kafka jumbo in production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416764 (owner: 10Ottomata) [21:10:17] (03CR) 10Bstorm: "If we are good with me keeping the somewhat excessive comments for now, I'll merge after another plus one. Then I'll have to coordinate a" [puppet] - 10https://gerrit.wikimedia.org/r/416496 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [21:10:30] (03CR) 10jenkins-bot: Group0 to 1.31.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416767 (owner: 10Thcipriani) [21:16:57] (03CR) 10Legoktm: [C: 031] Load Wikibase Quality extensions using extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416739 (https://phabricator.wikimedia.org/T106104) (owner: 10Lucas Werkmeister (WMDE)) [21:17:30] urandom: looking now [21:19:49] mutante: thanks! [21:20:13] Login incorrect. 
[21:20:13] Give root password for maintenance [21:20:13] (or type Control-D to continue): [21:20:22] uh oh [21:21:16] mutante: it's probably not worth spending a bunch of time on, if it's bad it can be reimaged [21:21:45] (03PS1) 10Rush: wmcs: update wmcs-team contacts for wmcs respective [puppet] - 10https://gerrit.wikimedia.org/r/416844 (https://phabricator.wikimedia.org/T178405) [21:21:47] PROBLEM - Host restbase-dev1006 is DOWN: PING CRITICAL - Packet loss = 100% [21:22:02] urandom: ok, i'm trying one powercycle to see the full thing [21:22:06] k [21:22:27] mutante: either way i am curious where i went wrong [21:22:50] (03CR) 10Rush: [C: 032] wmcs: update wmcs-team contacts for wmcs respective [puppet] - 10https://gerrit.wikimedia.org/r/416844 (https://phabricator.wikimedia.org/T178405) (owner: 10Rush) [21:22:56] (03PS2) 10Rush: wmcs: update wmcs-team contacts for wmcs respective [puppet] - 10https://gerrit.wikimedia.org/r/416844 (https://phabricator.wikimedia.org/T178405) [21:22:58] !log restbase-dev1006 powercycled via console (T185494) [21:23:01] all i did was do a fail/remove of the disk that had already failed and been removed [21:23:09] and rebooted [21:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:12] T185494: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T185494 [21:23:35] i see GRUB now [21:24:32] it's coming up .. then [21:24:47] "A start job is running for dev-mapper-restbase ... [21:25:18] dev-mapper-restbase\x2d\x...12 s / 1min 30s) [21:25:57] [ TIME ] Timed out waiting for device dev-mapper-restbase\x2d...\x2dsrv.device. [21:26:00] [DEPEND] Dependency failed for /srv. [21:26:02] [DEPEND] Dependency failed for Local File Systems. [21:26:05] [DEPEND] Dependency failed for File System Check on /dev/mapp...ev1006--vg-srv. 
[21:26:08] that the volume that sits on md2 (a raid0 made up of sda3, sdb3, sdc3 (failed/replaced), and sd3) [21:26:09] and then it boots to single user mode [21:26:27] uh, ok [21:27:03] i guess a reboot would have set into fits regardless of the other business [21:27:30] i think it should be reinstalled after the disk changes, yea [21:27:47] mutante: yeah, we've done this both ways in the past [21:28:18] mutante: and i was going to see about fixing it only because i can't re-image, and SRE-time is hard to come by atm :) [21:28:23] Control-D to continue .. doesnt work for me [21:28:29] mutante: no worries [21:28:33] ok [21:28:39] it's not hurting anything to be down [21:28:48] alright, should i just boot it into PXE ? [21:28:59] ¯\_(ツ)_/¯ [21:29:20] mutante: to signal it's ready for re-image? [21:29:28] to run the Debian installer [21:29:30] no sure what the implication of that would be [21:29:36] but i might cause more confusion in the work flow [21:29:42] so let's just shut it down and ACK [21:29:52] i have it under scheduled maintenance now [21:30:01] ok, cool [21:30:12] mutante: thanks for the help! [21:30:21] you're welcome [21:31:20] 10Operations, 10ops-eqiad, 10User-Eevans: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T185494#3917526 (10Dzahn) the console was showing "root password for maintenance (or type Control-D to continue): " I tried one powercyle and i saw: ``` Starting Activation of LVM2 logi... [21:32:01] maybe the mdadm --fail and --remove should have hapened for the drive itself was swapped out? [21:32:08] s/for/before/ [21:35:03] how to get out of HP console: ESC + ( [21:35:15] takes too long each time :) [21:37:49] urandom: do you know what's up with restbase-dev1004 and 1005 too? 
[21:38:16] i see an icinga alert for port 7231 - not listening [21:38:20] (03PS1) 10RobH: analytics107[0-7] corrections [puppet] - 10https://gerrit.wikimedia.org/r/416848 (https://phabricator.wikimedia.org/T188294) [21:38:33] restbase isn't running, i guess [21:38:49] that cluster needs to be redone entirely [21:38:58] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [21:39:06] the code running on it was some test version as we worked through the new storage strategy [21:39:12] (03CR) 10RobH: [C: 032] analytics107[0-7] corrections [puppet] - 10https://gerrit.wikimedia.org/r/416848 (https://phabricator.wikimedia.org/T188294) (owner: 10RobH) [21:39:12] would a real solution maybe be "if there is 'dev' in the name, then don't add icinga checks" ? [21:39:19] yeah [21:39:22] definitely [21:39:25] ok, i think that's possible :) [21:39:42] nova creation issue is known [21:39:48] thanks [21:40:19] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [21:40:19] ACKNOWLEDGEMENT - Restbase root url on restbase-dev1004 is CRITICAL: connect to address 10.64.0.89 and port 7231: Connection refused daniel_zahn its DEV [21:40:19] ACKNOWLEDGEMENT - Restbase root url on restbase-dev1005 is CRITICAL: connect to address 10.64.16.96 and port 7231: Connection refused daniel_zahn its DEV [21:40:22] adds "persistent comment" [21:47:13] (03PS1) 10Herron: Use hiera3 role/nuyaml backends on >= stretch [puppet] - 10https://gerrit.wikimedia.org/r/416850 (https://phabricator.wikimedia.org/T188623) [21:47:37] 10Operations, 10monitoring: restbase: skip icinga monitoring if on "dev" machines - https://phabricator.wikimedia.org/T189050#4029186 (10Dzahn) [21:47:45] (03CR) 10jerkins-bot: [V: 04-1] Use hiera3 role/nuyaml backends on >= stretch [puppet] - 10https://gerrit.wikimedia.org/r/416850 (https://phabricator.wikimedia.org/T188623) 
(owner: 10Herron) [21:52:43] (03PS2) 10Herron: Use hiera3 role/nuyaml backends on >= stretch [puppet] - 10https://gerrit.wikimedia.org/r/416850 (https://phabricator.wikimedia.org/T188623) [21:53:21] (03PS3) 10Rush: wmcs: Notify legoktm for codesearch alerts [puppet] - 10https://gerrit.wikimedia.org/r/415178 (owner: 10Legoktm) [21:53:27] PROBLEM - HHVM rendering on mw2120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:54:18] RECOVERY - HHVM rendering on mw2120 is OK: HTTP OK: HTTP/1.1 200 OK - 73728 bytes in 0.322 second response time [21:54:39] (03CR) 10Rush: [C: 032] wmcs: Notify legoktm for codesearch alerts [puppet] - 10https://gerrit.wikimedia.org/r/415178 (owner: 10Legoktm) [21:57:11] (03PS1) 10Andrew Bogott: dns labsaliaser: reload lua script whenever it's updated. [puppet] - 10https://gerrit.wikimedia.org/r/416852 (https://phabricator.wikimedia.org/T188619) [21:57:34] (03PS1) 10RobH: fixing dns typos [dns] - 10https://gerrit.wikimedia.org/r/416854 [21:57:37] (03CR) 10Rush: "cheers and sorry idk what I was seeing the first time around" [puppet] - 10https://gerrit.wikimedia.org/r/415178 (owner: 10Legoktm) [21:57:42] (03CR) 10jerkins-bot: [V: 04-1] dns labsaliaser: reload lua script whenever it's updated. [puppet] - 10https://gerrit.wikimedia.org/r/416852 (https://phabricator.wikimedia.org/T188619) (owner: 10Andrew Bogott) [21:57:54] (03CR) 10RobH: [C: 032] fixing dns typos [dns] - 10https://gerrit.wikimedia.org/r/416854 (owner: 10RobH) [21:58:57] (03PS2) 10Andrew Bogott: dns labsaliaser: reload lua script whenever it's updated. [puppet] - 10https://gerrit.wikimedia.org/r/416852 (https://phabricator.wikimedia.org/T188619) [22:00:44] (03CR) 10BryanDavis: dns labsaliaser: reload lua script whenever it's updated. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/416852 (https://phabricator.wikimedia.org/T188619) (owner: 10Andrew Bogott) [22:01:37] (03PS3) 10Andrew Bogott: dns labsaliaser: reload lua script whenever it's updated. 
[puppet] - 10https://gerrit.wikimedia.org/r/416852 (https://phabricator.wikimedia.org/T188619) [22:10:18] (03CR) 10Chad: [C: 04-1] wikitech: use files from swift rather than local uploads. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915) (owner: 10Andrew Bogott) [22:10:44] (03CR) 10Chad: [C: 032] scap prep: Scap-ify the creation of beta's StartProfiler.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416334 (https://phabricator.wikimedia.org/T180766) (owner: 10Krinkle) [22:10:57] (03CR) 10Chad: [C: 031] scap prep: Scap-ify the creation of beta's StartProfiler.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416334 (https://phabricator.wikimedia.org/T180766) (owner: 10Krinkle) [22:15:42] 10Operations, 10ops-eqiad, 10Analytics-Cluster, 10Analytics-Kanban, and 2 others: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4029237 (10RobH) Ok, All systems are installed with stretch. I need to reinstall analytics1076, as it had the wrong hostname set by a bad rever... [22:16:30] 10Operations, 10ops-eqiad, 10Analytics-Cluster, 10Analytics-Kanban, 10User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4029238 (10RobH) [22:16:36] (03CR) 10Herron: "This is meant to accomplish the same goal as reverted change https://gerrit.wikimedia.org/r/416665" [puppet] - 10https://gerrit.wikimedia.org/r/416850 (https://phabricator.wikimedia.org/T188623) (owner: 10Herron) [22:16:51] 10Operations, 10ops-eqiad, 10Analytics-Cluster, 10Analytics-Kanban, 10User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4002756 (10RobH) ping @elukey: You can take over on ALL but analytics1076. I need to keep working on it for now. 
[22:17:10] (03PS3) 10Herron: Use hiera3 role/nuyaml backends on >= stretch [puppet] - 10https://gerrit.wikimedia.org/r/416850 (https://phabricator.wikimedia.org/T188623) [22:37:38] (03PS1) 10BryanDavis: labs dns: add some docs for labs-ip-alias-dump [puppet] - 10https://gerrit.wikimedia.org/r/416860 [22:42:57] 10Operations: import prometheus-memcached-exporter into wikimedia-stretch - https://phabricator.wikimedia.org/T189056#4029367 (10Paladox) [22:54:09] 10Operations, 10ops-eqsin, 10Traffic, 10netops: replace eqsin SFP-T/SFP+ - https://phabricator.wikimedia.org/T188923#4029421 (10Papaul) [x] On the device labeled cr1-eqsin, Juniper MX104, top of rack 603, please replace the 4 optics present in the embedded ports (aka not in modules) labeled xe-2/0/0 to xe-... [22:59:22] 10Operations, 10ops-eqsin, 10Traffic: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157#4029425 (10Papaul) All the normal troubleshooting was done on the server. - Unplugging the power - Removing the PSU's for 15 minutes while working on the router Server will not power on. [23:00:02] (03PS1) 10Madhuvishy: dumps: Switch rsyncer profile to use host settings from hiera [puppet] - 10https://gerrit.wikimedia.org/r/416863 (https://phabricator.wikimedia.org/T188726) [23:00:04] MaxSem: My dear minions, it's time we take the moon! Just kidding. Time for Run maintenance script deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180306T2300). [23:00:06] No GERRIT patches in the queue for this window AFAICS. [23:00:37] !log dumping centralauth.spoofuser from db1094 [23:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:51] 10Operations: import prometheus-memcached-exporter into wikimedia-stretch - https://phabricator.wikimedia.org/T189056#4029429 (10Dzahn) The class that includs this is: ``` modules/profile/manifests/memcached/instance.pp: include ::profile::prometheus::memcached_exporter ``` And that is included by: ``` m... 
[23:03:20] 10Operations, 10ops-codfw, 10User-Elukey: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4029430 (10Papaul) All the servers were racked and wired before leaving Dallas. Will have to unrack them and re-wiring them. [23:05:23] !log refreshing spoofuser [23:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:20] 10Operations, 10cloud-services-team, 10Upstream: New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS - https://phabricator.wikimedia.org/T169290#4029431 (10chasemp) I talked to someone in `#drbd` (lge a dev I think) who said they have no reason to think there would be an issue with 4.4 or 4.9 ker... [23:10:58] !log cancelled [23:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:14] 10Operations: import prometheus-memcached-exporter into wikimedia-stretch - https://phabricator.wikimedia.org/T189056#4029445 (10Paladox) [23:12:50] (03PS1) 10Madhuvishy: dumps: Move rsyncer to distribution profile path and rename [puppet] - 10https://gerrit.wikimedia.org/r/416866 (https://phabricator.wikimedia.org/T188726) [23:16:28] 10Operations, 10ops-eqsin, 10Traffic, 10netops: replace eqsin SFP-T/SFP+ - https://phabricator.wikimedia.org/T188923#4029452 (10Papaul) [23:17:31] 10Operations, 10ops-eqsin, 10Traffic, 10netops: replace eqsin SFP-T/SFP+ - https://phabricator.wikimedia.org/T188923#4024207 (10Papaul) [23:18:16] 10Operations, 10ops-eqsin, 10Traffic, 10netops: replace eqsin SFP-T/SFP+ - https://phabricator.wikimedia.org/T188923#4024207 (10Papaul) [] On the device labeled asw-0603-eqsin, Juniper EX4600, rack 603, please replace the SFP-T (copper SFPs) present in ports 12, 14 and 23 with the QFX-SFP-1GE-T transceiver... 
[23:22:46] (03CR) 10Madhuvishy: "https://puppet-compiler.wmflabs.org/compiler02/10295/" [puppet] - 10https://gerrit.wikimedia.org/r/416863 (https://phabricator.wikimedia.org/T188726) (owner: 10Madhuvishy) [23:26:46] 10Operations, 10ops-eqsin, 10netops: return faulty MX104 to Juniper - https://phabricator.wikimedia.org/T189060#4029482 (10ayounsi) [23:28:14] (03CR) 10Madhuvishy: "https://puppet-compiler.wmflabs.org/compiler02/10296/" [puppet] - 10https://gerrit.wikimedia.org/r/416866 (https://phabricator.wikimedia.org/T188726) (owner: 10Madhuvishy) [23:29:13] (03PS1) 10Madhuvishy: dumps: Split up rsync config to base, mirrors, and datasets [puppet] - 10https://gerrit.wikimedia.org/r/416869 (https://phabricator.wikimedia.org/T188726) [23:29:57] (03CR) 10jerkins-bot: [V: 04-1] dumps: Split up rsync config to base, mirrors, and datasets [puppet] - 10https://gerrit.wikimedia.org/r/416869 (https://phabricator.wikimedia.org/T188726) (owner: 10Madhuvishy) [23:32:02] ACKNOWLEDGEMENT - puppet last run on labtestvirt2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues daniel_zahn TEST [23:38:46] (03PS2) 10Madhuvishy: dumps: Split up rsync config to base, mirrors, and datasets [puppet] - 10https://gerrit.wikimedia.org/r/416869 (https://phabricator.wikimedia.org/T188726) [23:39:29] (03CR) 10jerkins-bot: [V: 04-1] dumps: Split up rsync config to base, mirrors, and datasets [puppet] - 10https://gerrit.wikimedia.org/r/416869 (https://phabricator.wikimedia.org/T188726) (owner: 10Madhuvishy) [23:57:01] (03CR) 10Krinkle: [C: 031] Beta: Cron to update wmf-config every 3 minutes [puppet] - 10https://gerrit.wikimedia.org/r/414893 (owner: 10Chad)
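Note on the 21:39:12 suggestion ("if there is 'dev' in the name, then don't add icinga checks", later filed as T189050): a minimal sketch of such a host-name test, written in Python purely for illustration. The function name `should_monitor` and the rule of matching "dev" only as a hyphen-separated part of the short host name are assumptions, not anything taken from the puppet repository; the actual change would presumably live in the puppet manifests.

```python
def should_monitor(hostname: str) -> bool:
    """Return False for hosts whose short name marks them as dev machines.

    Hypothetical helper: the naming rule below is an assumption based on
    host names seen in the log (restbase-dev1004, restbase-dev1005, ...).
    """
    # Drop any domain part: "restbase-dev1006.eqiad.wmnet" -> "restbase-dev1006"
    short_name = hostname.split(".", 1)[0]
    # Drop the trailing host number: "restbase-dev1006" -> "restbase-dev"
    role = short_name.rstrip("0123456789")
    # Treat "dev" as a marker only when it is a whole hyphen-separated part,
    # so a name that merely starts with "dev" (e.g. "devtools") is still monitored.
    return "dev" not in role.split("-")


if __name__ == "__main__":
    for host in ("restbase-dev1004", "restbase-dev1005", "restbase1007.eqiad.wmnet"):
        print(host, "monitor" if should_monitor(host) else "skip")
```

Matching a whole name part is stricter than a plain substring test; a substring test would be closer to the wording in the channel but would also skip any host that happens to contain "dev" elsewhere in its name.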