[00:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190221T0000).
[00:00:05] davidwbarratt: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:58] here!
[00:01:46] I can SWAT
[00:02:16] (PS2) Catrope: Enable partial blocks on Meta Wiki and MediaWiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/491291 (https://phabricator.wikimedia.org/T216065) (owner: Dbarratt)
[00:02:32] thanks!
[00:02:55] (CR) Catrope: [C: +2] Enable partial blocks on Meta Wiki and MediaWiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/491291 (https://phabricator.wikimedia.org/T216065) (owner: Dbarratt)
[00:03:27] (PS1) BryanDavis: toolforge: Allow pods to mount /mnt/nfs [puppet] - https://gerrit.wikimedia.org/r/491877 (https://phabricator.wikimedia.org/T193646)
[00:04:02] (Merged) jenkins-bot: Enable partial blocks on Meta Wiki and MediaWiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/491291 (https://phabricator.wikimedia.org/T216065) (owner: Dbarratt)
[00:05:47] (CR) Andrew Bogott: [C: +2] toolforge: Allow pods to mount /mnt/nfs [puppet] - https://gerrit.wikimedia.org/r/491877 (https://phabricator.wikimedia.org/T193646) (owner: BryanDavis)
[00:06:20] davidwbarratt: It's on mwdebug1002, please test
[00:06:30] you got it
[00:07:07] BEAUTIFUL!
[00:07:15] looks perfect to me
[00:08:08] Operations, Release-Engineering-Team, Scap: Upgrade scap debian package to 3.9.0-1 - https://phabricator.wikimedia.org/T216666 (thcipriani)
[00:08:58] (CR) jenkins-bot: Enable partial blocks on Meta Wiki and MediaWiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/491291 (https://phabricator.wikimedia.org/T216065) (owner: Dbarratt)
[00:10:22] (PS1) Thcipriani: Scap: upgrade to 3.9.0-1 [puppet] - https://gerrit.wikimedia.org/r/491879 (https://phabricator.wikimedia.org/T216666)
[00:16:23] davidwbarratt: Whoops, sorry for the delay, got distracted finishing something else. Deploying now
[00:16:45] RoanKattouw thanks!
[00:17:37] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable partial blocks on metawiki and mediawikiwiki (T216065) (duration: 00m 54s)
[00:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:40] T216065: Enable partial blocks on Meta Wiki and MediaWiki.org on Tues Feb 19 - https://phabricator.wikimedia.org/T216065
[00:21:58] let me know when it's done. :)
[00:25:48] RoanKattouw ?
[00:26:47] Sorry, that stashbot message was it, it's done
[00:26:54] YAY!
[00:27:00] And that's SWAT over too
[00:27:04] RoanKattouw thanks!
[00:45:25] Just a heads-up: I will be taking phabricator down for a long-overdue upgrade in about 15 minutes
[00:51:17] * bd808 takes that as a cue to stop working for the day
[00:53:41] twentyafterfour will you be updating phabricator/deployment too?
[00:54:23] paladox: yes
[00:54:29] ok :)
[01:00:04] twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190221T0100).
[01:01:11] !log set downtime in icinga for phab100*
[01:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:57] Operations, ops-codfw: wtp2020: correctable memory errors - https://phabricator.wikimedia.org/T205712 (RobH) So, this has a warranty of Jan. 19, 2018, so it is out of warranty. Best we can do is see if the slot is bad or dimm, and remove a bad dimm.
[01:15:56] !log Taking phabricator offline momentarily for upgrade
[01:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:02] twentyafterfour I get "{"error":{"root_cause":[{"type":"invalid_type_name_exception","reason":"mapping type name [_doc] can't start with '_'"}],"type":"invalid_type_name_exception","reason":"mapping type name [_doc] can't start with '_'"},"status":400} at [/src/future/http/BaseHTTPFuture.php:351]" with elasticsearch
[01:18:37] Operations, ops-codfw: wtp2020: correctable memory errors - https://phabricator.wikimedia.org/T205712 (RobH) No logged errors in SEL: ` 4 $> ssh root@wtp2020.mgmt.codfw.wmnet root@wtp2020.mgmt.codfw.wmnet's password: /admin1-> racadm getsel Record: 1 Date/Time: 01/15/2015 23:03:58 Source:...
[01:18:57] paladox: hmm
[01:23:15] paladox: which version of elastic?
[01:23:21] * paladox checks
[01:23:43] 5.6.13
[01:24:03] * twentyafterfour followed all of the guidelines for elasticsearch 6, which I would think would be backwards compatible but apparently not so
[01:24:29] * paladox upgrades to 5.6.15
[01:24:57] I guess I'll revert just that file back to the elastic 5.x version
[01:25:15] twentyafterfour oh, so it was for es 6?
[01:25:24] couldn't we use our version check?
[01:29:14] twentyafterfour I think apache needs restarting for the ui changes?
[01:31:44] paladox: I'm not sure
[01:31:54] I always restart apache
[01:31:58] hmm
[01:32:10] paladox: I just pushed a revert for the last es6 commit, does it work for you now?
[01:32:18] * paladox tests
[01:33:37] twentyafterfour though https://phabricator.wikimedia.org/source/gerrit/manage/ is not showing the redesigned ui
[01:33:44] but https://phab.wmflabs.org/diffusion/1/manage/ is
[01:34:44] twentyafterfour yup that works
[01:34:48] paladox: I haven't updated production yet because you pasted that elasticsearch error
[01:34:53] ah ok
[01:35:04] so it looks good to you now?
[01:36:17] yup
[01:36:25] bin/search init
[01:36:25] Initializing search service "Elasticsearch".
[01:36:26] Service index does not exist, creating...
[01:36:26] Done.
[01:36:26] Service initialization complete.
[01:38:35] !log now taking phabricator offline for upgrade
[01:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:43:47] !log running phabricator database schema changes
[01:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:45:53] ah
[01:46:44] This is likely to be slow because there is quite a backlog of schema changes
[01:46:58] twentyafterfour: have an eta?
[01:47:50] XioNoX: not really, hopefully it won't take very long but there are a lot of schema changes and apparently one of them changed the type of a column which requires rebuilding the table
[01:48:17] I'd say at least another 5 minutes
[01:49:03] that's cool, was worried if it was more than 30 :)
[01:59:55] PROBLEM - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 730.15 seconds
[02:00:23] PROBLEM - MariaDB Slave Lag: m3 on db2042 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 544.49 seconds
[02:07:49] RECOVERY - MariaDB Slave Lag: m3 on db2042 is OK: OK slave_sql_lag Replication lag: 0.42 seconds
[02:08:35] RECOVERY - MariaDB Slave Lag: m3 on db2078 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[02:21:11] PROBLEM - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:24:53] I'm getting 503s when I browse phab
[02:29:47] (PS1) Ayounsi: DNS: Add cr2-eqsin + related cr1 changes [dns] - https://gerrit.wikimedia.org/r/491888 (https://phabricator.wikimedia.org/T213121)
[02:30:50] JJMC89 hi, known (phabricator is currently being upgraded)
[02:31:20] !log phabricator upgrade finished, service appears to be returned to normal
[02:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:31:27] JJMC89: still getting errors?
[02:31:39] Thanks paladox. I was just reading the channel logs and saw it.
[02:31:44] twentyafterfour: It's back now
[02:33:27] RECOVERY - Check systemd state on phab1001 is OK: OK - running: The system is fully operational
[02:34:13] (PS1) Ayounsi: Depool eqsin for cr2-eqsin setup [dns] - https://gerrit.wikimedia.org/r/491889 (https://phabricator.wikimedia.org/T213121)
[02:40:15] PROBLEM - MariaDB Slave Lag: m3 on db1117 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 643.48 seconds
[02:43:01] (CR) Ayounsi: [C: +2] Depool eqsin for cr2-eqsin setup [dns] - https://gerrit.wikimedia.org/r/491889 (https://phabricator.wikimedia.org/T213121) (owner: Ayounsi)
[02:44:00] !log depool eqsin - T213121
[02:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:44:03] T213121: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121
[02:57:39] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received
[02:57:47] PROBLEM - parsoid on wtp2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:57:57] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received
[02:58:11] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL:
/api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [02:58:13] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [02:58:21] PROBLEM - parsoid on wtp2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:58:48] hm... [02:58:57] PROBLEM - LVS HTTP IPv4 on parsoid.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:58:58] PROBLEM - parsoid on wtp2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:59:01] PROBLEM - parsoid on wtp2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:59:01] RECOVERY - parsoid on wtp2015 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 9.533 second response time [02:59:25] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [02:59:27] PROBLEM - parsoid on wtp2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:59:29] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid_8000: Servers wtp2007.codfw.wmnet, wtp2019.codfw.wmnet, wtp2009.codfw.wmnet, wtp2008.codfw.wmnet, wtp2004.codfw.wmnet, wtp2001.codfw.wmnet, wtp2015.codfw.wmnet, wtp2005.codfw.wmnet, wtp2011.codfw.wmnet, wtp2014.codfw.wmnet, wtp2018.codfw.wmnet, wtp2012.codfw.wmnet, wtp2002.codfw.wmnet, wtp2016.codfw.wmnet, wtp2017.codfw.wmnet, wtp201 [02:59:29] p2010.codfw.wmnet, wtp2020.codfw.wmnet are marked down but pooled [02:59:36] uh [02:59:37] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:00:09] RECOVERY - restbase endpoints health on 
restbase2014 is OK: All endpoints are healthy [03:00:17] RECOVERY - parsoid on wtp2008 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 7.033 second response time [03:00:23] PROBLEM - parsoid on wtp2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:00:25] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [03:00:33] PROBLEM - parsoid on wtp2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:00:35] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy [03:00:41] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [03:00:47] RECOVERY - parsoid on wtp2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 0.023 second response time [03:00:51] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid_8000: Servers wtp2002.codfw.wmnet, wtp2019.codfw.wmnet, wtp2016.codfw.wmnet, wtp2009.codfw.wmnet, wtp2017.codfw.wmnet, wtp2013.codfw.wmnet, wtp2008.codfw.wmnet, wtp2010.codfw.wmnet, wtp2001.codfw.wmnet, wtp2015.codfw.wmnet, wtp2003.codfw.wmnet, wtp2005.codfw.wmnet, wtp2014.codfw.wmnet are marked down but pooled [03:01:13] PROBLEM - parsoid on wtp2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:17] RECOVERY - MariaDB Slave Lag: m3 on db1117 is OK: OK slave_sql_lag Replication lag: 0.16 seconds [03:01:23] PROBLEM - parsoid on wtp2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:25] RECOVERY - parsoid on wtp2005 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 3.712 second response time [03:01:29] RECOVERY - LVS HTTP IPv4 on parsoid.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 6.923 second response time [03:01:39] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:01:41] PROBLEM - parsoid on wtp2013 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:43] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wtp2018.codfw.wmnet, wtp2013.codfw.wmnet, wtp2017.codfw.wmnet, wtp2002.codfw.wmnet, wtp2010.codfw.wmnet, wtp2015.codfw.wmnet, wtp2019.codfw.wmnet, wtp2009.codfw.wmnet, wtp2008.codfw.wmnet, wtp2011.codfw.wmnet, wtp2014.codfw.wmnet]) [03:01:45] RECOVERY - parsoid on wtp2010 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 1.878 second response time [03:02:01] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [03:02:03] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [03:02:11] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [03:02:27] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:02:35] RECOVERY - parsoid on wtp2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 1.757 second response time [03:02:47] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:02:53] RECOVERY - parsoid on wtp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 3.745 second response time [03:02:55] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [03:02:57] RECOVERY - parsoid on wtp2018 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 6.957 second response time [03:02:59] PROBLEM - parsoid on wtp2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:03:31] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response 
was received
[03:03:35] OK. I'll not start script run until these errors are resolved.
[03:03:41] RECOVERY - parsoid on wtp2019 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 0.011 second response time
[03:04:03] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy
[03:04:11] RECOVERY - parsoid on wtp2015 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 6.582 second response time
[03:04:18] kart_: I think it will soon recover
[03:04:27] RECOVERY - parsoid on wtp2017 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 2.267 second response time
[03:04:39] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy
[03:04:43] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy
[03:05:15] Nice. I'll go ahead then.
[03:05:20] XioNoX: did you depool eqsin recently and I would assume traffic for Chinese wiki mostly goes there?
[03:05:37] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received
[03:06:07] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy
[03:06:37] Pchelolo [02:44:00] !log depool eqsin - T213121
[03:06:37] T213121: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121
[03:06:57] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal
[03:08:35] ok.. I know what happened. We need to have a conversation with Parsoid folks..
[03:09:15] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy
[03:09:29] Pchelolo: is this a similar problem as before with language variations?
[03:10:00] cdanis: you get a gold star for guessing on the first attempt
[03:10:11] https://grafana.wikimedia.org/d/000000068/restbase?panelId=21&fullscreen&orgId=1
[03:10:57] what's up?
[03:11:17] paladox: ?
[03:11:21] er Pchelolo ?
[03:11:33] PROBLEM - parsoid on wtp2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:11:59] XioNoX: not much. As soon as we lost eqsin all the Chinese wiki mobile traffic started hitting parsoid for language variant transformations
[03:12:07] and parsoid doesn't like that
[03:12:39] RECOVERY - parsoid on wtp2005 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 0.287 second response time
[03:13:22] I think it should get ok soonish
[03:15:26] !log Fifth manual run of unpublished draft purge script for ContentTranslation (T216470)
[03:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:15:29] T216470: Fifth manual run of unpublished draft purge script - https://phabricator.wikimedia.org/T216470
[03:16:42] Pchelolo: depooling a site should not cause anything... definitely worth a task
[03:17:13] yeye XioNoX for sure. This is not the first time this particular feature crumbles under a bit of pressure
[03:17:28] I’m guessing most of the Chinese language variant queries lived in the eqsin varnishes?
[03:17:31] there's a task already somewhere, I'll just reprioritize it
[03:18:31] ACKNOWLEDGEMENT - HP RAID on db2050 is CRITICAL: CRITICAL: Slot 0: Predictive Failure: 1I:1:4, 1I:1:7 - Failed: 1I:1:1 - OK: 1I:1:2, 1I:1:3, 1I:1:5, 1I:1:6, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T216670
[03:18:37] Operations, ops-codfw: Degraded RAID on db2050 - https://phabricator.wikimedia.org/T216670 (ops-monitoring-bot)
[03:21:09] !log replace cp5010 disk 1 - T214274
[03:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:21:12] T214274: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274
[03:35:32] Pchelolo: can you cc me on the task if I’m not already?
[03:50:37] (03CR) 10Ayounsi: [C: 03+2] DNS: Add cr2-eqsin + related cr1 changes [dns] - 10https://gerrit.wikimedia.org/r/491888 (https://phabricator.wikimedia.org/T213121) (owner: 10Ayounsi) [03:51:03] (03PS2) 10Ayounsi: DNS: Add cr2-eqsin + related cr1 changes [dns] - 10https://gerrit.wikimedia.org/r/491888 (https://phabricator.wikimedia.org/T213121) [03:59:16] cdanis: oh sorry, I stepped away. Will do, I'll probably dig out a ticket tomorrow morning [04:03:16] cdanis: I believe T214099 should be the task to update [04:03:17] T214099: Stress test Parsoid's HTTP API - https://phabricator.wikimedia.org/T214099 [04:06:42] (03PS1) 10Ayounsi: DNS: add cr2-eqsin mgmt A [dns] - 10https://gerrit.wikimedia.org/r/491894 (https://phabricator.wikimedia.org/T213121) [04:08:09] (03CR) 10Ayounsi: [C: 03+2] DNS: add cr2-eqsin mgmt A [dns] - 10https://gerrit.wikimedia.org/r/491894 (https://phabricator.wikimedia.org/T213121) (owner: 10Ayounsi) [04:16:16] 10Operations, 10ops-eqiad, 10ops-eqsin, 10netops, 10Patch-For-Review: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (10ayounsi) [04:16:38] !log Unplug Tata/NTT/PCCW from cr1-eqsin - T213121 [04:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:16:42] T213121: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 [04:22:49] 10Operations, 10ops-eqiad, 10ops-eqsin, 10netops, 10Patch-For-Review: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (10ayounsi) [04:25:21] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 194 probes of 428 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [04:30:18] !log Finished: Fifth manual run of unpublished draft purge script for ContentTranslation (T216470) [04:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:30:22] T216470: Fifth manual run of unpublished draft purge script - 
https://phabricator.wikimedia.org/T216470
[04:30:31] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 3 probes of 428 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[05:33:15] RECOVERY - MD RAID on cp5010 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[05:34:21] !log rebooting cp5010 for device name on swapped disk (depooled) - T214274
[05:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:25] T214274: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274
[05:36:25] Operations, ops-eqiad, ops-eqsin, netops, Patch-For-Review: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (ayounsi)
[05:41:58] !log removing cp5010 downtimes from icinga - T214274
[05:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:02] T214274: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274
[05:44:55] or not, stupid auth
[05:46:15] there we go
[05:46:44] !log repooling cp5010 - T214274
[05:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:34] Operations, ops-eqsin, Traffic: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (BBlack) Open→Resolved Seems to be working fine after replacement!
[06:02:29] Operations, ops-codfw, DBA: Degraded RAID on db2050 - https://phabricator.wikimedia.org/T216670 (Marostegui) p:Triage→Normal a:Papaul Let's get the disk changed @Papaul - thanks!
[06:14:15] PROBLEM - Host cp5005 is DOWN: PING CRITICAL - Packet loss = 100%
[06:14:15] PROBLEM - Host lvs5003 is DOWN: PING CRITICAL - Packet loss = 100%
[06:14:25] PROBLEM - Host cp5004 is DOWN: PING CRITICAL - Packet loss = 100%
[06:14:25] PROBLEM - Host cp5009 is DOWN: PING CRITICAL - Packet loss = 100%
[06:14:27] RECOVERY - Host lvs5003 is UP: PING OK - Packet loss = 0%, RTA = 196.68 ms
[06:14:29] PROBLEM - Host cp5007 is DOWN: PING CRITICAL - Packet loss = 100%
[06:14:31] :|
[06:14:38] XioNoX: is that you I guess?
[06:14:41] RECOVERY - Host cp5004 is UP: PING WARNING - Packet loss = 44%, RTA = 196.41 ms
[06:14:55] RECOVERY - Host cp5005 is UP: PING OK - Packet loss = 0%, RTA = 203.97 ms
[06:14:56] most likely yeah
[06:15:10] eqsin is depooled, I was configuring VRRP
[06:15:11] RECOVERY - Host cp5007 is UP: PING OK - Packet loss = 0%, RTA = 195.54 ms
[06:15:11] RECOVERY - Host cp5009 is UP: PING WARNING - Packet loss = 66%, RTA = 197.89 ms
[06:15:22] XioNoX: cool, thanks :)
[06:15:22] yeah looks like 1 rack flapped or something
[06:15:49] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga2001 is CRITICAL: cluster=cache_upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[06:16:27] PROBLEM - HTTP availability for Varnish at eqsin on icinga2001 is CRITICAL: job={varnish-text,varnish-upload} site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[06:18:11] (PS1) Marostegui: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/491900 (https://phabricator.wikimedia.org/T210713)
[06:19:21] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/491900 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[06:20:29] (Merged) jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] -
10https://gerrit.wikimedia.org/r/491900 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [06:21:19] RECOVERY - HTTP availability for Varnish at eqsin on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [06:21:44] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1123 T210713 (duration: 00m 57s) [06:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:48] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [06:21:56] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491900 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [06:21:57] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [06:22:08] !log Deploy schema change on db1123 - T210713 [06:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:05] 10Operations, 10ops-eqiad, 10ops-eqsin, 10netops, 10Patch-For-Review: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (10ayounsi) [06:29:38] (03PS1) 10Marostegui: db-eqiad.php: Not use db1103,5:3312 on main traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491901 (https://phabricator.wikimedia.org/T216656) [06:31:35] PROBLEM - puppet last run on dbproxy1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [06:31:41] 10Operations, 10CirrusSearch, 10serviceops, 10Discovery-Search (Current work), 10Patch-For-Review: Find an alternative to HHVM curl connection pooling for PHP 7 - https://phabricator.wikimedia.org/T210717 (10Joe) 05Open→03Resolved This task is resolved per-se, we still might need the mw-config patche... [06:31:51] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Joe) [06:42:32] 10Operations, 10serviceops, 10User-Joe: Set up A/B testing mechanism for PHP7, - https://phabricator.wikimedia.org/T216676 (10Joe) p:05Triage→03Normal [06:43:25] 10Operations, 10serviceops, 10User-Joe: Set up A/B testing mechanism for PHP7, - https://phabricator.wikimedia.org/T216676 (10Joe) [06:46:36] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491902 [06:50:55] 10Operations, 10ops-eqiad, 10ops-eqsin, 10netops, 10Patch-For-Review: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (10ayounsi) [06:57:41] RECOVERY - puppet last run on dbproxy1010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:58:03] (03PS1) 10Ayounsi: DNS: cr2-eqsin A, cr11->cr2 renames where needed [dns] - 10https://gerrit.wikimedia.org/r/491903 (https://phabricator.wikimedia.org/T213121) [06:58:29] (03CR) 10Ayounsi: [C: 03+2] DNS: cr2-eqsin A, cr11->cr2 renames where needed [dns] - 10https://gerrit.wikimedia.org/r/491903 (https://phabricator.wikimedia.org/T213121) (owner: 10Ayounsi) [06:59:53] (03PS2) 10Ayounsi: DNS: cr2-eqsin A, cr1->cr2 renames where needed [dns] - 10https://gerrit.wikimedia.org/r/491903 (https://phabricator.wikimedia.org/T213121) [07:00:36] (03CR) 10Ayounsi: [C: 03+2] DNS: cr2-eqsin A, cr1->cr2 renames where needed [dns] - 10https://gerrit.wikimedia.org/r/491903 
(https://phabricator.wikimedia.org/T213121) (owner: 10Ayounsi) [07:02:09] (03CR) 10Ayounsi: [C: 03+2] Monitoring: add cr2-eqsin [puppet] - 10https://gerrit.wikimedia.org/r/490518 (https://phabricator.wikimedia.org/T213121) (owner: 10Ayounsi) [07:02:20] (03PS2) 10Ayounsi: Monitoring: add cr2-eqsin [puppet] - 10https://gerrit.wikimedia.org/r/490518 (https://phabricator.wikimedia.org/T213121) [07:06:46] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491902 (owner: 10Marostegui) [07:07:53] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491902 (owner: 10Marostegui) [07:08:06] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491902 (owner: 10Marostegui) [07:08:58] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1123 T210713 (duration: 00m 55s) [07:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:01] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [07:09:27] (03PS1) 10Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491906 (https://phabricator.wikimedia.org/T210713) [07:10:45] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491906 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [07:11:47] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491906 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [07:12:54] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1077 T210713 (duration: 00m 56s) [07:12:56] !log Deploy schema change on db1077 - this will generate lag on labsdb:s3 T210713 [07:12:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:13] PROBLEM - Juniper alarms on cr2-eqsin is CRITICAL: JNX_ALARMS CRITICAL - The requested table is empty or does not exist https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [07:16:21] (03PS2) 10Ammarpad: Add new namespaces for th.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491054 (https://phabricator.wikimedia.org/T216322) [07:18:20] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491906 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [07:24:49] 10Operations, 10ops-eqiad, 10ops-eqsin, 10netops, 10Patch-For-Review: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (10ayounsi) [07:25:34] (03PS3) 10Ammarpad: Increase default thumb size to 260px on Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490395 (https://phabricator.wikimedia.org/T215106) [07:25:50] (03PS4) 10Ammarpad: Set wgArticleCountMethod='any' for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487115 [07:26:04] (03PS11) 10Ammarpad: Add 'Author' namespace in Sanskrit Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553) [07:26:45] (03PS3) 10Elukey: Add profile::analytics::refinery to notebook100[3,4] and stat1006 [puppet] - 10https://gerrit.wikimedia.org/r/491824 (https://phabricator.wikimedia.org/T212386) [07:27:51] (03CR) 10Elukey: [C: 03+2] Add profile::analytics::refinery to notebook100[3,4] and stat1006 [puppet] - 10https://gerrit.wikimedia.org/r/491824 (https://phabricator.wikimedia.org/T212386) (owner: 10Elukey) [07:36:55] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/var/log/refinery] [07:37:19] puppet broken on notebook + stat1006 is me :) [07:37:23] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/log/refinery] [07:39:25] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 second ago with 1 failures. Failed resources (up to 3 shown): File[/var/log/refinery] [07:39:25] (03PS1) 10Elukey: role::swap: add analytics-admins admin group [puppet] - 10https://gerrit.wikimedia.org/r/491908 (https://phabricator.wikimedia.org/T212386) [07:40:06] (03CR) 10Elukey: [C: 03+2] role::swap: add analytics-admins admin group [puppet] - 10https://gerrit.wikimedia.org/r/491908 (https://phabricator.wikimedia.org/T212386) (owner: 10Elukey) [07:44:08] !log rolling out remaining systemd security updates on jessie [07:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:19] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:49] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:51:19] 10Operations, 10Discovery-Search, 10Elasticsearch: Enable nginx prometheus metrics for all elastic nodes - https://phabricator.wikimedia.org/T216681 (10Mathew.onipe) [07:51:35] 10Operations, 10Discovery-Search, 10Elasticsearch: Enable nginx prometheus metrics for all elastic nodes - https://phabricator.wikimedia.org/T216681 (10Mathew.onipe) p:05Triage→03Normal [07:53:36] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Enable nginx prometheus metrics for all elastic nodes - https://phabricator.wikimedia.org/T216681 (10Mathew.onipe) [07:55:05] (03PS1) 10Elukey: profile::analytics::refinery: ensure log dir only for 'hdfs' [puppet] - 10https://gerrit.wikimedia.org/r/491909 
(https://phabricator.wikimedia.org/T212386) [07:55:40] (03CR) 10jerkins-bot: [V: 04-1] profile::analytics::refinery: ensure log dir only for 'hdfs' [puppet] - 10https://gerrit.wikimedia.org/r/491909 (https://phabricator.wikimedia.org/T212386) (owner: 10Elukey) [07:57:07] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga2001 is CRITICAL: cluster=cache_upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:57:09] (03PS2) 10Elukey: profile::analytics::refinery: ensure log dir only for 'hdfs' [puppet] - 10https://gerrit.wikimedia.org/r/491909 (https://phabricator.wikimedia.org/T212386) [07:58:01] PROBLEM - HTTP availability for Varnish at eqsin on icinga2001 is CRITICAL: job=varnish-upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [07:58:13] (03CR) 10Elukey: [C: 03+2] "Andrew: didn't see any reason to deploy logdir + logrotate config in a non hadoop use case (like stat1006), but if you don't like the solu" [puppet] - 10https://gerrit.wikimedia.org/r/491909 (https://phabricator.wikimedia.org/T212386) (owner: 10Elukey) [08:01:41] RECOVERY - HTTP availability for Varnish at eqsin on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [08:02:01] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:02:46] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/491837 (https://phabricator.wikimedia.org/T216506) (owner: 10Andrew Bogott) [08:03:27] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [08:04:35] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10MaxSem) [08:14:07] (03PS3) 10Ammarpad: Add new namespaces for th.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491054 (https://phabricator.wikimedia.org/T216322) [08:30:34] (03CR) 10Marostegui: mariadb: Add the option of postprocessing backups (034 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [08:31:45] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491913 [08:33:55] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491913 (owner: 10Marostegui) [08:34:55] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491913 (owner: 10Marostegui) [08:35:58] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1077 T210713 (duration: 00m 53s) [08:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:02] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [08:36:53] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491913 (owner: 10Marostegui) [08:37:27] (03PS1) 10Marostegui: db-eqiad.php: Depool db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491914 [08:38:28] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1075 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/491914 (owner: 10Marostegui) [08:39:30] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491914 (owner: 10Marostegui) [08:40:36] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1075 T210713 (duration: 00m 54s) [08:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:46] !log Deploy schema change on db1075 - T210713 [08:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:48] (03PS5) 10GTirloni: quarry: Setup CSP http header [puppet] - 10https://gerrit.wikimedia.org/r/491377 (https://phabricator.wikimedia.org/T214637) (owner: 10Framawiki) [08:46:36] (03PS4) 10Ammarpad: Add new namespaces for th.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491054 (https://phabricator.wikimedia.org/T216322) [08:47:57] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491914 (owner: 10Marostegui) [08:49:23] (03CR) 10Gehel: elasticsearch: add script to execute systemctl on each elasticsearch instance (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/491850 (https://phabricator.wikimedia.org/T207920) (owner: 10Gehel) [08:50:41] (03PS6) 10Gehel: elasticsearch: add a file with the list on elasticsearch instances [puppet] - 10https://gerrit.wikimedia.org/r/491850 (https://phabricator.wikimedia.org/T207920) [08:50:57] (03PS7) 10Gehel: elasticsearch: add a file with the list on elasticsearch instances [puppet] - 10https://gerrit.wikimedia.org/r/491850 (https://phabricator.wikimedia.org/T207920) [08:54:15] (03PS5) 10Gehel: elasticsearch: systemctl iterates explicitly on elasticsearch instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/491808 (https://phabricator.wikimedia.org/T207920) [08:54:35] (03CR) 10Gehel: [C: 03+2] elasticsearch: get_next_clusters_nodes raises ElasticsearchClusterError [software/spicerack] - 
10https://gerrit.wikimedia.org/r/491803 (https://phabricator.wikimedia.org/T207920) (owner: 10Gehel) [08:54:43] (03CR) 10Ammarpad: "> Patch Set 1:" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491054 (https://phabricator.wikimedia.org/T216322) (owner: 10Ammarpad) [08:55:32] (03CR) 10Ammarpad: "> > Patch Set 1:" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491054 (https://phabricator.wikimedia.org/T216322) (owner: 10Ammarpad) [08:55:34] (03CR) 10jenkins-bot: elasticsearch: get_next_clusters_nodes raises ElasticsearchClusterError [software/spicerack] - 10https://gerrit.wikimedia.org/r/491803 (https://phabricator.wikimedia.org/T207920) (owner: 10Gehel) [08:59:24] gehel: didn't we go for the puppet-deployed bash script here? ^^^ [08:59:56] well, dcausse proposed using xargs, which makes a lot of sense to me [09:00:20] and with that, the script becomes a oneliner that makes just as much sense and is more explicit in spicerack [09:00:32] at least IMHO [09:00:40] but I can put that oneliner in a file [09:02:06] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Package citoid version 0.0.1 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/491776 (owner: 10Alexandros Kosiaris) [09:02:34] gehel: fair enough, xargs at least DTRT as far as exit codes go [09:03:39] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/491808 (https://phabricator.wikimedia.org/T207920) (owner: 10Gehel) [09:12:30] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1075" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491918 [09:14:59] !log temporarily stop prometheus@labs.service on labmon for journald restarts (part of security update) [09:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:38] (03CR) 10DCausse: cloudelastic: Add cloudelastic configs (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) 
(owner: 10Mathew.onipe) [09:21:31] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10akosiaris) >>! In T211881#4969064, @Jhernandez wrote: >>>! In T211881#4954470, @akosiaris wrote: >>>>! In... [09:22:16] (03CR) 10Filippo Giunchedi: Add setup.py and tox.ini (031 comment) [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/491768 (https://phabricator.wikimedia.org/T216253) (owner: 10Filippo Giunchedi) [09:23:17] PROBLEM - graphite-labs.wikimedia.org render on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:23:25] PROBLEM - graphite-labs.wikimedia.org api on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:23:53] that's expected I think ^ [09:25:46] (03CR) 10GTirloni: [C: 03+2] quarry: Setup CSP http header [puppet] - 10https://gerrit.wikimedia.org/r/491377 (https://phabricator.wikimedia.org/T214637) (owner: 10Framawiki) [09:26:25] PROBLEM - graphite-labs.wikimedia.org api on labmon1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:26:25] PROBLEM - graphite-labs.wikimedia.org render on labmon1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:26:51] 10Operations, 10Wikimedia-Logstash: Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (10fgiunchedi) >>! In T205856#4968542, @Ottomata wrote: >> re: the open question itself I'm leaning towards having json on kafka > > Yes please! > >> The... [09:27:20] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1075" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491918 (owner: 10Marostegui) [09:27:27] PROBLEM - puppet last run on cloudvirtan1002 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 4 minutes ago with 5 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Service[rsyslog],Exec[x509-bundle labvirt-star.eqiad.wmnet-chained],Exec[x509-bundle labvirt-star.eqiad.wmnet-chain] [09:27:37] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add ganeti read-only user deployment [puppet] - 10https://gerrit.wikimedia.org/r/490397 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [09:28:30] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1075" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491918 (owner: 10Marostegui) [09:29:32] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1075 T210713 (duration: 00m 52s) [09:29:32] !log Deploy schema change on s3 primary master (db1078) - T210713 [09:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:35] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [09:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:54] (03PS1) 10Marostegui: filtered_tables.txt: Remove change_tag.ct_tag column [puppet] - 10https://gerrit.wikimedia.org/r/491920 (https://phabricator.wikimedia.org/T210713) [09:32:17] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10fgiunchedi) >>! In T213976#4968603, @Ottomata wrote: > Alright, I'm not familiar with Swift, but if we were to do t... [09:32:52] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1075" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491918 (owner: 10Marostegui) [09:33:30] (03CR) 10Mathew.onipe: [C: 03+1] "minor comment." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/491850 (https://phabricator.wikimedia.org/T207920) (owner: 10Gehel) [09:34:49] RECOVERY - graphite-labs.wikimedia.org api on labmon1002 is OK: HTTP OK: HTTP/1.1 200 OK - 337 bytes in 0.120 second response time [09:34:53] RECOVERY - graphite-labs.wikimedia.org render on labmon1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1661 bytes in 0.160 second response time [09:35:19] 10Operations, 10ops-eqiad, 10DC-Ops: icinga1001 mysterious reboots - https://phabricator.wikimedia.org/T210108 (10Volans) [09:35:27] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Volans) 05Resolved→03Open icinga1001 is unresponsive this morning (no ping, no ssh, black console), re-opening [09:35:28] (03CR) 10Gehel: elasticsearch: add a file with the list on elasticsearch instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/491850 (https://phabricator.wikimedia.org/T207920) (owner: 10Gehel) [09:35:43] !log force rebooting unresponsive icinga1001 T214760 [09:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:46] T214760: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 [09:36:47] RECOVERY - graphite-labs.wikimedia.org api on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 337 bytes in 0.121 second response time [09:36:47] RECOVERY - graphite-labs.wikimedia.org render on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1661 bytes in 0.153 second response time [09:37:27] 10Operations, 10Elasticsearch, 10Maps, 10Discovery-Search (Current work): Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812 (10Mathew.onipe) There's need to add more metrics to upstream and also enable support for groups indices stats: https://github.com/justwatchcom/elasti... 
[09:38:04] 10Operations, 10DBA, 10Patch-For-Review, 10Performance-Team (Radar): Increase parsercache keys TTL from 22 days back to 30 days - https://phabricator.wikimedia.org/T210992 (10Marostegui) In a couple of days there it will be a month since I switched the TTL from 22 days to 24. There has not been any issues... [09:38:25] (03CR) 10DCausse: [C: 03+1] elasticsearch: add a file with the list on elasticsearch instances [puppet] - 10https://gerrit.wikimedia.org/r/491850 (https://phabricator.wikimedia.org/T207920) (owner: 10Gehel) [09:39:27] (03PS2) 10Alexandros Kosiaris: Fix iteration of secret values in scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/491821 [09:40:03] 10Operations, 10ops-eqsin, 10Traffic: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (10ayounsi) return shipment ticket 1-185737841426 opened with Equinix, DHL should pick up the defective disk in the next few days. [09:41:29] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Volans) Hardware logs; `lang=bash $ sudo ipmi-sel [... SNIP ...] 5 | Feb-20-2019 | 13:34:48 | CPU Machine Chk | Processor | transition to Non-recoverable ; OE... 
[09:42:33] (03CR) 10Muehlenhoff: [C: 03+1] Add setup.py and tox.ini (031 comment) [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/491768 (https://phabricator.wikimedia.org/T216253) (owner: 10Filippo Giunchedi) [09:42:59] (03CR) 10Filippo Giunchedi: [C: 03+2] debian: add dh-python/pybuild [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/491770 (owner: 10Filippo Giunchedi) [09:43:01] (03CR) 10Filippo Giunchedi: [C: 03+2] Add missing metrics help text, required for prometheus 2.0 [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/491772 (owner: 10Filippo Giunchedi) [09:43:19] (03CR) 10Filippo Giunchedi: [C: 03+2] Add setup.py and tox.ini [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/491768 (https://phabricator.wikimedia.org/T216253) (owner: 10Filippo Giunchedi) [09:43:25] (03CR) 10Filippo Giunchedi: [C: 03+2] Reformat with black + isort [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/491769 (owner: 10Filippo Giunchedi) [09:43:29] (03CR) 10Filippo Giunchedi: [C: 03+2] Add missing metrics help text [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/491771 (https://phabricator.wikimedia.org/T216253) (owner: 10Filippo Giunchedi) [09:43:46] 10Operations, 10DBA, 10Patch-For-Review, 10Performance-Team (Radar): Increase parsercache keys TTL from 22 days back to 30 days - https://phabricator.wikimedia.org/T210992 (10jcrespo) @Marostegui Did the hit rate increase? 
[09:44:33] 10Operations, 10DBA, 10Patch-For-Review, 10Performance-Team (Radar): Increase parsercache keys TTL from 22 days back to 30 days - https://phabricator.wikimedia.org/T210992 (10Marostegui) There is no significant increase that can be seen on the graphs, but also 2 days might be too low to notice something [09:45:20] (03PS2) 10Filippo Giunchedi: Add missing metrics help text, required for prometheus 2.0 [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/491772 [09:46:03] (03PS3) 10Alexandros Kosiaris: Fix iteration of secret values in scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/491821 [09:46:40] (03CR) 10Filippo Giunchedi: [C: 03+2] Add missing metrics help text, required for prometheus 2.0 [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/491772 (owner: 10Filippo Giunchedi) [09:46:50] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Add missing metrics help text [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/491771 (https://phabricator.wikimedia.org/T216253) (owner: 10Filippo Giunchedi) [09:46:55] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Reformat with black + isort [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/491769 (owner: 10Filippo Giunchedi) [09:47:05] (03PS8) 10Gehel: elasticsearch: add a file with the list on elasticsearch instances [puppet] - 10https://gerrit.wikimedia.org/r/491850 (https://phabricator.wikimedia.org/T207920) [09:47:09] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Add setup.py and tox.ini [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/491768 (https://phabricator.wikimedia.org/T216253) (owner: 10Filippo Giunchedi) [09:47:16] 10Operations, 10ops-eqiad, 10ops-eqsin, 10netops, 10Patch-For-Review: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (10ayounsi) Also audited all the spares we have onsite: 
https://docs.google.com/spreadsheets/d/1FKYVQJePjTQ7nVwYv4oDC6Gszk7RLrkq5ySN0fjvSoY/edit#gid=2057953856 Labelled the f... [09:48:28] (03CR) 10Gehel: [C: 03+2] elasticsearch: add a file with the list on elasticsearch instances [puppet] - 10https://gerrit.wikimedia.org/r/491850 (https://phabricator.wikimedia.org/T207920) (owner: 10Gehel) [09:53:35] RECOVERY - puppet last run on cloudvirtan1002 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [09:54:27] (03CR) 10Fsero: [C: 03+1] Fix iteration of secret values in scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/491821 (owner: 10Alexandros Kosiaris) [09:55:36] (03PS2) 10Gehel: Turn off proxy_intercept_errors for nginx [puppet] - 10https://gerrit.wikimedia.org/r/491870 (https://phabricator.wikimedia.org/T214032) (owner: 10Smalyshev) [09:56:48] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/491821 (owner: 10Alexandros Kosiaris) [09:58:27] 10Operations, 10ops-eqsin, 10Traffic: amber light on cp5006/5007 - https://phabricator.wikimedia.org/T216691 (10ayounsi) p:05Triage→03High [09:59:09] (03PS1) 10Filippo Giunchedi: Fix invalid metric name [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/491923 (https://phabricator.wikimedia.org/T216253) [10:02:05] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Fix invalid metric name [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/491923 (https://phabricator.wikimedia.org/T216253) (owner: 10Filippo Giunchedi) [10:04:37] (03CR) 10Gehel: [C: 03+2] elasticsearch: systemctl iterates explicitly on elasticsearch instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/491808 (https://phabricator.wikimedia.org/T207920) (owner: 10Gehel) [10:04:49] !log create cxserver namespace on kubernetes eqiad codfw staging clusters T213195 [10:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:52] T213195: 
Migrate cxserver to kubernetes - https://phabricator.wikimedia.org/T213195 [10:04:55] !log create citoid namespace on kubernetes eqiad codfw staging clusters T213194 [10:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:58] T213194: Migrate citoid to kubernetes - https://phabricator.wikimedia.org/T213194 [10:05:35] (03CR) 10jenkins-bot: elasticsearch: systemctl iterates explicitly on elasticsearch instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/491808 (https://phabricator.wikimedia.org/T207920) (owner: 10Gehel) [10:06:39] (03PS1) 10Ayounsi: Icinga: add cr2-eqsin mgmt interface [puppet] - 10https://gerrit.wikimedia.org/r/491925 (https://phabricator.wikimedia.org/T213121) [10:10:22] (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/14765/icinga2001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/491925 (https://phabricator.wikimedia.org/T213121) (owner: 10Ayounsi) [10:13:02] (03PS1) 10Mahveotm: [bugfix] disable crosswiki upload till a solution is found for the broken images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491928 (https://phabricator.wikimedia.org/T214230) [10:18:17] (03CR) 10Ladsgroup: [C: 03+1] "There's master of s1 left but everywhere else it's dropped \o/" [puppet] - 10https://gerrit.wikimedia.org/r/491920 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [10:18:42] (03CR) 10Marostegui: [C: 03+2] filtered_tables.txt: Remove change_tag.ct_tag column [puppet] - 10https://gerrit.wikimedia.org/r/491920 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [10:18:50] (03PS2) 10Marostegui: filtered_tables.txt: Remove change_tag.ct_tag column [puppet] - 10https://gerrit.wikimedia.org/r/491920 (https://phabricator.wikimedia.org/T210713) [10:19:18] (03CR) 10Gehel: "I'm OK in principle with this change, but we should first ensure that we have proper error handling in blazegraph, so that we don't expose" [puppet] - 
10https://gerrit.wikimedia.org/r/491870 (https://phabricator.wikimedia.org/T214032) (owner: 10Smalyshev) [10:23:40] !log on boron unblock trusty builds with umount /var/cache/pbuilder/base-trusty-amd64.cow/dev/ptmx [10:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:46] (03PS1) 10Gehel: elasticsearch: don't filter shards when checking for large shards [puppet] - 10https://gerrit.wikimedia.org/r/491930 [10:26:19] !log akosiaris@deploy1001 scap-helm list [namespace: list, clusters: eqiad,codfw] [10:26:19] !log akosiaris@deploy1001 scap-helm list cluster eqiad completed [10:26:19] !log akosiaris@deploy1001 scap-helm list cluster codfw completed [10:26:19] !log akosiaris@deploy1001 scap-helm list finished [10:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:24] (03CR) 10DCausse: [C: 03+1] elasticsearch: don't filter shards when checking for large shards [puppet] - 10https://gerrit.wikimedia.org/r/491930 (owner: 10Gehel) [10:26:32] (03CR) 10Gehel: [C: 03+2] elasticsearch: don't filter shards when checking for large shards [puppet] - 10https://gerrit.wikimedia.org/r/491930 (owner: 10Gehel) [10:26:33] sigh [10:27:56] !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-values.yaml stable/mathoid [namespace: mathoid, clusters: staging] [10:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:05] !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-values.yaml staging stable/mathoid [namespace: mathoid, clusters: staging] [10:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:08] (03PS14) 10Mathew.onipe: Add wdqs data transfer cookbook [cookbooks] - 
10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) [10:28:10] (03CR) 10Mathew.onipe: Add wdqs data transfer cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [10:28:10] !log akosiaris@deploy1001 scap-helm mathoid cluster staging completed [10:28:10] !log akosiaris@deploy1001 scap-helm mathoid finished [10:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:22] !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-staging-values.yaml staging stable/mathoid [namespace: mathoid, clusters: staging] [10:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:27] !log akosiaris@deploy1001 scap-helm mathoid cluster staging completed [10:29:27] !log akosiaris@deploy1001 scap-helm mathoid finished [10:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:54] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=DELETE https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:33:35] (03PS1) 10Volans: setup.py: add long_description_content_type [software/spicerack] - 10https://gerrit.wikimedia.org/r/491931 [10:34:20] PROBLEM - kubelet operational latencies on kubestage1001 is CRITICAL: instance=kubestage1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:34:47] (03CR) 10Gehel: [C: 03+1] "LGTM, trivial enough" [software/spicerack] - 10https://gerrit.wikimedia.org/r/491931 (owner: 10Volans) [10:35:15] (03PS7) 10Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) [10:35:32] RECOVERY - kubelet operational 
latencies on kubestage1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:36:04] (03CR) 10Mathew.onipe: cloudelastic: Add cloudelastic configs (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [10:37:52] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:38:14] (03CR) 10Volans: [C: 03+2] setup.py: add long_description_content_type [software/spicerack] - 10https://gerrit.wikimedia.org/r/491931 (owner: 10Volans) [10:39:06] (03CR) 10jenkins-bot: setup.py: add long_description_content_type [software/spicerack] - 10https://gerrit.wikimedia.org/r/491931 (owner: 10Volans) [10:42:52] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.19 [software/spicerack] - 10https://gerrit.wikimedia.org/r/491936 [10:47:02] !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-values.yaml production stable/mathoid [namespace: mathoid, clusters: eqiad,codfw] [10:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:07] !log akosiaris@deploy1001 scap-helm mathoid cluster eqiad completed [10:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:14] !log akosiaris@deploy1001 scap-helm mathoid cluster codfw completed [10:47:14] !log akosiaris@deploy1001 scap-helm mathoid finished [10:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:59] (03CR) 10Marostegui: "Not sure why I have been added to review this - I don't have any context about it :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491928 (https://phabricator.wikimedia.org/T214230) (owner: 10Mahveotm) [10:53:14] (03CR) 10Gehel: [C: 03+1] "Actually, we're already leaking stack traces 
through late errors during SPARQL queries. So this does not significantly degrade the current" [puppet] - 10https://gerrit.wikimedia.org/r/491870 (https://phabricator.wikimedia.org/T214032) (owner: 10Smalyshev) [10:53:22] (03PS3) 10Gehel: Turn off proxy_intercept_errors for nginx [puppet] - 10https://gerrit.wikimedia.org/r/491870 (https://phabricator.wikimedia.org/T214032) (owner: 10Smalyshev) [10:54:07] (03CR) 10Gehel: [C: 03+2] Turn off proxy_intercept_errors for nginx [puppet] - 10https://gerrit.wikimedia.org/r/491870 (https://phabricator.wikimedia.org/T214032) (owner: 10Smalyshev) [10:55:29] !log upgrade mathoid staging+production to latest helm chart [10:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:43] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.19 [software/spicerack] - 10https://gerrit.wikimedia.org/r/491936 (owner: 10Volans) [11:02:34] (03CR) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.19 [software/spicerack] - 10https://gerrit.wikimedia.org/r/491936 (owner: 10Volans) [11:03:20] (03PS1) 10Volans: Upstream release v0.0.19 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/491940 [11:08:38] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.19 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/491940 (owner: 10Volans) [11:11:56] !log uploaded spicerack_0.0.19-1_amd64.deb to apt.wikimedia.org stretch-wikimedia [11:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:29] !log upgraded spicerack to 0.0.19 on cumin[12]001 [11:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:31] cc gehel ^^^ [11:19:15] volans: thanks! Will break it after lunch! 
[11:19:38] eheheh [11:19:38] yw [11:21:02] (03PS3) 10Arturo Borrero Gonzalez: aptrepo: pull openstack mitaka packages into reprepro [puppet] - 10https://gerrit.wikimedia.org/r/491558 (https://phabricator.wikimedia.org/T216497) [11:21:35] (03CR) 10jerkins-bot: [V: 04-1] aptrepo: pull openstack mitaka packages into reprepro [puppet] - 10https://gerrit.wikimedia.org/r/491558 (https://phabricator.wikimedia.org/T216497) (owner: 10Arturo Borrero Gonzalez) [11:21:39] (03PS1) 10Elukey: profile::analytics::refinery::job::test::camus: fix topic whitelist [puppet] - 10https://gerrit.wikimedia.org/r/491944 (https://phabricator.wikimedia.org/T212259) [11:22:34] (03CR) 10Elukey: [C: 03+2] profile::analytics::refinery::job::test::camus: fix topic whitelist [puppet] - 10https://gerrit.wikimedia.org/r/491944 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [11:24:14] (03PS4) 10Arturo Borrero Gonzalez: aptrepo: pull openstack mitaka packages into reprepro [puppet] - 10https://gerrit.wikimedia.org/r/491558 (https://phabricator.wikimedia.org/T216497) [11:24:45] (03CR) 10jerkins-bot: [V: 04-1] aptrepo: pull openstack mitaka packages into reprepro [puppet] - 10https://gerrit.wikimedia.org/r/491558 (https://phabricator.wikimedia.org/T216497) (owner: 10Arturo Borrero Gonzalez) [11:25:24] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Volans) @RobH and it crashed again already! I'll leave it down in case @Cmjohnson wants to attach a physical console. Anyway, it's all yours, can be shutdown/reboot at will. 
[11:27:55] (03PS5) 10Arturo Borrero Gonzalez: aptrepo: pull openstack mitaka packages into reprepro [puppet] - 10https://gerrit.wikimedia.org/r/491558 (https://phabricator.wikimedia.org/T216497) [11:28:48] (03PS3) 10DCausse: [cirrus] Start using local nginx reverse proxy for connections reuse [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488895 (https://phabricator.wikimedia.org/T215491) [11:28:50] (03PS34) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [11:29:24] PROBLEM - puppet last run on cloudvirtan1002 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 6 minutes ago with 6 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Service[rsyslog],Exec[x509-bundle labvirt-star.eqiad.wmnet-chained],Exec[x509-bundle labvirt-star.eqiad.wmnet-chain] [11:32:57] (03PS6) 10Arturo Borrero Gonzalez: aptrepo: pull openstack mitaka packages into reprepro [puppet] - 10https://gerrit.wikimedia.org/r/491558 (https://phabricator.wikimedia.org/T216497) [11:34:40] <_joe_> arturo: what does "cloudvirtan" stand for? 
[11:35:05] hypervisors that run analytics only workloads [11:35:54] <_joe_> oh, ok [11:35:58] <_joe_> thanks [11:39:19] (03PS7) 10Arturo Borrero Gonzalez: aptrepo: pull openstack mitaka packages into reprepro [puppet] - 10https://gerrit.wikimedia.org/r/491558 (https://phabricator.wikimedia.org/T216497) [11:39:42] _joe_: fyi I'm going to try to ship the mw-config patch to enable the local proxy you put in place for mw -> elastic connection reuse [11:40:22] <_joe_> dcausse: cool, let me know if you want me to review it [11:41:22] _joe_: I suppose it does not hurt if you could give it another look (https://gerrit.wikimedia.org/r/488895) Erik gave me a +1 already [11:41:33] _joe_: FYI, weird case [11:41:45] https://www.irccloud.com/pastebin/HWZH2HCF/ [11:43:59] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10Ladsgroup) In general it would be great if the storage would be decoupled from the analytics cluster through an API... [11:54:36] (03PS1) 10Mathew.onipe: elasticsearch: enable prometheus collector for nginx [puppet] - 10https://gerrit.wikimedia.org/r/491948 (https://phabricator.wikimedia.org/T216681) [11:55:16] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: enable prometheus collector for nginx [puppet] - 10https://gerrit.wikimedia.org/r/491948 (https://phabricator.wikimedia.org/T216681) (owner: 10Mathew.onipe) [11:58:06] <_joe_> dcausse: I'll do it in a few, promised [11:58:15] thanks :) [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a European Mid-day SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190221T1200). [12:00:04] dcausse and Zoranzoki21: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:13] (03PS2) 10Mathew.onipe: elasticsearch: enable prometheus collector for nginx [puppet] - 10https://gerrit.wikimedia.org/r/491948 (https://phabricator.wikimedia.org/T216681) [12:00:21] Here \o/ [12:00:26] o/ [12:00:35] (03PS1) 10Elukey: profile::analytics::refinery::job::test::camus: fix checked topic [puppet] - 10https://gerrit.wikimedia.org/r/491949 (https://phabricator.wikimedia.org/T212259) [12:00:41] \o [12:00:59] !log disable puppet in install1002 to test T216497 [12:01:07] dcausse: go ahead while get ready [12:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:07] T216497: CloudVPS: workaround archival of jessie-backports repo - https://phabricator.wikimedia.org/T216497 [12:01:23] I can SWAT today, unless dcausse wants to take over the entire swat :) [12:01:29] zeljkof: I'm not ready yet :( [12:01:41] dcausse: ok, ping me when ready [12:02:04] (03CR) 10Elukey: [C: 03+2] profile::analytics::refinery::job::test::camus: fix checked topic [puppet] - 10https://gerrit.wikimedia.org/r/491949 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [12:02:10] zeljkof: While dcausse is not ready, can you do: Run namespaceDupes.php on urwiki (phab:T216667) [12:02:11] T216667: Run script to fix inconsistent titles for Urdu Wikipedia - https://phabricator.wikimedia.org/T216667 [12:03:38] Zoranzoki21: sure [12:03:42] <_joe_> dcausse: yes please hold, I would require a change to the patch :/ [12:03:52] _joe_: sure [12:04:35] 10Operations, 10ORES, 10Scoring-platform-team: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331 (10Ladsgroup) [12:04:37] zeljkof: ok, do it :) [12:04:50] (03PS1) 10Muehlenhoff: Record new MOU date for shiladsen [puppet] - 10https://gerrit.wikimedia.org/r/491950 [12:06:20] (03CR) 10Muehlenhoff: [C: 03+2] Record new MOU date for shiladsen [puppet] - 
10https://gerrit.wikimedia.org/r/491950 (owner: 10Muehlenhoff) [12:06:40] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Our tests have shown that using the proxy and not curl pools gives a slight, but not insignificant performance penalty under HHVM. I would" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488895 (https://phabricator.wikimedia.org/T215491) (owner: 10DCausse) [12:07:34] _joe_: ok makes sense [12:08:06] I'll rework my patch chain as I planned to remove the HHVM support in the next one [12:08:07] <_joe_> heh sorry I wanted to talk to you earlier in the day, but I'm lagging behind as usual [12:08:21] no problem! [12:08:27] <_joe_> yeah let's not do that for now. [12:08:55] zeljkof: I'm removing my patch, it's not ready yet [12:08:59] dcausse: ok [12:09:10] Zoranzoki21: script is done https://phabricator.wikimedia.org/T216667#4971629 [12:09:14] zeljkof: I will add one patch more at calendar [12:09:22] there are some unresolved problems [12:09:23] because dcausse removed patch [12:09:49] zeljkof: Script looks ok [12:10:11] !log T216497 import reprepro key 7638D0442B90D010 (debian archive automatic signing key (8/jessie) [12:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:19] T216497: CloudVPS: workaround archival of jessie-backports repo - https://phabricator.wikimedia.org/T216497 [12:10:49] (03PS4) 10Zoranzoki21: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491826 (https://phabricator.wikimedia.org/T216642) [12:11:21] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491699 (https://phabricator.wikimedia.org/T216563) (owner: 10Zoranzoki21) [12:11:54] zeljkof: 491699 can be directly done [12:12:07] Zoranzoki21: ok [12:12:17] no testing at mwdebug? 
[12:12:30] (03Merged) 10jenkins-bot: Disable mobile main page special casing on huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491699 (https://phabricator.wikimedia.org/T216563) (owner: 10Zoranzoki21) [12:12:38] zeljkof: No [12:12:45] (03PS2) 10Zoranzoki21: Add img.raremaps.com at wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491823 (https://phabricator.wikimedia.org/T216638) [12:13:22] zeljkof: I added some new patches at calendar [12:13:25] !log gilles@deploy1001 Started deploy [3d2png/deploy@ca39432]: Updating repo [12:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:34] So we have 6 patches [12:13:55] !log gilles@deploy1001 Finished deploy [3d2png/deploy@ca39432]: Updating repo (duration: 00m 29s) [12:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:07] !log zfilipin@deploy1001 Synchronized dblists/mobilemainpagelegacy.dblist: SWAT: [[gerrit:491699|Disable mobile main page special casing on huwiki (T216563)]] (duration: 00m 54s) [12:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:11] T216563: Disable mobile main page special casing on huwiki - https://phabricator.wikimedia.org/T216563 [12:14:23] Zoranzoki21: 491699 deployed [12:14:40] zeljkof: Ok, all other patches no need mwdebug too [12:14:45] (03PS4) 10Zfilipin: IS.php: Add wgProofreadPagePageJoiner, set it per default on '-' and at zhwikisource on __PAGEJOIN__ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482502 (https://phabricator.wikimedia.org/T205826) (owner: 10Zoranzoki21) [12:17:06] PROBLEM - puppet last run on cloudvirtan1003 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 3 minutes ago with 4 failures. 
Failed resources (up to 3 shown): Service[rsyslog],Exec[x509-bundle labvirt-star.eqiad.wmnet-chained],Exec[x509-bundle labvirt-star.eqiad.wmnet-chain],Service[nagios-nrpe-server] [12:17:23] !log enable puppet in install1002 (done testing T216497) [12:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:28] T216497: CloudVPS: workaround archival of jessie-backports repo - https://phabricator.wikimedia.org/T216497 [12:17:46] (03PS8) 10Arturo Borrero Gonzalez: aptrepo: pull openstack mitaka packages into reprepro [puppet] - 10https://gerrit.wikimedia.org/r/491558 (https://phabricator.wikimedia.org/T216497) [12:18:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: pull openstack mitaka packages into reprepro [puppet] - 10https://gerrit.wikimedia.org/r/491558 (https://phabricator.wikimedia.org/T216497) (owner: 10Arturo Borrero Gonzalez) [12:18:57] (03PS5) 10Zoranzoki21: Add category at wgGettingStartedExcludedCategories for srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482534 [12:18:59] (03CR) 10jenkins-bot: Disable mobile main page special casing on huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491699 (https://phabricator.wikimedia.org/T216563) (owner: 10Zoranzoki21) [12:19:03] (03PS6) 10Zoranzoki21: Add categories for all Croatian projects at wmgBabelMainCategory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482548 [12:20:14] Zoranzoki21: I'm looking at 482502 but I'm not familiar with wgProofreadPagePageJoiner, I'm not sure if the patch would break stuff [12:20:30] zeljkof: It wont [12:20:30] 10Operations, 10Multimedia, 10Thumbor, 10serviceops, 10Performance-Team (Radar): Deploy 3d2png to thumbor servers (stretch) - https://phabricator.wikimedia.org/T216494 (10Gilles) I see that the repo is one commit behind master on deployment.eqiad.wmnet and when attempting to deploy: ` deploy-local faile... 
[12:20:38] (03PS1) 10Fsero: Initial debianization for envoyproxy [debs/envoyproxy] (wikimedia-stretch) - 10https://gerrit.wikimedia.org/r/491951 (https://phabricator.wikimedia.org/T215810) [12:20:58] Zoranzoki21: I don't want to deploy it until there's at least one +1 from somebody familiar with it [12:21:06] zeljkof: ok [12:21:22] skip it [12:21:27] Zoranzoki21: is there a task for 482534? [12:21:43] zeljkof: No [12:21:57] well, how come we're doing it then? :) [12:22:02] (03PS2) 10Fsero: Initial debianization for envoyproxy [debs/envoyproxy] (wikimedia-stretch) - 10https://gerrit.wikimedia.org/r/491951 (https://phabricator.wikimedia.org/T215810) [12:22:13] I mean, why are we doing the change? who requested it? [12:22:29] zeljkof: It is translating [12:22:51] Because it provides per default English translation [12:23:03] But in config no looks so [12:23:55] !log importing openstack mitaka packages to reprepro @ install1002 (T216497) [12:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:58] T216497: CloudVPS: workaround archival of jessie-backports repo - https://phabricator.wikimedia.org/T216497 [12:24:26] !log gilles@deploy1001 Started deploy [3d2png/deploy@ca39432]: l thumbor2002 [12:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:35] !log gilles@deploy1001 Finished deploy [3d2png/deploy@ca39432]: l thumbor2002 (duration: 00m 08s) [12:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:45] (03CR) 10Zfilipin: "I didn't feel comfortable deploying this during EU SWAT. Please get review from somebody familiar with the code." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/482502 (https://phabricator.wikimedia.org/T205826) (owner: 10Zoranzoki21) [12:24:46] !log gilles@deploy1001 Started deploy [3d2png/deploy@ca39432]: l thumbor2002.codfw.wmnet [12:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:51] !log gilles@deploy1001 Finished deploy [3d2png/deploy@ca39432]: l thumbor2002.codfw.wmnet (duration: 00m 04s) [12:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:39] (03CR) 10Zfilipin: "I didn't feel comfortable deploying this during EU SWAT. Please create a task describing the problem." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482534 (owner: 10Zoranzoki21) [12:26:27] (03CR) 10Fsero: "this is the content of the package" [debs/envoyproxy] (wikimedia-stretch) - 10https://gerrit.wikimedia.org/r/491951 (https://phabricator.wikimedia.org/T215810) (owner: 10Fsero) [12:26:44] !log gilles@deploy1001 Started deploy [3d2png/deploy@ca39432]: (no justification provided) [12:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:58] (03CR) 10Zfilipin: "I didn't feel comfortable deploying this during EU SWAT. Please create a task describing the problem." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482548 (owner: 10Zoranzoki21) [12:26:58] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:27:22] !log gilles@deploy1001 Finished deploy [3d2png/deploy@ca39432]: (no justification provided) (duration: 00m 38s) [12:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:22] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:30:51] !log akosiaris@deploy1001 scap-helm mathoid upgrade --recreate-pods -f mathoid-values.yaml production stable/mathoid [namespace: mathoid, clusters: eqiad,codfw] [12:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:59] !log akosiaris@deploy1001 scap-helm mathoid cluster eqiad completed [12:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:09] !log akosiaris@deploy1001 scap-helm mathoid cluster codfw completed [12:31:09] !log akosiaris@deploy1001 scap-helm mathoid finished [12:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:34] (03PS3) 10Arturo Borrero Gonzalez: openstack: use new repository for mitaka packages [puppet] - 10https://gerrit.wikimedia.org/r/491736 (https://phabricator.wikimedia.org/T216497) [12:33:01] (03PS5) 10Zfilipin: Throttle rule for National Gallery of Canada Library and Archives edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491826 (https://phabricator.wikimedia.org/T216642) (owner: 10Zoranzoki21) [12:33:31] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491826 (https://phabricator.wikimedia.org/T216642) (owner: 10Zoranzoki21) [12:33:33] !log disable puppet in cloudnet2001-dev to test T216497 [12:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:36] T216497: CloudVPS: workaround archival of jessie-backports repo - https://phabricator.wikimedia.org/T216497 [12:34:31] (03Merged) 10jenkins-bot: Throttle rule for National Gallery of Canada Library and Archives edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491826 (https://phabricator.wikimedia.org/T216642) (owner: 10Zoranzoki21) [12:37:11] i would like to restart hhvm and upgrade apach on mwmaint1002.eqiad.wmnet, 
will proceed in 10 mins if no objections [12:38:51] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:491826|Throttle rule for National Gallery of Canada Library and Archives edit-a-thon (T216642)]] (duration: 00m 53s) [12:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:55] T216642: Requesting temporary lift of IP cap for edit-a-thon - https://phabricator.wikimedia.org/T216642 [12:39:04] Zoranzoki21: 491826 deployed [12:39:06] ok [12:39:18] and 491823 now [12:39:30] It is all from me for this week :) [12:39:41] !log gilles@deploy1001 Started deploy [3d2png/deploy@ca39432]: (no justification provided) [12:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:01] !log gilles@deploy1001 Finished deploy [3d2png/deploy@ca39432]: (no justification provided) (duration: 00m 20s) [12:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:12] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491823 (https://phabricator.wikimedia.org/T216638) (owner: 10Zoranzoki21) [12:41:18] (03Merged) 10jenkins-bot: Add img.raremaps.com at wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491823 (https://phabricator.wikimedia.org/T216638) (owner: 10Zoranzoki21) [12:42:00] (03CR) 10jenkins-bot: Throttle rule for National Gallery of Canada Library and Archives edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491826 (https://phabricator.wikimedia.org/T216642) (owner: 10Zoranzoki21) [12:42:03] (03CR) 10jenkins-bot: Add img.raremaps.com at wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491823 (https://phabricator.wikimedia.org/T216638) (owner: 10Zoranzoki21) [12:42:45] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:491823|Add img.raremaps.com at wgCopyUploadsDomains (T216638)]] (duration: 00m 52s) [12:42:48] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:48] T216638: Add img.raremaps.com to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T216638 [12:43:05] Zoranzoki21: 491823 is deployed [12:43:11] Thanks! [12:43:12] RECOVERY - puppet last run on cloudvirtan1003 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [12:43:56] Zoranzoki21: please test the patches and thanks for deploying with #releng ;) [12:44:02] !log EU SWAT finished [12:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:07] * Zoranzoki21 waves [12:45:23] (03PS3) 10Mathew.onipe: elasticsearch: enable prometheus collector for nginx [puppet] - 10https://gerrit.wikimedia.org/r/491948 (https://phabricator.wikimedia.org/T216681) [12:50:03] !log restarting hhvm and updateing apache on mwmaint1002.eqiad.wmnet [12:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:54] RECOVERY - puppet last run on thumbor2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:58:12] RECOVERY - puppet last run on cloudvirtan1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190221T1300) [13:03:27] i would like to restart hhvm and upgrade apach on deploy1001.eqiad.wmnet, will proceed in 10 mins if no objections [13:06:07] (03PS4) 10Arturo Borrero Gonzalez: openstack: use new repository for mitaka packages [puppet] - 10https://gerrit.wikimedia.org/r/491736 (https://phabricator.wikimedia.org/T216497) [13:09:42] 10Operations: Audit our puppet tree for uses of jessie-backports - https://phabricator.wikimedia.org/T216711 (10MoritzMuehlenhoff) [13:18:30] !log restarting rolling upgrade on elasticsearch / cirrus / codfw - T215931 [13:18:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:33] T215931: Upgrade elasticsearch to 5.6.14 - https://phabricator.wikimedia.org/T215931 [13:18:35] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-upgrade [13:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:53] * volans hides [13:19:22] !log restarting hhvm and updateing apache on deploy1001.eqiad.wmnet [13:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:24] * gehel will find volans anywhere he runs [13:19:28] lol [13:19:51] !log gehel@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=99) [13:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:54] !log gilles@deploy1001 Started deploy [3d2png/deploy@ca39432]: (no justification provided) [13:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:01] sighs [13:20:10] !log gilles@deploy1001 Finished deploy [3d2png/deploy@ca39432]: (no justification provided) (duration: 00m 16s) [13:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:54] what stupid mistake did I do again [13:21:18] (03PS1) 10WMDE-Fisch: Show referencePreviews on group0 wikis as beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491959 (https://phabricator.wikimedia.org/T214905) [13:21:40] (03PS2) 10WMDE-Fisch: Show referencePreviews on group0 wikis as beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491959 (https://phabricator.wikimedia.org/T214905) [13:23:05] (03PS1) 10Gehel: elasticsearch: fix typo (xarg instead of xargs) [software/spicerack] - 10https://gerrit.wikimedia.org/r/491960 (https://phabricator.wikimedia.org/T207920) [13:25:52] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-upgrade [13:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:19] (03PS2) 10Jcrespo: 
mariadb: Add the option of postprocessing backups [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) [13:34:42] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add the option of postprocessing backups [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [13:37:15] (03PS5) 10Arturo Borrero Gonzalez: openstack: use new repository for mitaka packages [puppet] - 10https://gerrit.wikimedia.org/r/491736 (https://phabricator.wikimedia.org/T216497) [13:44:01] 10Operations: Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712 (10MoritzMuehlenhoff) [13:44:49] 10Operations: Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712 (10MoritzMuehlenhoff) [13:45:11] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10MoritzMuehlenhoff) [13:46:59] (03PS6) 10Arturo Borrero Gonzalez: openstack: use new repository for mitaka packages [puppet] - 10https://gerrit.wikimedia.org/r/491736 (https://phabricator.wikimedia.org/T216497) [13:47:01] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Update config and modules [software/rescue-pxe] - 10https://gerrit.wikimedia.org/r/490626 (owner: 10Filippo Giunchedi) [13:47:09] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Use hwraid packages from WMF apt repo [software/rescue-pxe] - 10https://gerrit.wikimedia.org/r/490627 (owner: 10Filippo Giunchedi) [13:47:17] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] .gitignore: more debirf profile ignores [software/rescue-pxe] - 10https://gerrit.wikimedia.org/r/490628 (owner: 10Filippo Giunchedi) [13:47:25] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Import 'packages' file into debian-amd64 [software/rescue-pxe] - 10https://gerrit.wikimedia.org/r/490629 (owner: 10Filippo 
Giunchedi) [13:47:32] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Makefile: add conveniency targets [software/rescue-pxe] - 10https://gerrit.wikimedia.org/r/490630 (owner: 10Filippo Giunchedi) [13:48:10] 10Operations, 10Toolforge, 10cloud-services-team (Kanban): Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712 (10aborrero) [13:50:38] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga2001 is CRITICAL: 1.141e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [13:51:14] (03PS7) 10Arturo Borrero Gonzalez: openstack: use new repository for mitaka packages [puppet] - 10https://gerrit.wikimedia.org/r/491736 (https://phabricator.wikimedia.org/T216497) [13:53:09] 10Operations, 10Release-Engineering-Team, 10Scap, 10Patch-For-Review: Upgrade scap debian package to 3.9.0-1 - https://phabricator.wikimedia.org/T216666 (10fgiunchedi) I've generally took care of scap uploads, though I'd appreciate if someone else can pick it up as well as I have other priorities ATM. The... 
[13:53:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "Catalog compilation as expected:" [puppet] - 10https://gerrit.wikimedia.org/r/491736 (https://phabricator.wikimedia.org/T216497) (owner: 10Arturo Borrero Gonzalez) [13:55:34] (03CR) 10Volans: [C: 03+2] elasticsearch: fix typo (xarg instead of xargs) [software/spicerack] - 10https://gerrit.wikimedia.org/r/491960 (https://phabricator.wikimedia.org/T207920) (owner: 10Gehel) [13:57:21] !log depool and reimage logstash1007 - T213898 [13:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:24] T213898: Replace and expand Elasticsearch storage in eqiad and upgrade the cluster from Debian jessie to stretch - https://phabricator.wikimedia.org/T213898 [14:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190221T1400) [14:04:45] (03PS1) 10BBlack: Revert "Depool eqsin for cr2-eqsin setup" [dns] - 10https://gerrit.wikimedia.org/r/491964 (https://phabricator.wikimedia.org/T213121) [14:07:51] 10Operations: gmail considers all Phabricator email to be spam due to missing SPF record - https://phabricator.wikimedia.org/T216714 (10LarsWirzenius) [14:12:15] (03PS43) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [14:14:16] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga2001 is OK: (C)1e+05 gt (W)1e+04 gt 9867 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [14:15:49] (03PS44) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [14:17:40] (03PS45) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) 
[14:18:06] (03CR) 10jerkins-bot: [V: 04-1] Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [14:19:44] (03PS46) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [14:20:10] (03CR) 10jerkins-bot: [V: 04-1] Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [14:20:32] 10Operations: gmail considers all Phabricator email to be spam due to missing SPF record - https://phabricator.wikimedia.org/T216714 (10hashar) [14:21:08] 10Operations, 10ops-eqsin, 10Traffic: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (10BBlack) [14:21:18] 10Operations, 10ops-eqsin, 10Traffic: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (10BBlack) [14:22:05] 10Operations, 10ops-eqsin, 10Traffic: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (10BBlack) [14:22:07] 10Operations, 10ops-eqsin, 10Traffic: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (10BBlack) [14:22:09] 10Operations, 10ops-eqsin, 10Traffic: amber light on cp5006/5007 - https://phabricator.wikimedia.org/T216691 (10BBlack) [14:22:38] (03PS47) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [14:22:46] 10Operations, 10Mail: gmail considers all Phabricator email to be spam due to missing SPF record - https://phabricator.wikimedia.org/T216714 (10hashar) I think Keith @herron is the new postmaster! 
[14:23:07] (03CR) 10jerkins-bot: [V: 04-1] Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [14:23:39] 10Operations, 10Mail: gmail considers all Phabricator email to be spam due to missing SPF record - https://phabricator.wikimedia.org/T216714 (10LarsWirzenius) Some more headers, as requested by Antoine. Received: from mx1001.wikimedia.org (mx1001.wikimedia.org. [208.80.154.76]) by mx.google.co... [14:25:45] (03PS48) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [14:26:12] (03CR) 10jerkins-bot: [V: 04-1] Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [14:27:13] (03PS49) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [14:30:52] 10Operations, 10Mail: gmail considers all Phabricator email to be spam due to missing SPF record - https://phabricator.wikimedia.org/T216714 (10hashar) From DNS: ` $ dig +short TXT phabricator.wikimedia.org "v=spf1 mx ip4:10.64.32.150 ip6:2620:0:861:103:10:64:32:150 -all" ` The IP addresses in the SPF record... [14:33:15] (03Abandoned) 10Mathew.onipe: elasticsearch: enable prometheus collector for nginx [puppet] - 10https://gerrit.wikimedia.org/r/491948 (https://phabricator.wikimedia.org/T216681) (owner: 10Mathew.onipe) [14:33:43] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10Ottomata) > it would be great if the storage would be decoupled from the analytics cluster through an API Well, AP... 
[14:35:02] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:35:15] (03PS2) 10Gehel: elasticsearch: fix typo (xarg instead of xargs) [software/spicerack] - 10https://gerrit.wikimedia.org/r/491960 (https://phabricator.wikimedia.org/T207920) [14:35:17] (03PS1) 10Gehel: elasticsearch: expose force_allocation_of_all_unassigned_shards() [software/spicerack] - 10https://gerrit.wikimedia.org/r/491968 [14:35:45] (03PS1) 10Elukey: role::analytics_test_cluster::coordinator: ensure hive-site.xml in HDFS [puppet] - 10https://gerrit.wikimedia.org/r/491969 (https://phabricator.wikimedia.org/T212259) [14:36:42] !log bmansurov@deploy1001 Started deploy [recommendation-api/deploy@600e689]: Update to 0bb0a07626a74e0aea6dfbad669c31f76fc73365 [14:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:01] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::coordinator: ensure hive-site.xml in HDFS [puppet] - 10https://gerrit.wikimedia.org/r/491969 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [14:37:09] !log restart vhtcpd on cp5002 to debug multicast loss [14:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:17] (03PS1) 10Gehel: elasticsearch: add cookbook to force allocation of all shards [cookbooks] - 10https://gerrit.wikimedia.org/r/491970 [14:37:35] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 (10Marostegui) [14:37:52] (03PS2) 10Gehel: elasticsearch: add cookbook to force allocation of all shards [cookbooks] - 10https://gerrit.wikimedia.org/r/491970 [14:38:46] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:38:57] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Replace and expand Elasticsearch storage in eqiad and upgrade the cluster from Debian jessie to stretch - https://phabricator.wikimedia.org/T213898 (10fgiunchedi) [14:39:28] 10Operations, 10Mail: gmail considers all Phabricator email to be spam due to missing SPF record - https://phabricator.wikimedia.org/T216714 (10MarcoAurelio) Could this be happening to some addresses and not others? I've been checking my most recent Phabricator emails and none of them arrived to spam, and all... [14:39:33] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: expose force_allocation_of_all_unassigned_shards() [software/spicerack] - 10https://gerrit.wikimedia.org/r/491968 (owner: 10Gehel) [14:40:51] (03Abandoned) 10Gehel: elasticsearch: expose force_allocation_of_all_unassigned_shards() [software/spicerack] - 10https://gerrit.wikimedia.org/r/491968 (owner: 10Gehel) [14:41:27] (03PS3) 10Gehel: elasticsearch: add cookbook to force allocation of all shards [cookbooks] - 10https://gerrit.wikimedia.org/r/491970 [14:41:42] !log bmansurov@deploy1001 Finished deploy [recommendation-api/deploy@600e689]: Update to 0bb0a07626a74e0aea6dfbad669c31f76fc73365 (duration: 04m 59s) [14:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:40] (03PS1) 10Mathew.onipe: tlsproxy: add prometheus exporter option [puppet] - 10https://gerrit.wikimedia.org/r/491972 (https://phabricator.wikimedia.org/T216681) [14:44:54] (03PS1) 10Elukey: Move ensure hive-site.xml from (test) hadoo coord to ui [puppet] - 10https://gerrit.wikimedia.org/r/491973 (https://phabricator.wikimedia.org/T212259) [14:46:00] (03CR) 10Elukey: [C: 03+2] Move ensure hive-site.xml from (test) hadoo coord to ui [puppet] - 10https://gerrit.wikimedia.org/r/491973 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [14:46:11] (03CR) 10Ottomata: [C: 03+1] "Hm, not sure 
either. It isn't really 'hadoop' that might require a refinery log dir, but weather or not we schedule refinery jobs to run " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/491909 (https://phabricator.wikimedia.org/T212386) (owner: 10Elukey) [14:46:30] (03CR) 10Mathew.onipe: [C: 03+1] "Nice!" [cookbooks] - 10https://gerrit.wikimedia.org/r/491970 (owner: 10Gehel) [14:46:45] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, three small remarks related to the comments/docs, code-wise good to merge!" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [14:48:21] (03CR) 10Gehel: [C: 03+2] elasticsearch: add cookbook to force allocation of all shards [cookbooks] - 10https://gerrit.wikimedia.org/r/491970 (owner: 10Gehel) [14:50:33] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [14:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:37] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [14:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:31] (03PS1) 10Sbisson: Welcome survey: add a control group to viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491975 (https://phabricator.wikimedia.org/T216669) [14:54:10] this looks really cool as a way to do technical interviews http://www.opterview.com/ [14:58:59] (03PS3) 10Andrew Bogott: imagemagick: Resolve version conflicts between toolforge and prod [puppet] - 10https://gerrit.wikimedia.org/r/491837 (https://phabricator.wikimedia.org/T216506) [14:59:20] (03Abandoned) 10Muehlenhoff: Absent libpcre3-dbg from hhvm::debug [puppet] - 10https://gerrit.wikimedia.org/r/491462 (https://phabricator.wikimedia.org/T176370) (owner: 10Muehlenhoff) [15:00:16] (03CR) 10Andrew Bogott: [C: 03+2] imagemagick: Resolve version conflicts between toolforge and prod [puppet] - 
10https://gerrit.wikimedia.org/r/491837 (https://phabricator.wikimedia.org/T216506) (owner: 10Andrew Bogott) [15:05:50] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one comment, but good to go" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/490836 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [15:07:46] !log migrating ES shards away from logstash100[456] with "cluster.routing.allocation.exclude._name" : "logstash1004-production-logstash-eqiad,logstash1005-production-logstash-eqiad,logstash1006-production-logstash-eqiad” T214608 [15:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:52] T214608: rack/setup/install logstash101[012].eqiad.wmnet - https://phabricator.wikimedia.org/T214608 [15:12:03] (03PS2) 10Mathew.onipe: tlsproxy: add prometheus exporter option [puppet] - 10https://gerrit.wikimedia.org/r/491972 (https://phabricator.wikimedia.org/T216681) [15:15:13] (03PS5) 10Ammarpad: Set wgArticleCountMethod='any' for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487115 [15:15:42] (03PS4) 10Ammarpad: Increase default thumb size to 260px on Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490395 (https://phabricator.wikimedia.org/T215106) [15:15:59] (03PS12) 10Ammarpad: Add 'Author' namespace in Sanskrit Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553) [15:16:42] 10Operations, 10Proton, 10Core Platform Team Backlog (Watching / External), 10Reading-Infrastructure-Team-Backlog (Kanban), 10Services (watching): Proton fails with Chromium 72.0.3626.96 - https://phabricator.wikimedia.org/T216493 (10MoritzMuehlenhoff) >>! In T216493#4969555, @MSantos wrote: > I think we... 
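For reference, the 15:07:46 !log entry above moves shards off nodes by name via the `cluster.routing.allocation.exclude._name` setting. A minimal sketch of the request body for Elasticsearch's cluster settings API follows. The helper name `exclusion_settings` is hypothetical, and the choice of `transient` scope is an assumption: the log does not say which scope was used, nor how the request was sent.

```python
import json


def exclusion_settings(node_names, persistent=False):
    """Build a cluster-settings body that excludes the given nodes by name.

    Hypothetical helper: the scope actually used in the log is unknown,
    so "transient" is only an assumed default.
    """
    scope = "persistent" if persistent else "transient"
    return {
        scope: {
            # Elasticsearch expects a comma-separated list of node names.
            "cluster.routing.allocation.exclude._name": ",".join(node_names)
        }
    }


# Node names as quoted in the 15:07:46 log entry.
nodes = [
    "logstash1004-production-logstash-eqiad",
    "logstash1005-production-logstash-eqiad",
    "logstash1006-production-logstash-eqiad",
]
body = exclusion_settings(nodes)
print(json.dumps(body, indent=2))
```

The printed JSON is what would be sent as the body of a `PUT _cluster/settings` request; clearing the exclusion afterwards would mean setting the same key to `null`.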
[15:18:04] (03PS1) 10Herron: phabricator: udpate spf record with phab[12]001 current ipv6 addrs [dns] - 10https://gerrit.wikimedia.org/r/491984 (https://phabricator.wikimedia.org/T216714) [15:22:59] (03CR) 10Alexandros Kosiaris: [C: 04-1] phabricator: udpate spf record with phab[12]001 current ipv6 addrs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/491984 (https://phabricator.wikimedia.org/T216714) (owner: 10Herron) [15:23:17] !log installing krb5 updates for jessie [15:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:15] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2050 - https://phabricator.wikimedia.org/T216670 (10Papaul) a:05Papaul→03Marostegui disk replaced [15:27:56] 10Operations, 10monitoring, 10Patch-For-Review, 10User-CDanis, 10User-fgiunchedi: Better organization for SRE grafana dashboards - https://phabricator.wikimedia.org/T178690 (10akosiaris) @fgiunchedi I had a stab at a RED pattern dashboard for mathoid. Let me know what you think. https://grafana.wikimed... [15:28:07] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2050 - https://phabricator.wikimedia.org/T216670 (10Marostegui) Thanks! 
` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 2% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Rebuilding) ` [15:28:54] (03PS1) 10Muehlenhoff: Add library hint for krb5 [puppet] - 10https://gerrit.wikimedia.org/r/491986 [15:29:00] (03CR) 10Herron: phabricator: udpate spf record with phab[12]001 current ipv6 addrs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/491984 (https://phabricator.wikimedia.org/T216714) (owner: 10Herron) [15:30:07] please stop using the word 'spf', I've already been pinged too many times :( [15:32:45] SPF|Cloud: not sure if that's a joke or not, but umm your own fault for naming yourself after a common internet standard [15:32:56] :P [15:34:45] bawolff: I'm well aware of the importance of SPF records, just wished they weren't called 'SPF' ;-) [15:35:59] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for krb5 [puppet] - 10https://gerrit.wikimedia.org/r/491986 (owner: 10Muehlenhoff) [15:47:16] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga2001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:48:38] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:49:19] XioNoX: ? [15:50:43] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Move cloudvirt1024 to 10Gb ethernet - https://phabricator.wikimedia.org/T216724 (10Andrew) a:03RobH [15:52:08] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10Milimetric) >>! In T211881#4954092, @Jhernandez wrote: > I personally fail to see how this would be refac... 
[15:52:09] !log gehel@cumin2001 END (ERROR) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=97) [15:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:16] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:53:24] (03PS1) 10Alexandros Kosiaris: Followup for a2b0fdbfc60a5d1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/491989 [15:56:04] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:57:19] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-upgrade [15:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:04] PROBLEM - Host mr1-esams.oob is DOWN: CRITICAL - Time to live exceeded (164.138.24.90) [15:58:12] PROBLEM - Host mr1-codfw.oob is DOWN: CRITICAL - Time to live exceeded (216.117.46.36) [15:58:36] PROBLEM - puppet last run on cloudvirtan1002 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 5 minutes ago with 7 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Service[rsyslog],Exec[x509-bundle labvirt-star.eqiad.wmnet-chained],Exec[x509-bundle labvirt-star.eqiad.wmnet-chain] [15:58:38] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10GTirloni) [16:00:48] mhhh the TTL exceeded is curious, XioNoX known? 
mr1-esams.oob and mr1-codfw.oob in icinga [16:01:23] usually TTL exceed means a routing loop somewhere [16:01:52] he's probably already offline tho [16:01:57] (03PS2) 10Herron: phabricator: udpate spf record with phab[12]001 current addrs [dns] - 10https://gerrit.wikimedia.org/r/491984 (https://phabricator.wikimedia.org/T216714) [16:02:01] no maintenance afaics [16:02:42] nope [16:03:08] I can reach them from the outside world, so something here [16:03:22] RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 116.48 ms [16:03:30] RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 2.74 ms [16:03:35] something temporary, I guess! [16:03:48] indeed, I was gonna say "from icinga2001 I can reach them too" [16:03:49] I didn't get a look from icinga1001 pov before it recovered :/ [16:03:51] (03PS1) 10Gehel: elasticsearch: provide a bretter datetime format example [cookbooks] - 10https://gerrit.wikimedia.org/r/491991 [16:04:04] isn't icinga1001 still down? [16:04:15] oh right, 2001 then :) [16:04:50] dumb q, what's 'oob' in this context? [16:04:55] out-of-band [16:05:13] separate management network? [16:05:13] those are IPs we can reach into the mgmt routers at each site, without using wmf networks to get there, in an emergency [16:05:17] got it [16:05:35] (03CR) 10Volans: elasticsearch: provide a bretter datetime format example (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/491991 (owner: 10Gehel) [16:09:54] (03CR) 10Anomie: [C: 03+1] "While probably a good idea in general, I note this won't help with the specific case in T216656." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/491901 (https://phabricator.wikimedia.org/T216656) (owner: 10Marostegui) [16:11:01] (03CR) 10Marostegui: "> While probably a good idea in general, I note this won't help with" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491901 (https://phabricator.wikimedia.org/T216656) (owner: 10Marostegui) [16:11:51] (03PS2) 10Gehel: elasticsearch: provide a better datetime format example [cookbooks] - 10https://gerrit.wikimedia.org/r/491991 [16:11:58] 10Operations, 10Mail, 10Patch-For-Review: gmail considers all Phabricator email to be spam due to missing SPF record - https://phabricator.wikimedia.org/T216714 (10herron) >>! In T216714#4971931, @hashar wrote: > Side question, wouldn't it be sufficient to just whitelist the mx1001 relay instead of each indi... [16:12:08] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:12:51] (03CR) 10Gehel: elasticsearch: provide a better datetime format example (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/491991 (owner: 10Gehel) [16:13:00] (03CR) 10Herron: [C: 03+2] phabricator: udpate spf record with phab[12]001 current addrs [dns] - 10https://gerrit.wikimedia.org/r/491984 (https://phabricator.wikimedia.org/T216714) (owner: 10Herron) [16:13:24] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:14:00] (03PS1) 10Herron: Revert "phabricator: udpate spf record with phab[12]001 current addrs" [dns] - 10https://gerrit.wikimedia.org/r/491995 [16:14:09] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - 
https://phabricator.wikimedia.org/T211881 (10Bawolff) > The graphoid service fetches the graph from the mediawiki API using essentially mediawiki page... [16:14:41] (03CR) 10Herron: [C: 03+2] Revert "phabricator: udpate spf record with phab[12]001 current addrs" [dns] - 10https://gerrit.wikimedia.org/r/491995 (owner: 10Herron) [16:15:09] (03PS1) 10Herron: phabricator: udpate spf record with phab[12]001 current addrs [dns] - 10https://gerrit.wikimedia.org/r/491996 (https://phabricator.wikimedia.org/T216714) [16:15:36] (03PS3) 10Gehel: elasticsearch: provide a better datetime format example [cookbooks] - 10https://gerrit.wikimedia.org/r/491991 [16:17:08] !log updating scap3 to 3.9.0-1 [16:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:18] (03PS2) 10Herron: phabricator: udpate spf record with phab[12]001 current addrs [dns] - 10https://gerrit.wikimedia.org/r/491996 (https://phabricator.wikimedia.org/T216714) [16:17:35] !log uploading scap3 3.9.0.1 package to trusty, jessie and stretch [16:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:56] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga2001 is CRITICAL: 1.028e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [16:18:20] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [16:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:26] (03CR) 10Herron: [C: 03+2] "reverted this before deploying via authdns-update. 
syntax is 'ip4' not 'ipv4'" [dns] - 10https://gerrit.wikimedia.org/r/491984 (https://phabricator.wikimedia.org/T216714) (owner: 10Herron) [16:19:34] (03CR) 10Herron: [C: 03+2] phabricator: udpate spf record with phab[12]001 current addrs [dns] - 10https://gerrit.wikimedia.org/r/491996 (https://phabricator.wikimedia.org/T216714) (owner: 10Herron) [16:20:20] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [16:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:59] !log uploading scap3 3.9.0.1 package to trusty, jessie and stretch T216666 [16:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:02] T216666: Upgrade scap debian package to 3.9.0-1 - https://phabricator.wikimedia.org/T216666 [16:22:25] 10Operations, 10Release-Engineering-Team, 10Scap, 10Patch-For-Review: Upgrade scap debian package to 3.9.0-1 - https://phabricator.wikimedia.org/T216666 (10fsero) done now :) [16:23:53] !log updated phabricator.wikimedia.org spf record T216714 [16:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:57] T216714: gmail considers all Phabricator email to be spam due to missing SPF record - https://phabricator.wikimedia.org/T216714 [16:24:48] RECOVERY - puppet last run on cloudvirtan1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:01] mmm so mirror maker is lagging for cirrusSearchElasticaWrite events [16:26:13] dcausse --^ [16:26:17] known? [16:26:30] we have seen yesterday some issues with eventbus and large events [16:26:56] topic mediawiki.job.cirrusSearchElasticaWrite. MessageSizeTooLargeError: MESSAGE_SIZE_TOO_LARGE [16:27:05] this happened recently [16:27:09] Cc: gehel [16:27:22] not sure if already known or not [16:27:46] elukey: for today, that could be the current cluster restart that has an influence [16:27:54] do you have a timing for those events? 
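On the 16:18:26 note that the SPF syntax is 'ip4' not 'ipv4': a small pre-merge check could catch that kind of slip. The sketch below is only an illustration, not what the dns repo actually runs. It compares each term's name against the mechanism names defined in RFC 7208 and skips modifiers (`name=value`), since unrecognized modifiers are ignored by the spec. The function name `unknown_spf_terms` and the "bad" record are made up for the example; it does not validate the addresses themselves.

```python
import re

# Mechanism names defined by RFC 7208; "ipv4" and "ipv6" are not among them.
SPF_MECHANISMS = {"all", "include", "a", "mx", "ptr", "ip4", "ip6", "exists"}

# Optional qualifier, then a name, then whatever follows (":value", "/cidr", "=value").
TERM_RE = re.compile(r"^[+\-~?]?([A-Za-z][A-Za-z0-9_.\-]*)(.*)$")


def unknown_spf_terms(record):
    """Return the terms of an SPF record whose mechanism name is not recognized."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        raise ValueError("record does not start with v=spf1")
    unknown = []
    for term in terms[1:]:
        match = TERM_RE.match(term)
        if match is None:
            unknown.append(term)
            continue
        name, rest = match.group(1).lower(), match.group(2)
        if rest.startswith("="):
            continue  # modifier (e.g. redirect=); unknown modifiers are ignored
        if name not in SPF_MECHANISMS:
            unknown.append(term)
    return unknown


# Record as quoted from DNS earlier in the log.
good = "v=spf1 mx ip4:10.64.32.150 ip6:2620:0:861:103:10:64:32:150 -all"
# Illustrative record with the 'ipv4' slip.
bad = "v=spf1 mx ipv4:10.64.32.150 -all"

print(unknown_spf_terms(good))
print(unknown_spf_terms(bad))
```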
[16:29:56] it's almost certainly the cluster restart, or more specifically backing up writes into the job queue while performing the restart [16:30:38] what's the limit, 4mb of compressed data? [16:30:45] yeah, and due to a weird bug in tqdm / colorama, we're actually stopping writes for longer than expected [16:30:59] 10Operations, 10Traffic: Content purges are unreliable - https://phabricator.wikimedia.org/T133821 (10Bawolff) [16:31:13] but I would expect this additional queuing to produce more messages, not larger ones [16:31:20] ebernhardson: what am I missing? [16:31:37] gehel: we don't generally create those jobs, they run in process. They only get queued on failure [16:32:17] gehel: so basically, the large wiki page updates generally go straight to elastic and never into the kafka job queue to trigger this issue. [16:32:43] so the message size too large is seen because we're queuing, but we still need to address what we do with too large messages when we do need to queue? [16:33:14] gehel: right, basically the switch from redis to kafka-based job queue changed the underlying assumptions about being able to stuff a few hundred MB/min into a queue to pull back out later [16:33:40] not only the throughput, but the individual message size [16:33:47] (i'm guessing on rate, but 100MB/min seems a plausible guess) [16:33:57] gehel: for mirror maker the timing is clear with the consumer lag https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw%20prometheus%2Fops&var-lag_datasource=eqiad%20prometheus%2Fops&var-mirror_name=main-eqiad_to_main-codfw&refresh=5m&panelId=5&fullscreen&orgId=1&from=now-3h&to=now [16:34:24] the MESSAGE_SIZE_etc.. it is related to the 4mb yes dcausse [16:34:53] i wonder how it gets all the way to 4MB though, choosing a random large page (world war II on enwiki) the doc is only ~200kB [16:35:23] elukey: is it 4mb compressed or uncompressed?
[16:35:30] elukey: yep, the lag is matching the cluster upgrade [16:36:06] volans: any chance we could merge https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/481858 as a quick fix? [16:36:17] dcausse: mmm it should be kafka message payload, that in theory should come either compressed or not to kafka (IIRC), so the limit should apply to what gets to kafka [16:36:30] and I currently don't recall if eventbus compresses messages [16:37:03] gehel: reading backlog [16:37:29] volans: short version, we're overloading kafka / mirrormaker with the elasticsearch cluster restart [16:37:36] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:38:12] and that tqdm / colorama bug is slowing down the process enough that instead of freezing writes for 5' max, we're freezing them for 10-20' [16:38:23] !log gehel@cumin2001 END (ERROR) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=97) [16:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:31] why does an elastic cluster restart result in larger messages? [16:38:33] gehel: ack, I cannot guarantee that works, can I apply it directly on 2001? [16:38:44] volans: sure! [16:38:50] gehel: so for eventbus, it is already recovered, it was a brief set of 400s [16:38:52] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:39:00] and for mirror maker, it is basically a consumer lag issue [16:39:03] not a big deal [16:39:06] ottomata: the messages are not normally generated.
Under normal operating conditions we run the job in process, it succeeds, and kafka is not involved [16:39:07] ottomata: it's a special queue that is being used to store lost updates [16:39:15] volans: you have some time, the cluster needs some time to go back to green before the next batch anyway [16:39:29] ottomata: when doing a cluster restart we back up all writes that are rejected into the job queue to be retried [16:39:41] ah ok, and some of those are too large [16:40:14] we had a ticket open somewhere to freeze writes by stopping the consumers instead of re-enqueuing the messages [16:40:19] ottomata: do we compress events in eventbus when sending them to kafka? [16:40:23] which would be much more efficient in this context [16:40:33] elukey: i think the client uses snappy compression, let me check [16:40:34] gehel: go ahead [16:40:34] ^ this would solve all the problems [16:40:38] the short answer is basically with the job queue swapped out we have to re-engineer the process [16:40:44] the underlying assumptions changed [16:40:48] yup [16:40:49] compression_type=snappy [16:40:55] dcausse: --^ [16:41:03] compressed ) [16:41:04] :) [16:41:05] !log applied hot band-aid patch to spicerack/remote.py on cumin2001 ( https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/481858 ) [16:41:06] ok :) [16:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:25] what gehel said about pausing consumers is the solution imo [16:41:48] the cookbook should be able to control changeprop [16:42:05] sounds reasonable [16:42:13] I can't find that ticket :/ [16:42:30] ottomata: I also think that the eventbus alarm should end up in here rather than only analytics, what do you think?
(for visibility) [16:43:12] we still have instant updates for wikibase tho :/ [16:43:29] elukey: +1 [16:43:36] i think there is a todo comment on that alert to make it so [16:43:36] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10EvanProdromou) Thanks @jijiki and @Joe Assuming thos... [16:43:47] (03PS2) 10Cwhite: admin: add Runa Bhattacharjee to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/489776 (https://phabricator.wikimedia.org/T215576) [16:45:42] 10Operations, 10Discovery, 10Discovery-Search, 10monitoring, 10Wikimedia-Incident: Alert when ES indexes are freezed for more than 30 minutes - https://phabricator.wikimedia.org/T110171 (10Gehel) Already implemented in https://gerrit.wikimedia.org/r/c/operations/puppet/+/431754 as part of T193605 [16:45:44] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, 10Epic: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089 (10Gehel) [16:45:47] 10Operations, 10Discovery, 10Discovery-Search, 10monitoring, 10Wikimedia-Incident: Alert when ES indexes are freezed for more than 30 minutes - https://phabricator.wikimedia.org/T110171 (10Gehel) 05Open→03Resolved a:03Gehel [16:45:57] ottomata: ack, will look into it asap, seems really important [16:46:03] gehel: wait a second, I'm testing something, not sure it works as expected [16:46:22] (03PS2) 10Ottomata: Add EventBus multi endpoint configuration and add eventgate-analytics endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490418 (https://phabricator.wikimedia.org/T211247) [16:46:31] * gehel is wating [16:46:47] hah, i finally found a 4MB wiki page: 
https://zh.wikisource.org/wiki/%E9%80%9A%E5%BF%97_(%E5%9B%9B%E5%BA%AB%E5%85%A8%E6%9B%B8%E6%9C%AC)/%E5%85%A8%E8%A6%BD4?action=cirrusdump [16:47:09] (03CR) 10jerkins-bot: [V: 04-1] Add EventBus multi endpoint configuration and add eventgate-analytics endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490418 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [16:47:11] i dont know what we can do about it though :P they are just big and json wants to encode utf-8 as verbosely as possible [16:47:29] unicode encodings are unfair :) [16:47:55] gehel: green light, go ahead [16:47:59] progress bars are still printed [16:48:03] but not output [16:48:07] it should help [16:48:31] volans: still waiting on cluster recovery as well [16:49:06] for context, I was stupid and testing with default dry_run=True and I was puzzled [16:49:13] all works as expected [16:49:18] * gehel is taking a break (cooking + family dinner) [16:49:27] will restart the cookbook when done [16:49:38] (03PS3) 10Ottomata: Add EventBus multi endpoint configuration and add eventgate-analytics endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490418 (https://phabricator.wikimedia.org/T211247) [16:49:39] gehel: also which DC are you restarting? 
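The 16:47:11 remark about JSON encoding being verbose can be made concrete. The sketch below uses Python's `json` module purely as an illustration; whether the real job serialization escapes non-ASCII this way is an assumption, not something the log confirms. With ASCII-only output each CJK character becomes a six-byte `\uXXXX` escape, versus three bytes as raw UTF-8, so the same page text roughly doubles in size before compression. The function name `encoded_sizes` and the sample text are made up for the example.

```python
import json


def encoded_sizes(text):
    """Return (escaped_bytes, raw_utf8_bytes) for a one-field JSON document."""
    # Default ensure_ascii=True turns each non-ASCII character into \uXXXX.
    escaped = json.dumps({"text": text})
    # ensure_ascii=False keeps the characters as-is; they are then UTF-8 encoded.
    raw = json.dumps({"text": text}, ensure_ascii=False)
    return len(escaped.encode("utf-8")), len(raw.encode("utf-8"))


# 2000 CJK characters standing in for a large zh.wikisource page body.
sample = "通志" * 1000
escaped_size, raw_size = encoded_sizes(sample)
print(escaped_size, raw_size)  # 12012 6012
```

If the escaping assumption holds, a page whose text is about 2 MB of UTF-8 would already be about 4 MB as an escaped JSON payload, which is consistent with only very large CJK pages hitting the limit.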
[16:50:04] !log eqsin: restarting all varnish backends to wipe cache after purge loss (site currently depooled) [16:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:13] volans: codfw [16:50:18] ok [16:50:20] (03CR) 10CDanis: [C: 03+1] admin: add Runa Bhattacharjee to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/489776 (https://phabricator.wikimedia.org/T215576) (owner: 10Cwhite) [16:50:38] (03CR) 10jerkins-bot: [V: 04-1] Add EventBus multi endpoint configuration and add eventgate-analytics endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490418 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [16:51:13] * onimisionipe thinks we will needing a lot of docs on elastic search cluster operations [16:51:26] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga2001 is OK: (C)1e+05 gt (W)1e+04 gt 11 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [16:51:29] 10Operations, 10monitoring, 10Goal, 10Patch-For-Review: Upgrade production prometheus-node-exporter to >= 0.16 - https://phabricator.wikimedia.org/T213708 (10colewhite) On further investigation, the log messages appear to be from the shebang of the [[ https://github.com/prometheus/node_exporter/blob/v0.17.... [16:52:36] 10Operations, 10ops-eqsin, 10Traffic: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (10RobH) Please note that Dell support typically requires the following steps to be taken for any memory replacement: * Update bios firmware on host to latest revision ** current version is 2.9.1,... 
[16:52:37] (03PS4) 10Ottomata: Add EventBus multi endpoint configuration and add eventgate-analytics endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490418 (https://phabricator.wikimedia.org/T211247) [16:52:41] 10Operations, 10ops-eqsin, 10Traffic: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (10RobH) Please note that Dell support typically requires the following steps to be taken for any memory replacement: * Update bios firmware on host to latest revision ** current version is 2.9.1,... [16:52:43] (03CR) 10Cwhite: [C: 03+2] admin: add Runa Bhattacharjee to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/489776 (https://phabricator.wikimedia.org/T215576) (owner: 10Cwhite) [16:52:50] (03PS3) 10Cwhite: admin: add Runa Bhattacharjee to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/489776 (https://phabricator.wikimedia.org/T215576) [16:54:12] 10Operations, 10ops-codfw: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 (10Papaul) @fgiunchedi is it possible to depool this server for me to do a firmware upgrade before I resolve the task? [16:54:50] (03PS6) 10Paladox: Gerrit: Remove socket config from log4j [puppet] - 10https://gerrit.wikimedia.org/r/490797 [16:57:16] 10Operations, 10monitoring, 10Goal, 10Patch-For-Review: Upgrade production prometheus-node-exporter to >= 0.16 - https://phabricator.wikimedia.org/T213708 (10CDanis) Looks like -n is/was a gawk option? -n --non-decimal-data Enable automatic interpretation of octal and hexadecimal values in input data (see... [16:58:38] PROBLEM - puppet last run on cloudvirtan1002 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 5 minutes ago with 6 failures. Failed resources (up to 3 shown): Exec[ip addr add 2620:0:861:118:10:64:20:45/64 dev eth0],Exec[absent_ensure_members],Exec[ops_ensure_members],Exec[wikidev_ensure_members] [17:00:05] godog and _joe_: #bothumor Q:Why did functions stop calling each other? 
A:They had arguments. Rise for Puppet SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190221T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. [17:00:41] (03CR) 10Ottomata: [C: 04-1] "-1 for now, we are going to do the migration to using new EventServices config for just eventbus everywhere before we add new eventgate se" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490418 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [17:08:37] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10RobH) ` /admin1-> racadm getsel Record: 1 Date/Time: 02/12/2019 21:44:15 Source: system Severity: Ok Description: Log cleared. --------------------------------------... [17:10:25] !log rebooting cp5006 to flash bios in memory troubleshooting steps via T216717 [17:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:28] T216717: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 [17:11:53] !log eqsin: restarting all varnish frontends to wipe cache after purge loss (site currently depooled) (skipping 5006/7 since they're being rebooted for bios flashing anyways) [17:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:48] heh, i havent taken down cp5006 yet, im updating the ilom first for reliability of bios update [17:13:54] doign that on both of them first [17:14:28] yeah np [17:15:10] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga2001 is CRITICAL: cluster={cache_text,cache_upload} site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:20:08] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga2001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:27:37] ok, cp5006 reboot now [17:27:43] 10Operations, 10Analytics, 10Analytics-Kanban, 10hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (10elukey) [17:27:54] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (10elukey) [17:29:53] 10Operations, 10Proton, 10Core Platform Team Backlog (Watching / External), 10Reading-Infrastructure-Team-Backlog (Kanban), 10Services (watching): Proton fails with Chromium 72.0.3626.96 - https://phabricator.wikimedia.org/T216493 (10MSantos) @MoritzMuehlenhoff I understand, the thing is that Puppeteer i... [17:29:59] (03CR) 10Alexandros Kosiaris: "Aside from the TODOs and the WIP status, lgtm" [dns] - 10https://gerrit.wikimedia.org/r/491860 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [17:30:22] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10Bstorm) [17:31:00] PROBLEM - Host cp5006 is DOWN: PING CRITICAL - Packet loss = 100% [17:31:04] 10Operations, 10Analytics, 10Analytics-Kanban, 10hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (10elukey) @RobH I'd like to establish the next steps for this task, so we can order a new GPU and test it as soon as possible, since a lot of the new Fiscal Year plannin... 
[17:36:34] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [17:36:38] PROBLEM - IPsec on cp1084 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5006_v4, cp5006_v6 [17:36:42] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [17:36:42] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [17:36:44] PROBLEM - IPsec on cp1076 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5006_v4, cp5006_v6 [17:36:48] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [17:37:02] PROBLEM - IPsec on cp1082 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5006_v4, cp5006_v6 [17:37:04] PROBLEM - IPsec on cp1080 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5006_v4, cp5006_v6 [17:37:12] PROBLEM - IPsec on cp1088 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5006_v4, cp5006_v6 [17:37:14] PROBLEM - IPsec on cp1090 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5006_v4, cp5006_v6 [17:37:14] PROBLEM - IPsec on cp1086 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5006_v4, cp5006_v6 [17:37:22] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [17:37:22] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [17:37:24] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [17:37:32] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [17:37:32] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [17:37:36] PROBLEM - IPsec on cp1078 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5006_v4, cp5006_v6 [17:37:36] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [17:37:36] PROBLEM 
- IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [17:37:42] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [17:40:18] 10Operations, 10Analytics, 10Analytics-Kanban, 10Wikimedia-Stream, 10Services (watching): Eventstreams build is broken - https://phabricator.wikimedia.org/T216184 (10Milimetric) p:05Triage→03High [17:40:52] 10Operations, 10ops-eqsin, 10Traffic: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (10RobH) I've updated the bios to the latest revision, 2.9.1 POST shows no errors, but I'm going to wipe the SEL and run (dells) hardware test suite. [17:41:02] 10Operations, 10ops-eqsin, 10Traffic: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (10RobH) ` /admin1-> racadm getsel Record: 1 Date/Time: 07/25/2018 16:19:36 Source: system Severity: Ok Description: Log cleared. ----------------------------------------------------... [17:41:09] (03CR) 10Alexandros Kosiaris: "Aside from the WIP, looks fine to me. But don't merge please. We 'd like to use it this as training material :-D" [puppet] - 10https://gerrit.wikimedia.org/r/491861 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [17:46:11] bmansurov: FYI, I deployed your patch for adding tests to recommendation-api repo, sorry for the delay :( [17:47:03] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1020 - https://phabricator.wikimedia.org/T194855 (10Bstorm) 05Resolved→03Open Unfortunately, the icinga error has returned. https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cloud... 
[17:47:54] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-upgrade [17:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:50] gehel: keep me posted ;) [17:49:05] volans: wilco [17:51:19] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:52:21] 10Operations, 10Analytics, 10Analytics-Cluster, 10Traffic: Respect X-Forwarded-For only from trustworthy sources - https://phabricator.wikimedia.org/T56783 (10Milimetric) 05Open→03Declined >>! In T56783#2688311, @BBlack wrote: > Or is this basically now an off-topic ticket going nowhere? My money's on... [17:52:34] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:53:13] (03PS1) 10CRusnov: Add ganeti->netbox sync script [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/492007 [17:53:41] 10Operations, 10ops-eqsin, 10Traffic: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (10RobH) Ok, running task comment of steps taken: * updated bios * rebooted into hardware tests * POST shows no memory errors: ` Testing Memory... Testing Memory... 10% Complete Testing Memory...... [17:54:03] !log cp5007 rebooting into bios update and hardware testing via T216716 [17:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:06] T216716: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 [17:55:53] 10Operations, 10netops: Fix codfw x-connect 65373 - https://phabricator.wikimedia.org/T215193 (10Papaul) CyrusOne Checked the reading from the fiber patch panel in A8 same readings. So they are still going to run some test out of the cage. 
[17:56:52] (03PS1) 10MSantos: Increase number of cpus on maps2004 [puppet] - 10https://gerrit.wikimedia.org/r/492008 [17:57:28] (03PS1) 10Cmjohnson: Adding mgmt dns for labsdb1012 [dns] - 10https://gerrit.wikimedia.org/r/492009 (https://phabricator.wikimedia.org/T215231) [17:58:03] cmjohnson1: \o/ [17:58:26] PROBLEM - Host cp5007 is DOWN: PING CRITICAL - Packet loss = 100% [17:58:29] (03PS2) 10MSantos: Increase number of cpus on maps2004 [puppet] - 10https://gerrit.wikimedia.org/r/492008 [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: Dear deployers, time to do the Services – Graphoid / Parsoid / Citoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190221T1800). [18:00:19] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T216526 (10Bstorm) 05Open→03Invalid Current status is ok: ` Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Primary-1,... [18:00:26] 10Operations, 10ops-eqsin, 10Traffic: amber light on cp5006/5007 - https://phabricator.wikimedia.org/T216691 (10RobH) So I updated the bios on cp5007, and this happened in post: ` UEFI0107: One or more memory errors have occurred on memory slot: A1. Remove input power to the system, reseat the DIMM module... 
[18:03:12] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (10Cmjohnson) [18:03:26] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns for labsdb1012 [dns] - 10https://gerrit.wikimedia.org/r/492009 (https://phabricator.wikimedia.org/T215231) (owner: 10Cmjohnson) [18:04:04] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp5007_v4, cp5007_v6 [18:04:04] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp5007_v4, cp5007_v6 [18:04:14] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp5007_v4, cp5007_v6 [18:04:14] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp5007_v4, cp5007_v6 [18:04:18] PROBLEM - IPsec on cp1087 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp5007_v4, cp5007_v6 [18:04:20] PROBLEM - IPsec on cp1075 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp5007_v4, cp5007_v6 [18:04:26] PROBLEM - IPsec on cp1079 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp5007_v4, cp5007_v6 [18:04:26] PROBLEM - IPsec on cp1089 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp5007_v4, cp5007_v6 [18:04:36] PROBLEM - IPsec on cp1081 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp5007_v4, cp5007_v6 [18:04:42] PROBLEM - IPsec on cp1083 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp5007_v4, cp5007_v6 [18:04:42] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp5007_v4, cp5007_v6 [18:04:42] PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp5007_v4, cp5007_v6 [18:04:44] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp5007_v4, cp5007_v6 [18:04:50] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp5007_v4, cp5007_v6 [18:04:54] PROBLEM - IPsec on cp1077 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: 
cp5007_v4, cp5007_v6 [18:04:56] PROBLEM - IPsec on cp1085 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp5007_v4, cp5007_v6 [18:05:04] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp5007_v4, cp5007_v6 [18:05:12] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp5007_v4, cp5007_v6 [18:07:08] 10Operations, 10ops-eqsin, 10Traffic: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (10RobH) ` 3 $> ssh root@cp5007.mgmt.eqsin.wmnet root@cp5007.mgmt.eqsin.wmnet's password: /admin1-> racadm getsel Record: 1 Date/Time: 10/31/2017 14:19:03 Source: system Severity: O... [18:07:24] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:07:47] 10Operations, 10Data-Services, 10decommission, 10cloud-services-team (Kanban): Reclaim/Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet as soon as they are ready - https://phabricator.wikimedia.org/T216749 (10Bstorm) 05Open→03Stalled p:05Triage→03Normal [18:08:04] 10Operations, 10Data-Services, 10decommission, 10cloud-services-team (Kanban): Reclaim/Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet as soon as they are ready - https://phabricator.wikimedia.org/T216749 (10Bstorm) [18:09:51] I'm here, going to deploy twice for ores today [18:09:55] akosiaris: ^ [18:10:27] (03PS2) 10CRusnov: Add ganeti->netbox sync script [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/492007 [18:11:36] PROBLEM - Host ms-be1033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:11:40] 10Operations, 10ops-eqsin, 10Traffic: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (10RobH) bios update successful. I've cleared the SEL so I can launch Dell hardware testing utility. 
[18:13:26] commit to rollback in case needed: ad160b0405bd35d10aa525f3ff78fd0c23e2b10b [18:13:33] !log ladsgroup@deploy1001 Started deploy [ores/deploy@5d50713]: (no justification provided) [18:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:52] sorry I forgot to put summary [18:14:42] 10Operations, 10ops-eqsin, 10Traffic: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (10RobH) Please note the hardware testing is still running on this system. I'm monitoring its serial output, but @ayounsi shouldn't modify the system until I update (or unless he attaches a crash... [18:14:45] 10Operations, 10ops-eqsin, 10Traffic: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (10RobH) Please note the hardware testing is still running on this system. I'm monitoring its serial output, but @ayounsi shouldn't modify the system until I update (or unless he attaches a crash... [18:15:59] 10Operations, 10Recommendation-API, 10Release-Engineering-Team, 10Research, and 2 others: Recommendation API improvements - https://phabricator.wikimedia.org/T213222 (10bmansurov) [18:20:15] 10Operations, 10Recommendation-API, 10Release-Engineering-Team, 10Research, 10Services (watching): Recommendation API improvements - https://phabricator.wikimedia.org/T213222 (10bmansurov) [18:20:55] 10Operations, 10Recommendation-API, 10Release-Engineering-Team, 10Research, 10Services (watching): Recommendation API improvements - https://phabricator.wikimedia.org/T213222 (10bmansurov) 05Open→03Resolved [18:22:55] Amir1: love the confidence (commit to rollback ready) :-P [18:23:13] 10Operations, 10Office-IT, 10Wikimedia-Mailing-lists: Mailing list migration for Arbitration Committee to Google Group - https://phabricator.wikimedia.org/T215940 (10colewhite) Mbox files shared with @eross . 
[18:23:23] 10Operations, 10Office-IT, 10Wikimedia-Mailing-lists: Mailing list migration for Arbitration Committee to Google Group - https://phabricator.wikimedia.org/T215940 (10colewhite) [18:23:34] 10Operations, 10Office-IT, 10Wikimedia-Mailing-lists: Mailing list migration for Arbitration Committee to Google Group - https://phabricator.wikimedia.org/T215940 (10colewhite) a:05colewhite→03None [18:23:48] haha, It would be harder to track it down in a random note in my laptop, also it's good when I'm not around and someone else wants to revert :D [18:24:56] (03CR) 10Ppchelko: "I guess we can merge this as is even before the LVS was created for testing in labs since there's already an instance in deployment-prep" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490418 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [18:28:10] !log ladsgroup@deploy1001 Finished deploy [ores/deploy@5d50713]: (no justification provided) (duration: 14m 37s) [18:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:26] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1020 - https://phabricator.wikimedia.org/T216649 (10Andrew) 05Open→03Resolved a:03Andrew Looks good now! [18:28:27] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [18:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:03] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [18:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:46] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1020 - https://phabricator.wikimedia.org/T194855 (10Andrew) 05Open→03Resolved Looks good now. Thanks @Cmjohnson ! 
[18:30:56] !log ignore icinga1001 alerts, rebooting it into hardware tests via T214760 [18:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:59] T214760: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 [18:31:21] robh: which alerts? [18:32:07] i didnt bother putting into maint [18:32:09] in icinga is all [18:32:17] no worries, you can't ;) [18:32:18] plus its rebooting so logged to the task [18:32:24] oh, its not in monitoring? [18:32:33] the old story of being icinga itself [18:33:00] !log gehel@cumin2001 END (ERROR) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=97) [18:33:00] Okay, I'll deploy the next one [18:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:28] volans: so faidon approved the spare system allocation of icinga1002 [18:33:32] im going to work on that this afternoon [18:33:36] 10Operations, 10ops-eqiad: ms-be1033 down and not powering up - https://phabricator.wikimedia.org/T215998 (10Cmjohnson) @fgiunchedi HPE did not believe me that a motherboard swap is needed. They asked that I do a bunch of troubleshooting first. Below are the steps they asked me to do. I have replied that thei... [18:33:39] since this fix on 1001 isnt going quickly [18:34:05] also my planned day off tomorrow is canceled (the dog rescue im volunteering at needs me to come a different day for the electrical project im doing for htem) [18:34:12] so worst case ill install tomorrow AM if i get sidetracked. [18:34:24] robh: ack, same specs? [18:34:31] yeah [18:35:51] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (10Cmjohnson) a:05Cmjohnson→03ayounsi @arzhel This server needs to go into the cloud-support vlan but it's not available to me for row C.... 
[18:36:26] both are just general dual cpu misc systems [18:39:15] 10Operations, 10ops-eqiad: Disk failure on labsdb1005 - https://phabricator.wikimedia.org/T216202 (10Cmjohnson) @bstorm I swapped the disk with a used spare but this server really needs to be decommissioned...the warranty expired in 2015. [18:41:55] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 (10Cmjohnson) @Marostegui I will need to swap DIMM B3 and B7 to the A side. LMK when the server is down and ready [18:42:35] 10Operations, 10ops-eqiad: Disk failure on labsdb1005 - https://phabricator.wikimedia.org/T216202 (10Bstorm) 05Open→03Resolved a:03Bstorm T216749 working on it as soon as we don't need it for restoring some tables anymore! Soon. Thanks, looks good. [18:42:36] !log ladsgroup@deploy1001 Started deploy [ores/deploy@2d84709]: Change default task serializer of celery from pickle to json (T206333) [18:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:39] T206333: Change default serializer of celery from pickle to json - https://phabricator.wikimedia.org/T206333 [18:44:50] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 (10jcrespo) I will put it down now (it is out of service, I only need to downtime it on icinga) [18:46:36] !log shutting down db1114 T214720 [18:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:40] T214720: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 [18:48:41] (03PS5) 10Ottomata: Use EventBus multi endpoint configuration for eventbus configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490418 (https://phabricator.wikimedia.org/T211247) [18:52:13] (03PS8) 10CRusnov: Add ganeti read-only user deployment [puppet] - 10https://gerrit.wikimedia.org/r/490397 
(https://phabricator.wikimedia.org/T215229) [18:59:11] 10Operations, 10ops-eqiad: ms-be1033 down and not powering up - https://phabricator.wikimedia.org/T215998 (10Cmjohnson) The server did eventually power up, so it looks like I am eating some crow on this one. Re-connected everything and put back to normal operating standard. Booting into the OS now. [18:59:30] !log ladsgroup@deploy1001 Finished deploy [ores/deploy@2d84709]: Change default task serializer of celery from pickle to json (T206333) (duration: 16m 54s) [18:59:32] RECOVERY - Host ms-be1033 is UP: PING OK - Packet loss = 0%, RTA = 40.76 ms [18:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:33] T206333: Change default serializer of celery from pickle to json - https://phabricator.wikimedia.org/T206333 [19:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190221T1900). [19:00:04] Amir1: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:20] Actually it's not one, it's three :D [19:01:02] RECOVERY - Host ms-be1033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.18 ms [19:06:05] Amir1: are you deploying your patches or do you need a SWATer? (can't remember if you've got the powers or not) [19:06:34] PROBLEM - very high load average likely xfs on ms-be1033 is CRITICAL: CRITICAL - load average: 138.12, 107.00, 52.42 [19:06:48] (03CR) 10Ppchelko: [C: 03+1] "We can only deploy it after today train because of the RCFeed change." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/490418 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [19:07:18] (03PS1) 10Bstorm: toolsdb: Remove the word temporary from comments [puppet] - 10https://gerrit.wikimedia.org/r/492024 (https://phabricator.wikimedia.org/T216170) [19:08:31] thcipriani: I have SWAT superpowers but I'm not official SWATer, so I do it when no one from the official SWAT team is around [19:09:47] Amir1: if you're comfortable deploying your patches, go for it since you're the only person with patches. If you'd rather me deploy them, I can :) [19:10:48] thcipriani: I will do it. [19:11:09] Thanks! [19:11:15] awesome, thanks! :) [19:13:29] (03PS2) 10Ladsgroup: Set wmgWikibaseRepoIdGeneratorSeparateDbConnection to true for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491484 (https://phabricator.wikimedia.org/T215147) [19:13:42] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491484 (https://phabricator.wikimedia.org/T215147) (owner: 10Ladsgroup) [19:14:40] (03Merged) 10jenkins-bot: Set wmgWikibaseRepoIdGeneratorSeparateDbConnection to true for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491484 (https://phabricator.wikimedia.org/T215147) (owner: 10Ladsgroup) [19:16:38] (03CR) 10jenkins-bot: Set wmgWikibaseRepoIdGeneratorSeparateDbConnection to true for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491484 (https://phabricator.wikimedia.org/T215147) (owner: 10Ladsgroup) [19:17:31] 10Operations, 10Scoring-platform-team (Current), 10User-Ladsgroup: Spec out migrating ORES to kubernetes - https://phabricator.wikimedia.org/T210109 (10Ladsgroup) 05Open→03Resolved [19:19:30] 10Operations, 10Jade, 10TechCom, 10Core Platform Team Backlog (Watching / External), and 4 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381 (10Ladsgroup) [19:19:49] !log ladsgroup@deploy1001 Synchronized 
wmf-config/InitialiseSettings.php: SWAT [[gerrit:491484|Set wmgWikibaseRepoIdGeneratorSeparateDbConnection to true for wikidata (T215147)]] (duration: 00m 56s) [19:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:53] T215147: Deploy separate DB connection for ID insertions on wikidata.org - https://phabricator.wikimedia.org/T215147 [19:20:21] (03PS1) 10Thcipriani: Gerrit 2.15.10 release [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/492025 (https://phabricator.wikimedia.org/T214359) [19:21:14] (03PS1) 10Jbond: Add password reset function to ipmi module [software/spicerack] - 10https://gerrit.wikimedia.org/r/492026 [19:23:33] (03PS2) 10Ladsgroup: Change Special:ItemDisambiguation from blank special page to disabled page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491237 (https://phabricator.wikimedia.org/T216397) [19:23:42] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491237 (https://phabricator.wikimedia.org/T216397) (owner: 10Ladsgroup) [19:23:54] (03CR) 10Paladox: [C: 03+2] "LGTM on your test site." [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/492025 (https://phabricator.wikimedia.org/T214359) (owner: 10Thcipriani) [19:24:49] (03Merged) 10jenkins-bot: Change Special:ItemDisambiguation from blank special page to disabled page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491237 (https://phabricator.wikimedia.org/T216397) (owner: 10Ladsgroup) [19:25:06] RECOVERY - puppet last run on cloudvirtan1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:25:18] (03CR) 10Gehel: [C: 04-1] "Looks reasonable to me. A few tests wouldn't hurt, in particular a test that validates the format returned by the user list command." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/492026 (owner: 10Jbond) [19:25:23] (03CR) 10jerkins-bot: [V: 04-1] Add password reset function to ipmi module [software/spicerack] - 10https://gerrit.wikimedia.org/r/492026 (owner: 10Jbond) [19:25:34] [XG77HApAAC4AAARUHhEAAAAN] 2019-02-21 19:25:16: Fatal exception of type "ArgumentCountError" [19:25:37] oops [19:25:49] (03CR) 10Gehel: [C: 03+2] Increase number of cpus on maps2004 [puppet] - 10https://gerrit.wikimedia.org/r/492008 (owner: 10MSantos) [19:25:57] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-upgrade [19:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:05] (03PS1) 10Ladsgroup: Revert "Change Special:ItemDisambiguation from blank special page to disabled page" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492028 [19:26:26] (03CR) 10Ladsgroup: [C: 03+2] Revert "Change Special:ItemDisambiguation from blank special page to disabled page" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492028 (owner: 10Ladsgroup) [19:27:10] PROBLEM - Host db1114.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:27:27] (03Merged) 10jenkins-bot: Revert "Change Special:ItemDisambiguation from blank special page to disabled page" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492028 (owner: 10Ladsgroup) [19:27:46] (03CR) 10jenkins-bot: Change Special:ItemDisambiguation from blank special page to disabled page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491237 (https://phabricator.wikimedia.org/T216397) (owner: 10Ladsgroup) [19:27:48] (03CR) 10jenkins-bot: Revert "Change Special:ItemDisambiguation from blank special page to disabled page" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492028 (owner: 10Ladsgroup) [19:27:56] (03PS3) 10Ladsgroup: Drop obsolete Wikibase configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491506 (https://phabricator.wikimedia.org/T213713) [19:28:44] 10Operations, 10ops-eqiad, 10DBA, 
10Patch-For-Review: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 (10Cmjohnson) Before DIMM Swap racadm log /admin1-> racadm getsel Record: 1 Date/Time: 11/04/2017 15:21:07 Source: system Severity: Ok Description: Log cleared... [19:29:25] (03PS1) 10Gehel: maps: ncpu_ratio is a float, not an int [puppet] - 10https://gerrit.wikimedia.org/r/492030 [19:29:38] (03CR) 10Volans: [C: 04-1] "Nice! Thanks a lot for the patch. LGTM in general, I didn't test the reset password IPMI command yet." [software/spicerack] - 10https://gerrit.wikimedia.org/r/492026 (owner: 10Jbond) [19:30:17] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491506 (https://phabricator.wikimedia.org/T213713) (owner: 10Ladsgroup) [19:30:34] RECOVERY - Host db1114.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.80 ms [19:30:35] (03CR) 10Gehel: [C: 03+2] maps: ncpu_ratio is a float, not an int [puppet] - 10https://gerrit.wikimedia.org/r/492030 (owner: 10Gehel) [19:31:16] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 (10Cmjohnson) @jynus @marostegui I swapped DIMM B3 to A3 and B7 to A7 and cleared the idrac log. Please put some stress on the server and let's monitor. [19:31:18] (03Merged) 10jenkins-bot: Drop obsolete Wikibase configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491506 (https://phabricator.wikimedia.org/T213713) (owner: 10Ladsgroup) [19:32:04] PROBLEM - puppet last run on maps2004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:32:49] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [19:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:52] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [19:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:14] !log ladsgroup@deploy1001 Synchronized wmf-config/Wikibase.php: SWAT: [[gerrit:491506|Drop obsolete Wikibase configs (T213713)]], Part I (duration: 00m 52s) [19:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:17] T213713: Remove tmpMaxItemIdForNew(Item|property)IdHtmlFormatter options & legacy code paths - https://phabricator.wikimedia.org/T213713 [19:33:59] (03PS1) 10EBernhardson: [cirrus] autocomplete: enable subphrase suggester builds on officewiki (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492031 (https://phabricator.wikimedia.org/T150153) [19:34:02] (03PS1) 10EBernhardson: [cirrus] autocomplete: enable subphrase matching for officewiki (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492032 (https://phabricator.wikimedia.org/T150153) [19:35:34] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:491506|Drop obsolete Wikibase configs (T213713)]], Part II (duration: 00m 53s) [19:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:06] (03PS3) 10Jcrespo: mariadb: Add the option of postprocessing backups [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) [19:36:31] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add the option of postprocessing backups [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [19:38:42] RECOVERY - Device not healthy -SMART- on labsdb1005 is OK: 
All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labsdb1005&var-datasource=eqiad+prometheus/ops [19:38:48] (03CR) 10jenkins-bot: Drop obsolete Wikibase configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491506 (https://phabricator.wikimedia.org/T213713) (owner: 10Ladsgroup) [19:39:25] Okay I use this time to do another ores deployment [19:40:27] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 (10jcrespo) Thanks, I have left it warming the buffer pool/replicating, tomorrow I will create a backup to touch all memory space. [19:42:08] (03CR) 10Jbond: "Thanks for the quick review, have marked this as WIP as there are still a couple more things to add including tests but good to know im go" [software/spicerack] - 10https://gerrit.wikimedia.org/r/492026 (owner: 10Jbond) [19:44:45] (03CR) 10Ottomata: "The TODOs can't be resolved until the Puppet patch is merged, and the puppet patch shouldn't be merged unless the DNS is set up. Can we g" [dns] - 10https://gerrit.wikimedia.org/r/491860 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [19:45:01] (03PS4) 10Ottomata: Set up DNS for eventgate-analytics [dns] - 10https://gerrit.wikimedia.org/r/491860 (https://phabricator.wikimedia.org/T211247) [19:45:12] (03CR) 10Volans: [C: 04-1] "This change is ready for review." (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/492026 (owner: 10Jbond) [19:52:17] (03CR) 10Ottomata: "Training material for...? 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/491861 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [19:52:44] !log ladsgroup@deploy1001 Started deploy [ores/deploy@5d937b1]: Drop accepting pickle altogether (T206333) [19:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:47] T206333: Change default serializer of celery from pickle to json - https://phabricator.wikimedia.org/T206333 [19:53:06] ahh flashbacks [19:57:15] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10RobH) icigna1001 passes ALL dell hardware tests when the built in extended tests were run. Annoying it doesnt show the errors we see in the OS logs. [19:57:29] (03PS2) 10BBlack: Revert "Depool eqsin for cr2-eqsin setup" [dns] - 10https://gerrit.wikimedia.org/r/491964 (https://phabricator.wikimedia.org/T213121) [19:57:57] (03CR) 10BBlack: [C: 03+2] Revert "Depool eqsin for cr2-eqsin setup" [dns] - 10https://gerrit.wikimedia.org/r/491964 (https://phabricator.wikimedia.org/T213121) (owner: 10BBlack) [19:58:58] !log eqsin: repooling user traffic [19:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] thcipriani: I, the Bot under the Fountain, allow thee, The Deployer, to do MediaWiki train - Americas version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190221T2000). [20:01:18] * thcipriani train [20:02:11] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Cmjohnson) I swapped CPU1 w/ CPU2 and cleared the log. Please monitor to see where and if the error continues or moves. [20:02:13] Amir1: are you still deploying SWAT things? Or am I OK to start train deployment? 
[20:02:29] thcipriani: it's about to finish [20:02:37] okie doke :) [20:06:01] !log ladsgroup@deploy1001 Finished deploy [ores/deploy@5d937b1]: Drop accepting pickle altogether (T206333) (duration: 13m 17s) [20:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:04] T206333: Change default serializer of celery from pickle to json - https://phabricator.wikimedia.org/T206333 [20:06:11] thcipriani: I'm done ^ [20:06:23] Amir1: great, thanks! [20:06:29] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10Yurik) Note that Graphoid already supports `POST` approach (see v2 Graphoid api). You can post a graph s... [20:06:32] and thanks for SWATing your own patches, appreciated :) [20:06:39] One thing, if you see something weird, I deployed lots of things [20:06:42] ping me [20:06:47] 10Operations, 10ExternalGuidance, 10Traffic, 10MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10dr0ptp4kt) @BBlack in https://gerrit.wikimedia.org/r/490120 I checked in with @Pginer-WMF to... 
[20:06:53] you are very welcome, I wish I could help more [20:15:15] (03PS1) 10Thcipriani: all wikis to 1.33.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492041 [20:15:17] (03CR) 10Thcipriani: [C: 03+2] all wikis to 1.33.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492041 (owner: 10Thcipriani) [20:16:32] (03Merged) 10jenkins-bot: all wikis to 1.33.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492041 (owner: 10Thcipriani) [20:18:34] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.18 [20:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:51] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 (10Marostegui) @jcrespo maybe we can leave a mydumper running 24x7 on a loop for days on that host: dumping everything, deleting the backups file, dump everyting and so forth. [20:23:45] (03CR) 10jenkins-bot: all wikis to 1.33.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492041 (owner: 10Thcipriani) [20:24:44] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga2001 is CRITICAL: 1.054e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [20:26:13] ebernhardson: this is due to the one topic being bursty because of cluster restarts, right? [20:26:16] ^^ [20:26:22] should catch up eventually? 
[20:28:35] (03PS1) 10EBernhardson: [cirrus] Switch production search traffic to codfw (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492044 (https://phabricator.wikimedia.org/T215931) [20:28:38] (03PS1) 10EBernhardson: [cirrus] Switch production search traffic to codfw (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492045 (https://phabricator.wikimedia.org/T215931) [20:29:39] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Switch production search traffic to codfw (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492044 (https://phabricator.wikimedia.org/T215931) (owner: 10EBernhardson) [20:34:03] ottomata: right. And restarts is almost done [20:34:05] ottomata: right, if the scripts are working right it should pause writes for 5 minutes, and then let them drain [20:34:51] The last nodes actually recovered faster than expected, and thus there wasn't much time to process the writes before the next restarts [20:35:24] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=0) [20:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:47] Volans, onimisionipe : restart completed ! Thanks for all the help ! [20:37:06] gehel: \o/ lmk in which state we want to leave 1) cumin2001 2) the repo (merging or not that patch) [20:37:24] gehel: Great! I can see all from my end too [20:37:25] 3) make a new release (depends on 2 mostly) [20:37:57] volans: I'll continue with eqiad tomorrow, so ideally a new release [20:38:22] (03CR) 10Ppchelko: [C: 03+1] "Seems like MW has been deployed. Should we merge in order to check whether it will work in labs?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490418 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [20:38:37] But it's getting late, we can check the details tomorrow [20:39:16] (03CR) 10Ottomata: [C: 03+2] "Let's go!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/490418 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [20:39:56] gehel: no prob for the release, just wondering if the cumin-suppress patch is needed too, seems so at this point [20:40:15] Yep, it makes a big difference ! [20:40:16] and I'm unsure if it's a good or bad thing, I still have mixed feelings about that [20:40:21] patch [20:40:47] but I'll agree to merge it for now at least until cumin solves the slowness problem [20:40:59] if only upstream would listen more to my suggestions... [20:41:02] :-P [20:41:57] Blaming upstream... Too easy ;) [20:44:33] * volans changes hat and takes the blame [20:45:54] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10Bawolff) > P.S. @Bawolff I might be wrong, but I think at the time of first Graphoid version, pageprops w... [20:46:09] (03CR) 10jenkins-bot: Use EventBus multi endpoint configuration for eventbus configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490418 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [20:57:24] RECOVERY - Host cp5007 is UP: PING OK - Packet loss = 0%, RTA = 195.51 ms [20:57:28] RECOVERY - IPsec on cp1076 is OK: Strongswan OK - 72 ESP OK [20:57:30] RECOVERY - Host cp5006 is UP: PING OK - Packet loss = 0%, RTA = 195.31 ms [20:57:38] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 64 ESP OK [20:57:38] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 64 ESP OK [20:57:38] RECOVERY - IPsec on cp1087 is OK: Strongswan OK - 56 ESP OK [20:57:40] RECOVERY - IPsec on cp1075 is OK: Strongswan OK - 56 ESP OK [20:57:42] RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 52 ESP OK [20:57:42] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 64 ESP OK [20:57:42] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 52 ESP OK [20:57:42] RECOVERY - IPsec on cp2023 is 
OK: Strongswan OK - 52 ESP OK [20:57:44] RECOVERY - IPsec on cp1082 is OK: Strongswan OK - 72 ESP OK [20:57:44] RECOVERY - IPsec on cp1079 is OK: Strongswan OK - 56 ESP OK [20:57:46] RECOVERY - IPsec on cp1080 is OK: Strongswan OK - 72 ESP OK [20:57:48] RECOVERY - IPsec on cp1089 is OK: Strongswan OK - 56 ESP OK [20:57:48] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 64 ESP OK [20:57:48] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 64 ESP OK [20:57:49] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 52 ESP OK [20:57:52] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 64 ESP OK [20:57:52] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 64 ESP OK [20:57:56] RECOVERY - IPsec on cp1088 is OK: Strongswan OK - 72 ESP OK [20:57:58] RECOVERY - IPsec on cp1081 is OK: Strongswan OK - 56 ESP OK [20:57:58] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 64 ESP OK [20:57:58] RECOVERY - IPsec on cp1090 is OK: Strongswan OK - 72 ESP OK [20:57:58] RECOVERY - IPsec on cp1086 is OK: Strongswan OK - 72 ESP OK [20:58:02] RECOVERY - IPsec on cp1083 is OK: Strongswan OK - 56 ESP OK [20:58:04] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 52 ESP OK [20:58:04] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 64 ESP OK [20:58:04] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga2001 is OK: (C)1e+05 gt (W)1e+04 gt 436 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [20:58:08] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 64 ESP OK [20:58:10] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 52 ESP OK [20:58:12] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 64 ESP OK [20:58:16] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 64 ESP OK [20:58:16] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 52 ESP OK [20:58:16] RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 52 ESP OK [20:58:18] 
RECOVERY - IPsec on cp1085 is OK: Strongswan OK - 56 ESP OK [20:58:18] RECOVERY - IPsec on cp1077 is OK: Strongswan OK - 56 ESP OK [20:58:20] RECOVERY - IPsec on cp1078 is OK: Strongswan OK - 72 ESP OK [20:58:24] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 52 ESP OK [20:58:26] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 52 ESP OK [20:58:34] RECOVERY - IPsec on cp1084 is OK: Strongswan OK - 72 ESP OK [20:59:45] (03PS2) 10Paladox: Update healthcheck [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/489734 [21:06:38] RECOVERY - very high load average likely xfs on ms-be1033 is OK: OK - load average: 71.19, 73.03, 79.80 [21:31:17] (03PS2) 10EBernhardson: [cirrus] Switch production search traffic to codfw (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492044 (https://phabricator.wikimedia.org/T215931) [21:38:47] (03PS1) 10Eevans: sessions: add (dummy) key material for session storage cluster [labs/private] - 10https://gerrit.wikimedia.org/r/492196 (https://phabricator.wikimedia.org/T215883) [21:39:10] (03CR) 10Eevans: [V: 03+1 C: 03+1] sessions: add (dummy) key material for session storage cluster [labs/private] - 10https://gerrit.wikimedia.org/r/492196 (https://phabricator.wikimedia.org/T215883) (owner: 10Eevans) [21:40:19] (03PS3) 10EBernhardson: [cirrus] Switch production search traffic to codfw (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492044 (https://phabricator.wikimedia.org/T215931) [21:40:22] (03PS2) 10EBernhardson: [cirrus] Switch production search traffic to codfw (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492045 (https://phabricator.wikimedia.org/T215931) [21:40:47] (03PS9) 10CRusnov: Add ganeti read-only user deployment [puppet] - 10https://gerrit.wikimedia.org/r/490397 (https://phabricator.wikimedia.org/T215229) [21:42:18] (03CR) 10CRusnov: [C: 03+2] Add ganeti read-only user deployment [puppet] - 10https://gerrit.wikimedia.org/r/490397 
(https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [21:43:56] 10Operations, 10ops-eqsin, 10Traffic: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (10RobH) So, hardware testing completed, both the quick and in depth testing offered by the Dell utility selected during POST. However, previous SEL entries (posted above) show issues in dimm sl... [21:45:41] 10Operations, 10ops-eqsin, 10Traffic: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (10RobH) a:03ayounsi This passed all in depth Dell hardware test utilities, and issued no further errors since I cleared the log and ran hardware tests. Since our onsite time is limited, I'd rec... [21:47:00] 10Operations, 10ops-eqsin, 10Traffic: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (10RobH) a:03ayounsi Since onsite time is limited, it may be best for Arzhel to swap dimm A1 to A2, and swap dimm a5 to a4. This moves two questionable dimms to two slots that haven't reported e... [21:47:02] (03PS1) 10Ottomata: Monitor stream.wikimedia.org public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/492199 (https://phabricator.wikimedia.org/T215013) [21:47:53] (03CR) 10jerkins-bot: [V: 04-1] Monitor stream.wikimedia.org public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/492199 (https://phabricator.wikimedia.org/T215013) (owner: 10Ottomata) [21:50:48] onimisionipe: gehel: hello- there is an unmerged puppet change https://gerrit.wikimedia.org/r/c/operations/puppet/+/492030 is it okay to merge this? [21:51:13] (03CR) 10DCausse: "still unsure why ::role::lvs::realserver is here" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [21:54:41] chaomodus: I think you need to refresh. 
Its merged already [21:54:56] okay [21:55:09] Oh I mean at the puppet-merge stage [21:55:16] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [21:55:17] it is waiting to be puppet-merged [21:57:18] chaomodus: oh.. Its ok to merge [21:58:00] okay great thank you! [21:58:52] chaomodus: yep, I confirm, ok to merge [21:58:55] sorry for the mess! [21:59:00] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [22:00:02] no worries :) [22:02:09] (03PS2) 10Ottomata: Monitor stream.wikimedia.org public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/492199 (https://phabricator.wikimedia.org/T215013) [22:02:59] (03CR) 10jerkins-bot: [V: 04-1] Monitor stream.wikimedia.org public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/492199 (https://phabricator.wikimedia.org/T215013) (owner: 10Ottomata) [22:06:10] PROBLEM - Check systemd state on ganeti1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:08:28] PROBLEM - Check systemd state on ganeti2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:08:56] RECOVERY - puppet last run on maps2004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:09:22] PROBLEM - Check systemd state on ganeti1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:10:03] PROBLEM - Check systemd state on ganeti2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:10:35] looks like... ferm is failing on all ganeti hosts? [22:10:52] PROBLEM - Check systemd state on ganeti1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[22:10:58] PROBLEM - Check systemd state on ganeti1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:10:59] Feb 21 22:07:50 ganeti1002 puppet-agent[2990]: (/Stage[main]/Ferm/Service[ferm]) Feb 21 22:07:50 ganeti1002 ferm[3994]: DNS query for 'netmon1002.eqiad.wmnet' failed: NXDOMAIN [22:11:03] chaomodus: ^^^ [22:11:27] it's wikimedia.org [22:11:32] not eqiad.wmnet [22:12:04] PROBLEM - Check systemd state on ganeti2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:12:42] PROBLEM - Check systemd state on ganeti2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:13:20] PROBLEM - Check systemd state on ganeti2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:14:10] (03PS1) 10Volans: ganeti: fix netmon hostnames [puppet] - 10https://gerrit.wikimedia.org/r/492202 [22:14:16] chaomodus: ^^^ [22:14:58] (03CR) 10Volans: [C: 03+2] ganeti: fix netmon hostnames [puppet] - 10https://gerrit.wikimedia.org/r/492202 (owner: 10Volans) [22:15:49] I'm forcing a puppet run where it failed ( https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed ) [22:16:30] PROBLEM - Check systemd state on ganeti2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:16:54] ah ofc puppet didn't fail [22:17:02] PROBLEM - Check systemd state on ganeti2007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[22:17:05] !log forcing a puppet run on A:ganeti [22:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:44] RECOVERY - Check systemd state on ganeti2008 is OK: OK - running: The system is fully operational [22:17:44] RECOVERY - Check systemd state on ganeti2006 is OK: OK - running: The system is fully operational [22:17:54] PROBLEM - Check systemd state on ganeti1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:18:16] RECOVERY - Check systemd state on ganeti2007 is OK: OK - running: The system is fully operational [22:18:22] RECOVERY - Check systemd state on ganeti2003 is OK: OK - running: The system is fully operational [22:18:22] RECOVERY - Check systemd state on ganeti2001 is OK: OK - running: The system is fully operational [22:18:28] RECOVERY - Check systemd state on ganeti2002 is OK: OK - running: The system is fully operational [22:18:36] RECOVERY - Check systemd state on ganeti2005 is OK: OK - running: The system is fully operational [22:18:42] RECOVERY - Check systemd state on ganeti1001 is OK: OK - running: The system is fully operational [22:19:08] RECOVERY - Check systemd state on ganeti1008 is OK: OK - running: The system is fully operational [22:19:24] RECOVERY - Check systemd state on ganeti1007 is OK: OK - running: The system is fully operational [22:19:38] RECOVERY - Check systemd state on ganeti1003 is OK: OK - running: The system is fully operational [22:20:58] RECOVERY - Check systemd state on ganeti1002 is OK: OK - running: The system is fully operational [22:21:13] chaomodus: I've forced the puppet run, they seem to have recovered but please double check that everything else is working [22:24:50] oh heh they seem to be working yes [22:25:03] !log change pw for NazarSusP [22:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:07] you look away for a second :\ [22:26:13] (03PS1) 10CRusnov: Change ownership of users file to 
match required ownership. [puppet] - 10https://gerrit.wikimedia.org/r/492203 [22:27:39] (03CR) 10Volans: Change ownership of users file to match required ownership. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/492203 (owner: 10CRusnov) [22:31:32] (03PS2) 10CRusnov: ganeti: Change ownership of rapi users file to match required ownership [puppet] - 10https://gerrit.wikimedia.org/r/492203 (https://phabricator.wikimedia.org/T215229) [23:25:08] 10Operations, 10Data-Services, 10decommission, 10cloud-services-team (Kanban): Reclaim/Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet as soon as they are ready - https://phabricator.wikimedia.org/T216749 (10bd808) [23:44:48] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10Bstorm)