[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190117T0000). [00:00:04] No GERRIT patches in the queue for this window AFAICS. [00:00:34] But because of the terrible way CN deploys work, it got into the deploy directory for wmf.12 [00:00:41] so this is really just to clean that up [00:01:15] not that it matters too much, in fact, but the idea had been to put it on the train, and that's the way the train rolls [00:01:25] and SWATs are for smaller thingies [00:01:30] 10Operations, 10netops: Replace accepted-prefix-limit with prefix-limit - https://phabricator.wikimedia.org/T211730 (10ayounsi) a:05ayounsi→03faidon Over to Faidon for feedback. [00:05:18] Reedy: RoanKattouw: ok I'm just gonna +2 it and then try to set the deploy directory to what it should be. Could I bother one of you to just verify what I have to do before I do it? I could paste the commands here before pressing Enter... [00:05:21] thx in advance [00:05:36] sorry it's been a while since I deployed anything to prod myself.... [00:05:53] At least earlier... the CN submodule hadn't been updated on deploy1001 [00:06:10] So merge, and a git pull followed by a git status to check the w/c was now clean [00:06:37] (03PS2) 10Jforrester: WBMI: Disable showing 'depicts' statements on Commons for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484781 [00:06:39] (03PS2) 10Jforrester: [Beta Cluster] WBMI: Show 'depicts' statements on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484782 [00:12:03] (03PS1) 10Jforrester: Beta Features: Add the new PHP7 beta feature to the whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484799 [00:12:31] Reedy: ^^ In case I forget to deploy later. [00:12:56] (03CR) 10Reedy: [C: 03+1] "woo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484799 (owner: 10Jforrester) [00:16:40] (03PS11) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) [00:17:29] Reedy: don't we usually fetch then rebase instead of pull on deployment-staging? 
[00:17:54] I'm pretty sure the repos are setup to rebase automagically on pull [00:18:07] But you can fetch and rebase if you want too, doesn't matter too much :) [00:18:35] (03PS1) 10Jforrester: dblists: Rename 'wikidatarepo' to 'wikibaserepo' part I – Create it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484801 (https://phabricator.wikimedia.org/T213504) [00:18:37] (03PS1) 10Jforrester: dblists: Rename 'wikidatarepo' to 'wikibaserepo' part II – Use it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484802 (https://phabricator.wikimedia.org/T213504) [00:18:39] (03PS1) 10Jforrester: dblists: Rename 'wikidatarepo' to 'wikibaserepo' part III – Stop using it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484803 (https://phabricator.wikimedia.org/T213504) [00:18:41] (03PS1) 10Jforrester: dblists: Rename 'wikidatarepo' to 'wikibaserepo' part IV – Delete it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484804 (https://phabricator.wikimedia.org/T213504) [00:20:27] RECOVERY - MariaDB Slave Lag: s5 on db2059 is OK: OK slave_sql_lag Replication lag: 54.54 seconds [00:22:09] RECOVERY - MariaDB Slave Lag: s5 on db2038 is OK: OK slave_sql_lag Replication lag: 59.15 seconds [00:22:40] Reedy: hmmm ok, just looking at https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Step_2:_get_the_code_on_the_deployment_host [00:25:01] RECOVERY - MariaDB Slave Lag: s8 on db2082 is OK: OK slave_sql_lag Replication lag: 31.67 seconds [00:25:09] RECOVERY - MariaDB Slave Lag: s8 on db2080 is OK: OK slave_sql_lag Replication lag: 13.26 seconds [00:25:13] RECOVERY - MariaDB Slave Lag: s8 on db2094 is OK: OK slave_sql_lag Replication lag: 3.18 seconds [00:25:19] RECOVERY - MariaDB Slave Lag: s8 on db2086 is OK: OK slave_sql_lag Replication lag: 0.14 seconds [00:25:35] RECOVERY - MariaDB Slave Lag: s8 on db2045 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [00:25:39] RECOVERY - MariaDB Slave Lag: s8 on db2083 is OK: OK slave_sql_lag Replication lag: 0.35 seconds [00:25:49] RECOVERY - MariaDB Slave Lag: s8 on db2081 is OK: OK slave_sql_lag Replication lag: 0.45 seconds [00:25:53] RECOVERY - MariaDB Slave Lag: s8 on db2085 is OK: OK slave_sql_lag Replication lag: 0.19 seconds [00:26:01] RECOVERY - MariaDB Slave Lag: s8 on db2079 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [00:26:02] (03CR) 10Dzahn: [C: 03+2] "doing this before there is any actual content of course.." [puppet] - 10https://gerrit.wikimedia.org/r/483604 (https://phabricator.wikimedia.org/T180641) (owner: 10Dzahn) [00:26:11] (03PS2) 10Dzahn: static-rt: add LDAP simple auth [puppet] - 10https://gerrit.wikimedia.org/r/483604 (https://phabricator.wikimedia.org/T180641) [00:26:18] Reedy: ok so did git fetch, remote log shows only the CN submodule "revert" patch [00:27:15] Reedy: so now, just 'git rebase origin/wmf/1.33.0-wmf.12', correct? [00:28:09] seems pretty straightforward, I'm pressing Enter! [00:29:00] Reedy: RoanKattouw: ok, so the "modified: extensions/CentralNotice (new commits)" that git status was showing before is gone! [00:29:21] I think my work here is done [00:29:23] thx much! [00:29:39] sweet [00:29:44] thanks for tidying up :) [00:29:45] (03PS3) 10Dzahn: static-rt: add LDAP simple auth, allow ops [puppet] - 10https://gerrit.wikimedia.org/r/483604 (https://phabricator.wikimedia.org/T180641) [00:29:52] Reedy: thx for the help doing so! 
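For reference, the deploy-host cleanup discussed above (00:05–00:29) amounts to the following sequence. The checkout path is an assumption (the log never names it); the git commands themselves are the ones quoted in the conversation:

```bash
# Hedged sketch of the wmf.12 cleanup on the deployment host; the path below is assumed, not quoted.
cd /srv/mediawiki-staging/php-1.33.0-wmf.12        # hypothetical location of the wmf.12 checkout

git fetch                                          # fetch first, then inspect what would come in
git log --oneline HEAD..origin/wmf/1.33.0-wmf.12   # should show only the CentralNotice submodule revert

git rebase origin/wmf/1.33.0-wmf.12                # or plain `git pull`, if the repo is configured
                                                   # to rebase automatically on pull
git status                                         # working copy should now be clean: no more
                                                   # "modified: extensions/CentralNotice (new commits)"
```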
[00:30:09] (03CR) 10Dzahn: [C: 03+2] static-rt: add LDAP simple auth, allow ops [puppet] - 10https://gerrit.wikimedia.org/r/483604 (https://phabricator.wikimedia.org/T180641) (owner: 10Dzahn) [00:30:10] Since nothing was pushed out anywhere, I guess I don't need to log anything to server logs? [00:33:03] (03PS1) 10Thcipriani: Use python2 as basepython [software] - 10https://gerrit.wikimedia.org/r/484806 [00:33:52] (03CR) 10Dzahn: "also see https://gerrit.wikimedia.org/r/482693 if that could get merged then this is getting much closer to being able to delete the modu" [puppet] - 10https://gerrit.wikimedia.org/r/391849 (https://phabricator.wikimedia.org/T162070) (owner: 10Jcrespo) [00:34:39] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.00 seconds [00:34:59] Gonna be only near but mostly not at the keyboard for a bit, pls ping if there's any fallout from anything I may have done wrong ^ thx! [00:36:02] (03PS2) 10Thcipriani: Use python2 as basepython [software] - 10https://gerrit.wikimedia.org/r/484806 [00:39:19] (03PS1) 10Dzahn: webserver_static: add proxypass LDAP password for static RT [puppet] - 10https://gerrit.wikimedia.org/r/484808 (https://phabricator.wikimedia.org/T180641) [00:39:49] (03CR) 10jerkins-bot: [V: 04-1] webserver_static: add proxypass LDAP password for static RT [puppet] - 10https://gerrit.wikimedia.org/r/484808 (https://phabricator.wikimedia.org/T180641) (owner: 10Dzahn) [00:39:55] (03PS2) 10Dzahn: webserver_static: add proxypass LDAP password for static RT [puppet] - 10https://gerrit.wikimedia.org/r/484808 (https://phabricator.wikimedia.org/T180641) [00:41:29] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.44 seconds [00:41:53] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.87 seconds [00:42:07] PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.22 seconds [00:42:11] PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.04 seconds [00:42:13] PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.24 seconds [00:42:29] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.39 seconds [00:42:33] PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.11 seconds [00:43:00] (03CR) 10Dzahn: [C: 03+2] webserver_static: add proxypass LDAP password for static RT [puppet] - 10https://gerrit.wikimedia.org/r/484808 (https://phabricator.wikimedia.org/T180641) (owner: 10Dzahn) [00:50:03] (03PS2) 10Dzahn: testreduce: use component/node10 for node 10 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/484579 (https://phabricator.wikimedia.org/T201366) [00:50:34] (03CR) 10jerkins-bot: [V: 04-1] testreduce: use component/node10 for node 10 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/484579 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [00:50:47] (03PS3) 10Dzahn: testreduce: use component/node10 for node 10 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/484579 (https://phabricator.wikimedia.org/T201366) [00:51:30] (03PS4) 10Dzahn: testreduce: use component/node10 for node 10 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/484579 (https://phabricator.wikimedia.org/T201366) [00:52:19] (03CR) 10Dzahn: [C: 03+2] "thanks Alex! 
https://puppet-compiler.wmflabs.org/compiler1002/14364/scandium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/484579 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [01:00:04] twentyafterfour: My dear minions, it's time we take the moon! Just kidding. Time for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190117T0100). [01:00:22] (03PS1) 10Dzahn: testreduce: no require_package for nodejs, avoid dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/484811 (https://phabricator.wikimedia.org/T201366) [01:06:01] (03CR) 10Dzahn: [C: 03+2] "follow-up https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/484811/" [puppet] - 10https://gerrit.wikimedia.org/r/484579 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [01:10:45] (03CR) 10Dzahn: [C: 04-2] "this also doesn't work, leads to duplicate declaration then https://puppet-compiler.wmflabs.org/compiler1002/14365/ruthenium.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/484811 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [01:11:01] (03CR) 10Dzahn: [C: 04-2] "bangs head" [puppet] - 10https://gerrit.wikimedia.org/r/484811 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [01:15:03] (03PS1) 10Dzahn: testreduce: remove "Before" for apt:repo dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/484815 [01:16:06] (03CR) 10Dzahn: [C: 03+2] testreduce: remove "Before" for apt:repo dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/484815 (owner: 10Dzahn) [01:26:50] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 58.44 seconds [01:26:56] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 57.09 seconds [01:27:12] RECOVERY - MariaDB Slave Lag: s7 on db2095 is OK: OK slave_sql_lag Replication lag: 51.21 seconds [01:27:16] RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 48.22 seconds [01:27:18] RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 48.33 seconds [01:27:24] RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 45.43 seconds [01:27:40] RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 40.88 seconds [01:27:42] RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 39.58 seconds [01:54:33] RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 55.37 seconds [02:05:05] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Performance-Team: New MongoDB version is not DFSG-compatible, dropped by Debian - https://phabricator.wikimedia.org/T213996 (10MaxSem) [02:06:11] 10Operations, 10MediaWiki-Debug-Logger, 10Performance-Team: Set up request profiling for PHP 7 - https://phabricator.wikimedia.org/T206152 (10MaxSem) Speaking of Mongo: {T213996}. 
[02:06:46] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Performance-Team, 10Software-Licensing: New MongoDB version is not DFSG-compatible, dropped by Debian - https://phabricator.wikimedia.org/T213996 (10Peachey88) [02:15:48] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Performance-Team, 10Software-Licensing: New MongoDB version is not DFSG-compatible, dropped by Debian - https://phabricator.wikimedia.org/T213996 (10MaxSem) [02:20:59] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Performance-Team, 10Software-Licensing: New MongoDB version is not DFSG-compatible, dropped by Debian - https://phabricator.wikimedia.org/T213996 (10Legoktm) SSPL v2 is not substantially different, and IMO the perspective on the OSI's license-review... [02:25:23] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:25:33] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 132 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [02:26:29] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 77368 bytes in 0.201 second response time [02:48:39] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:51:05] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:59:01] RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 49.13 seconds [02:59:15] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 5.20 seconds [02:59:25] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [02:59:51] RECOVERY - MariaDB Slave Lag: s7 on db2095 is OK: OK slave_sql_lag Replication lag: 0.12 seconds [02:59:55] RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [03:00:01] RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 0.24 seconds [03:00:07] RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 0.19 seconds [03:00:09] RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 0.48 seconds [03:01:05] RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 0.11 seconds [03:16:15] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:17:27] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [03:27:55] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:29:09] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:31:53] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:33:41] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response 
was received [03:33:43] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:34:15] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [03:34:39] PROBLEM - parsoid on wtp2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:34:53] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [03:35:41] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:36:03] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy [03:36:49] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy [03:37:23] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:38:19] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:38:31] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [03:39:20] PROBLEM - LVS HTTP IPv4 on parsoid.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:39:20] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:39:27] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [03:39:31] RECOVERY - parsoid on wtp2016 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 3.656 second response time [03:41:01] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:41:19] PROBLEM - parsoid on wtp2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:41:35] PROBLEM - parsoid on wtp2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:42:13] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy [03:42:21] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:42:25] RECOVERY - parsoid on wtp2005 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 0.083 second response time [03:42:39] PROBLEM - parsoid on wtp2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:42:41] RECOVERY - parsoid on wtp2018 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 0.110 second response time [03:42:45] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:42:51] PROBLEM - parsoid on wtp2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:43:04] RECOVERY - LVS HTTP IPv4 on parsoid.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 9.870 second response time [03:43:05] PROBLEM - Restbase LVS codfw on 
restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:43:07] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:43:29] PROBLEM - parsoid on wtp2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:43:43] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:43:55] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:43:55] PROBLEM - parsoid on wtp2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:44:16] codfw issues generally? [03:44:21] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:44:23] PROBLEM - parsoid on wtp2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:44:31] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:44:37] PROBLEM - parsoid on wtp2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:44:49] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:44:53] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:44:59] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:45:33] PROBLEM - parsoid on wtp2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:45:37] PROBLEM - parsoid on wtp2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:45:39] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:45:43] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [03:45:59] PROBLEM - parsoid on wtp2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:46:05] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:46:21] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid_8000: Servers wtp2004.codfw.wmnet, wtp2007.codfw.wmnet, wtp2012.codfw.wmnet, wtp2010.codfw.wmnet, wtp2013.codfw.wmnet, wtp2008.codfw.wmnet, wtp2019.codfw.wmnet, wtp2001.codfw.wmnet, wtp2020.codfw.wmnet, wtp2018.codfw.wmnet, wtp2006.codfw.wmnet, wtp2003.codfw.wmnet, wtp2009.codfw.wmnet, wtp2016.codfw.wmnet, 
wtp2017.codfw.wmnet, wtp201 [03:46:21] p2014.codfw.wmnet are marked down but pooled [03:46:23] PROBLEM - parsoid on wtp2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:46:39] PROBLEM - parsoid on wtp2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:47:03] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [03:47:03] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [03:47:04] PROBLEM - LVS HTTP IPv4 on parsoid.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:47:04] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy [03:47:04] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:47:04] PROBLEM - parsoid on wtp2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:47:04] PROBLEM - parsoid on wtp2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:47:16] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid_8000: Servers wtp2018.codfw.wmnet, wtp2006.codfw.wmnet, wtp2002.codfw.wmnet, wtp2007.codfw.wmnet, wtp2019.codfw.wmnet, wtp2016.codfw.wmnet, wtp2009.codfw.wmnet, wtp2008.codfw.wmnet, wtp2012.codfw.wmnet, wtp2004.codfw.wmnet, wtp2013.codfw.wmnet, wtp2017.codfw.wmnet, wtp2010.codfw.wmnet, wtp2020.codfw.wmnet, wtp2001.codfw.wmnet, wtp201 [03:47:16] p2003.codfw.wmnet, wtp2005.codfw.wmnet, wtp2011.codfw.wmnet, wtp2014.codfw.wmnet are marked down but pooled [03:47:54] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wtp2006.codfw.wmnet, wtp2004.codfw.wmnet, wtp2007.codfw.wmnet, wtp2012.codfw.wmnet, wtp2013.codfw.wmnet, wtp2017.codfw.wmnet, wtp2010.codfw.wmnet, wtp2001.codfw.wmnet, wtp2020.codfw.wmnet, wtp2018.codfw.wmnet, wtp2009.codfw.wmnet, wtp2019.codfw.wmnet, wtp2003.codfw.wmnet, wtp2016.codfw.wmnet, wtp2008.codfw.wmnet, wtp2011.cod [03:47:54] .codfw.wmnet]) [03:48:04] PROBLEM - parsoid on wtp2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:48:16] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [03:48:18] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [03:48:24] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [03:48:32] Pchelolo: your doing? [03:48:52] Krinkle: here [03:48:59] not doing anything [03:49:16] I assume you got the page? (/me didn't, just heard from James_F ) [03:49:28] see _security [03:49:42] Krinkle: gimme a second to read through it [03:50:06] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [03:50:06] mobrovac: !!! 
[03:50:14] PROBLEM - parsoid on wtp2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:50:20] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy [03:50:38] PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wtp2006.codfw.wmnet, wtp2004.codfw.wmnet, wtp2007.codfw.wmnet, wtp2012.codfw.wmnet, wtp2013.codfw.wmnet, wtp2017.codfw.wmnet, wtp2002.codfw.wmnet, wtp2010.codfw.wmnet, wtp2020.codfw.wmnet, wtp2001.codfw.wmnet, wtp2015.codfw.wmnet, wtp2018.codfw.wmnet, wtp2009.codfw.wmnet, wtp2019.codfw.wmnet, wtp2003.codfw.wmnet, wtp2016.cod [03:50:38] .codfw.wmnet, wtp2005.codfw.wmnet, wtp2011.codfw.wmnet, wtp2014.codfw.wmnet]) [03:50:54] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:51:14] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:51:20] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:51:26] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:51:58] PROBLEM - parsoid on wtp2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:52:20] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [03:52:46] PROBLEM - parsoid on wtp2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:53:16] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [03:53:30] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:53:30] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy [03:53:48] PROBLEM - parsoid on wtp2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:53:50] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy [03:54:20] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:54:26] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [03:55:24] Krinkle: I know how to mitigate the issue for tonight. wait [03:55:42] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Performance-Team, 10Software-Licensing: New MongoDB version is not DFSG-compatible, dropped by Debian - https://phabricator.wikimedia.org/T213996 (10Krinkle) XHGui is scheduled to be migrated from tungsten (Jessie; Mongo 2.4.10) to webperf1002 (Stret... [03:56:13] not a great way though... [03:56:27] Restart the Parsoid codfw cluster? [03:56:30] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:56:36] (The app, not the whole box.) 
[03:56:48] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:57:28] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [03:57:35] let's try restartinf Parsoid [03:57:37] Pchelolo: how do you feel about depooling all of codfw for restbase and codfw? [03:57:40] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy [03:57:44] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:57:57] and, do you understand the nature of the outage? [03:57:57] cdanis: that would be not good [03:58:00] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy [03:58:02] okay [03:58:06] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy [03:58:12] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [03:58:29] Seeing this in the logs repeatedly: wt2html: Exceeded max resource use: tableCell. Aborting! [03:58:46] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [03:58:51] That's definitely Parsoid [03:58:58] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy [03:58:59] let's try restarting it [03:59:19] I have no rights to restart Parsoid [03:59:30] Per https://www.mediawiki.org/wiki/Parsoid/Deployments there's not been a code update since yesterday. [03:59:38] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy [03:59:48] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [04:00:02] PROBLEM - parsoid on wtp2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:00:06] I can attempt doing so. The parsoid docs have a restart procedure [04:00:28] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [04:00:46] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [04:00:50] `cd /srv/deployment/parsoid/deploy && scap deploy --service-restart` from deployment.eqiad.wmnet [04:01:09] scap deploy --service-restart from [04:01:32] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [04:01:37] Is there a way to limit the restart to just codfw? [04:01:48] Is it possible just to restart via systemctl? [04:01:53] Or does it require a scap deploy? 
[04:01:54] -l *.codfw.wmnet [04:02:02] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [04:02:28] !log cdanis@deploy1001 Started restart [parsoid/deploy@4b82683]: (no justification provided) [04:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:02:48] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [04:02:53] wtp2001,wtp2002 restarted [04:03:10] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [04:03:16] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [04:03:52] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [04:04:01] cdanis: does it look like that restart procedure will get all the wtp2* nodes? [04:04:14] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [04:04:18] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy [04:04:18] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [04:04:22] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [04:04:22] I believe so. scap has done a canary restart on two nodes, and I'd love to see them go green in icinga [04:04:40] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [04:04:44] of course if most of the cluster is gone, those two may not be able to keep up with load, either [04:05:06] yeah, I just decided to continue anyway for that reason bblack [04:05:30] RECOVERY - parsoid on wtp2012 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 6.772 second response time [04:05:32] RECOVERY - parsoid on wtp2004 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 9.705 second response time [04:05:34] Pchelolo: do you have an idea of how long Parsoid usually takes to start up / warm up and be ready to serve? 
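For the record, the codfw-only Parsoid restart being run above (04:00–04:02) boils down to the following, executed from the deployment host; the `--service-restart` flag and `-l` host pattern are as quoted in the conversation, and the log message is illustrative only:

```bash
# Restart the Parsoid service on the codfw workers only, without deploying new code.
cd /srv/deployment/parsoid/deploy
scap deploy --service-restart -l '*.codfw.wmnet' 'restart parsoid in codfw'
```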
[04:05:39] that looks promising [04:05:40] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [04:05:40] RECOVERY - parsoid on wtp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 1.362 second response time [04:05:58] lots of "startup finished" in the logs now [04:06:02] RECOVERY - parsoid on wtp2017 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 2.629 second response time [04:06:09] scap run completed [04:06:34] RECOVERY - parsoid on wtp2014 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 0.079 second response time [04:06:42] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [04:06:42] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [04:06:52] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [04:07:08] RECOVERY - parsoid on wtp2005 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 0.080 second response time [04:07:32] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [04:07:52] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [04:07:54] RECOVERY - parsoid on wtp2019 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 8.172 second response time [04:07:56] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [04:08:04] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [04:08:04] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [04:08:09] hm [04:08:26] RECOVERY - parsoid on wtp2011 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 6.895 second response time [04:08:30] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy [04:08:31] Restbase depends on Parsoid so there'll be a lag on recovery. [04:08:32] PROBLEM - parsoid on wtp2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:08:41] (And text-lb depends on Restbase.) [04:09:03] leme try some very random idea [04:09:12] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [04:09:12] PROBLEM - parsoid on wtp2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:09:12] PROBLEM - parsoid on wtp2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:09:28] PROBLEM - parsoid on wtp2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:09:36] RECOVERY - parsoid on wtp2009 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 0.086 second response time [04:09:36] well, that particular text-lb alert depends on restbase. normal mediawiki requests don't necessarily [04:09:40] RECOVERY - parsoid on wtp2016 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 4.301 second response time [04:09:44] Sorry, yes. 
[04:09:48] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [04:09:56] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy [04:10:00] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [04:10:02] RECOVERY - parsoid on wtp2015 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 9.239 second response time [04:10:06] Pchelolo: what's the issue with just depooling all of codfw restbase? [04:10:08] RECOVERY - parsoid on wtp2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 8.117 second response time [04:10:11] (at e.g. the varnish level) [04:10:18] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [04:10:20] RECOVERY - parsoid on wtp2020 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 9.768 second response time [04:10:24] !log ppchelko@deploy1001 Started deploy [mobileapps/deploy@89c4d8d]: revert new summary [04:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:10:30] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy [04:10:42] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy [04:11:30] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [04:11:48] Pchelolo: Oh. Hmm, yes, possibly though https://gerrit.wikimedia.org/r/c/mediawiki/services/mobileapps/deploy/+/484766 doesn't ping with anything obvious. [04:12:10] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy [04:12:20] !log ppchelko@deploy1001 Finished deploy [mobileapps/deploy@89c4d8d]: revert new summary (duration: 01m 55s) [04:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:12:47] RECOVERY - LVS HTTP IPv4 on parsoid.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 5.829 second response time [04:12:52] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [04:12:56] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy [04:13:00] RECOVERY - parsoid on wtp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 0.319 second response time [04:13:09] "xxxxxxx Update node module dependencies" is kind of a black box for breakage though :) [04:13:20] RECOVERY - parsoid on wtp2008 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 0.101 second response time [04:13:26] RECOVERY - parsoid on wtp2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 0.080 second response time [04:13:38] RECOVERY - parsoid on wtp2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 9.630 second response time [04:13:39] That's just node build updates, shouldn't have any meaningful prod code changes. [04:13:50] RECOVERY - parsoid on wtp2018 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 6.990 second response time [04:13:58] RECOVERY - parsoid on wtp2012 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 0.086 second response time [04:14:04] RECOVERY - parsoid on wtp2004 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 5.480 second response time [04:14:14] (But yes.) [04:14:17] I guess that depends on your definition of things, but it definitely updates a bunch of code that prod runs :) [04:14:20] Well, things seem to be coming back up. 
[04:14:32] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [04:14:34] RECOVERY - parsoid on wtp2007 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 0.082 second response time [04:15:09] That one looks like it did the trick. [04:15:10] RECOVERY - parsoid on wtp2010 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 0.078 second response time [04:15:10] RECOVERY - parsoid on wtp2006 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 5.911 second response time [04:15:21] Yeah. [04:15:24] well it could've been just the wtp* restart too, and a little lag on recoveries [04:15:27] it's hard to say [04:15:29] ok gentlemen. I think I know what happened [04:15:44] (based just on IRC output / icinga) [04:16:08] I will confirm my gut feeling ideas on an incident report tomorrow [04:16:11] bblack: nah, there were continued problem reports after the restart, even on hosts that had reported healthy at the time of being restarted [04:16:21] Pchelolo: what's going on? [04:16:27] RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal [04:17:01] ok [04:17:25] sorry bblack I was trying to get all the logs together and get to understanding what happened, so I couldn't write up right away why here depooling codfw would not have helped. [04:17:44] OK, everything in https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&servicestatustypes=16&hoststatustypes=3&serviceprops=2097162 was there hours ago, except for lvs2006 [04:18:00] the lvs2006 one will recover too, it just takes time to reach the next natural execution of the check [04:18:19] * James_F nods. [04:18:31] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal [04:18:36] And there we go. [04:19:17] (Woah, loads of s2 DB replag.) [04:19:43] yeah that's been going on all day even back when more SRE were awake, I kind of assume someone knows and doesn't care [04:20:05] DB replag is not something I would expect if my theory of the outage is correct [04:20:05] Or "loads" is better than it was before. [04:20:15] Pchelolo: Yeah, I don't think they're connected. [04:20:30] bblack: Should we ACK the s2 lag though? [04:20:41] not unless you understand it and know why it's ok [04:20:50] (I don't!) [04:20:53] https://etherpad.wikimedia.org/p/parsoid-2019-01-16 [04:20:53] Yeah, "we" means "not me". ;-) [04:20:56] btw [04:21:04] i've started a timeline [04:21:15] So TLDR of my theory of the outage: there's a bug in RB which I claim responsibility for. [04:21:59] cdanis: Thanks. [04:22:26] and the mobileapps change aggravated the bug? [04:23:14] then when MCS has deployed a new version of summary it caused a bug to backfire and start rerendering all the summaries - which depend on Parsoid HTML - and somehow made RB rerender Parsoid. As soon as the critical mass of summaries were out of Varnish - it got Parsoid overwhelmed [04:23:27] also, why only codfw affected if parsoid, restbase, and mobileapps are all active/active and getting hit by the same stuff on all sides? Maybe codfw has higher base load because it's still the one running some excess async jobs, so is more sensitive? [04:23:43] Is that because we bumped the summary version today? [04:24:02] bblack: We're running in both eqiad and codfw. There's no duplicate capacity. :-( [04:24:45] that seems like a non-sequitur [04:24:47] is parsoid as a service not N+1? i.e. both codfw and eqiad are needed to be active to support peak load? [04:25:14] cdanis: Yup. :-(
[04:25:15] the whole point of having any service be active/active across the two core DCs is not to share load. either DC should be capable of handling the globe. [04:25:22] cdanis: no, we have tried a single DC during this switchover [04:25:37] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 104.3 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [04:25:57] just when it's overloaded in both DCs getting rid of half the capacity is not gonna help [04:26:32] but in this case, we were only seeing alerts in one DC. I'm still guessing that's because there's excess baseline load in codfw from async stuff. [04:26:48] but in general from a naive pov, if we see an alert on only one side, it would make sense to depool that side as faulty. [04:27:43] so you do think it was an overload condition, Pchelolo? do you have an idea of what underlying resource was overloaded? I don't believe it was CPU or network: https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=parsoid&var-instance=All&from=now-3h&to=now [04:27:44] one DC does the MCS requests triggered by ChangeProp, the other one handles the live user requests [04:28:09] both DCs handle live user requests, according to current config, at the restbase<->user level [04:28:52] oh, then my understanding of how things work is outdated [04:29:26] cdanis, Pchelolo: Have moved details to https://wikitech.wikimedia.org/wiki/Incident_documentation/20190117-Parsoid [04:29:41] thank you James_F [04:29:45] thanks! [04:29:55] I don't know enough to fill out more though. [04:30:10] yeah, at this point I have only questions :) [04:30:16] me too! [04:30:25] My main question is "what can I do to help?". [04:30:26] if we're stable, they can wait for another time [04:30:47] Yeah, but presumably the MCS team (hey bearND) need to know that they can't deploy for a bit. [04:31:13] but I'm not comfortable about the "not really A/A" thing, and it seems strange to me that's been the known state since the last switchover test and we've not done something to alleviate it or reconfigure or document it very loudly for incident response [04:31:16] But given that most of the team are either asleep now or will be soon, if we know it's not going to fall over right now that can wait for the morning.
[04:31:20] https://grafana.wikimedia.org/d/000000183/mobileapps?orgId=1 has a bit of correlation [04:31:44] I honestly don't really know what exactly has happened [04:32:37] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.41 seconds [04:33:25] PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.10 seconds [04:33:29] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.36 seconds [04:33:34] clearly seen here too https://grafana.wikimedia.org/d/000000068/restbase?orgId=1&from=now-1h&to=now [04:33:55] PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.58 seconds [04:33:59] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.32 seconds [04:34:05] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.00 seconds [04:34:17] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.53 seconds [04:35:21] PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 337.97 seconds [04:35:33] Pchelolo: The 404 responses look like they're massively driving things? [04:37:37] those might be coincidental. [04:37:57] is 'v1_page_summary_-title-' what MCS calls the requests it is making to Parsoid for page HTML? [04:38:32] that was my theory. [04:38:46] cdanis: No, that's the request to MCS. [04:38:57] or RB [04:39:18] ah okay, but that request path is ... MCS->Restbase->Parsoid? [04:39:21] MCS calls RB, which calls Parsoid if the HTML is not in storage [04:39:29] right [04:41:46] the root cause of this is not clear to me at all. [04:41:58] weirdly, according to logstash, there is a spike in Parsoid-logged errors after the outage is over [04:42:19] Doesn't RB trigger MCS to pre-cache all the summaries? Or am I mis-remembering? [04:42:36] James_F: that's the case yes. [04:42:49] CP->RB->MCS [04:43:02] Right. Does MCS updating the summary end-point cause RB to re-trigger for all possible pages? [04:43:08] lots of "Parsoid id found on element without a matching data-parsoid entry" [04:43:29] cdanis: that's a Parsoid bug I think [04:44:05] ChangeProp requests to RB>MCS had a big spike: https://grafana.wikimedia.org/d/000000183/mobileapps?orgId=1&panelId=14&fullscreen [04:44:20] latencies [04:44:43] yeah, 'cause they ask for HTML from RB [04:45:11] PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 496.29 seconds [04:47:40] ok. I'll try to investigate properly tomorrow [04:47:50] for now things seem to be stable [04:50:05] bearND: just don't deploy MCS tomorrow morning please :) [04:50:31] Pchelolo: ok, keep us in the loop (cc: mdholloway|afk ) [04:50:33] here's something interesting: restbase was logging more and more and more events since the mobileapps deploy at 21:10 [04:51:02] https://logstash.wikimedia.org/goto/09a23b8af73aa5d9c57fbe480d901216 [04:51:51] cdanis: if my theory of several bugs working together is correct - that's expected [04:52:26] FWIW, the deploy today included a few things for summary: (1) a fix for a transformation issue, (2) a new field was added for the wikibase_item, (3) -> version bump.
[04:52:50] bearND: my theory is that version bump was the cause [04:53:13] yeah, the other two things seem benign to me [04:53:14] and the steady increase in logs - pages goes out ot varnish cache [04:53:50] was restbase not storing the newly-versioned HTML, or something? [04:53:59] Massive upramp in "upgrade_failed" and then finally error/requests when it's overloaded? [04:58:08] * bearND suspects it has to do with https://github.com/wikimedia/restbase/blob/master/v1/summary_new.yaml#L20 [04:59:06] and https://github.com/wikimedia/restbase/blob/master/lib/ensure_content_type.js#L21-L22 [05:00:22] https://github.com/wikimedia/restbase/blob/master/lib/ensure_content_type.js#L65 is the log message we have thousands/minute of :) [05:00:43] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 50 probes of 373 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [05:01:49] but it's weird that it would cause an issue now. The expected version had not been updated in a while. It's still on 1.3.7 (https://github.com/wikimedia/restbase/blob/master/lib/ensure_content_type.js#L21-L22). [05:03:16] the logs show that 1.3.7 was 'expected' but 'actual' was 1.3.10 [05:04:42] okay, I'm stepping away from keys and going to sleep [05:05:18] Ok, I have a theory. It's because https://github.com/wikimedia/restbase/blob/master/lib/ensure_content_type.js#L54 the versions compared are strings, not integers, and [05:05:22] https://www.irccloud.com/pastebin/4mbsRrFB/ [05:05:43] in node [05:05:57] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 24 probes of 373 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [05:06:08] bearND: Isn't JS wonderful? :-) [05:06:18] OK, I'm clocking off too. [05:06:28] good night [05:26:37] RECOVERY - Memory correctable errors -EDAC- on kafka1023 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=kafka1023&var-datasource=eqiad+prometheus/ops [05:44:18] (03PS1) 10BryanDavis: Code formatting and cleanup [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/484848 [05:44:20] (03PS1) 10BryanDavis: Preserve restart attempt timestamps between runs [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/484849 (https://phabricator.wikimedia.org/T107878) [05:50:28] 10Operations, 10DBA, 10Jade, 10TechCom-RFC, and 3 others: Introduce a new namespace for collaborative judgements about wiki entities - https://phabricator.wikimedia.org/T200297 (10Krinkle) [05:52:37] 10Operations, 10DBA, 10Jade, 10TechCom-RFC, and 3 others: Introduce a new namespace for collaborative judgements about wiki entities - https://phabricator.wikimedia.org/T200297 (10Krinkle) ## 16 Jan 2016 - Draft phabricator reply regarding Jade I'm writing to summarise the meeting with the Scoring team ab... 
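The string-comparison theory floated at 05:05 above is easy to reproduce from a shell. The `newer` helper below is purely illustrative (it is not the RESTBase code under discussion) and assumes simple dotted numeric version strings:

```bash
# JavaScript '<' on strings compares lexicographically, so "1.3.10" sorts *before* "1.3.7":
node -e 'console.log("1.3.10" < "1.3.7")'    # prints: true
# Numerically 1.3.10 is the newer version, so a string comparison inverts an
# "is the stored content older than expected?" check. An illustrative numeric
# comparison (hypothetical helper, not the RESTBase implementation):
node -e '
  const newer = (a, b) => {
    // split "x.y.z" into numbers and compare component by component
    const [x, y] = [a, b].map(v => v.split(".").map(Number));
    for (let i = 0; i < Math.max(x.length, y.length); i++) {
      if ((x[i] || 0) !== (y[i] || 0)) return (x[i] || 0) > (y[i] || 0);
    }
    return false;
  };
  console.log(newer("1.3.10", "1.3.7"));     // true
'
```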
[06:10:51] !log Downtime s3 hosts for 2 hours - T213858 [06:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:54] T213858: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 [06:14:07] !log Disable gtid on s3 hosts - T213858 [06:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:58] !log Change s3 topology to get ready for s3 failover - T213858 [06:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:01] T213858: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 [06:26:52] !log Enable GTID back on all hosts but db1075 db1078 - T213858 [06:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:55] T213858: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 [06:28:43] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.009 second response time [06:29:51] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:30:27] (03PS3) 10Marostegui: mariadb: Promote db1078 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/484612 (https://phabricator.wikimedia.org/T213858) [06:30:45] !log Disable puppet on db1075 and db1078 - T213858 [06:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:43] RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 20.63 seconds [06:32:51] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 1.14 seconds [06:32:57] RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 0.23 seconds [06:32:59] RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [06:33:03] RECOVERY - MariaDB Slave Lag: s7 on db2095 is OK: OK slave_sql_lag Replication lag: 0.32 seconds [06:33:09] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [06:33:27] RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 0.36 seconds [06:33:33] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational [06:33:33] RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [06:33:37] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.552 second response time [06:35:32] (03CR) 10Marostegui: mariadb: Promote db1078 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/484612 (https://phabricator.wikimedia.org/T213858) (owner: 10Marostegui) [06:37:08] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1078 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/484612 (https://phabricator.wikimedia.org/T213858) (owner: 10Marostegui) [06:37:57] (03PS4) 10Marostegui: db-eqiad.php: Set s3 to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484613 (https://phabricator.wikimedia.org/T213858) [06:40:03] RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [06:40:57] 10Operations, 10netops: Netbox Dies Mysteriously Sometimes - https://phabricator.wikimedia.org/T214008 (10crusnov) [06:43:35] 10Operations, 10MediaWiki-General-or-Unknown, 10media-storage: Lost file Juan_Guaidó.jpg - https://phabricator.wikimedia.org/T213655 
(10jcrespo) Thanks, yes, as Filippo said above, it has been deleted (and it is available) on swift, but not on metadata. We can do 2 things - reupload it, or perform a deletion... [06:45:31] jynus: In 5 minutes I will merge (not sync) https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/484613/ so I can create the reverts and rebase the other change so we get ahead of CI [06:46:55] shouldn't we merge read-write and switch at the same time? [06:47:10] not necessarily on the same commit [06:47:11] we need to merge db1078 to master first [06:47:21] ? [06:47:30] why "first" [06:47:35] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/484614/ [06:47:36] why not at the same time [06:47:40] yes, [06:47:52] that and the revert, then deploy at the same time [06:48:02] ah [06:48:09] you mean the read-only+db1078 [06:48:14] no [06:48:16] well [06:48:20] it doesn't matter [06:48:31] don't know what you mean :) [06:48:43] more like, switch to rw + promote new master at the same time [06:48:46] ah [06:48:59] to minimize read only time [06:49:03] I think we discussed that a few failovers back and we concluded it was safer to do it on separate ones [06:49:10] ok [06:49:14] but we can [06:49:43] (03CR) 10Marostegui: db-eqiad.php: Set s3 to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484613 (https://phabricator.wikimedia.org/T213858) (owner: 10Marostegui) [06:50:40] Going to merge that one then [06:51:06] We are taking over mediawiki-config deployment for the s3 failover, please talk to us before deploying anything [06:52:26] how do we divide tasks? [06:53:47] I do the failover+deployment and you do the verification and checks? [06:54:04] ok [06:54:07] let me prepare [06:54:17] I will paste the output of the switchover [06:54:27] And once you confirm it is good, we can remove the read only [06:54:36] line 41 ? [06:54:44] yeah [06:54:54] once that is confirmed I will do line 62 and 64 [06:54:59] going to merge but not deploy [06:55:04] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Set s3 to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484613 (https://phabricator.wikimedia.org/T213858) (owner: 10Marostegui) [06:56:11] (03Merged) 10jenkins-bot: db-eqiad.php: Set s3 to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484613 (https://phabricator.wikimedia.org/T213858) (owner: 10Marostegui) [06:56:18] (03PS1) 10Marostegui: Revert "db-eqiad.php: Set s3 to read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484852 [06:56:20] (03CR) 10Marostegui: db-eqiad.php: Promote db1078 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484614 (https://phabricator.wikimedia.org/T213858) (owner: 10Marostegui) [06:56:24] (03PS3) 10Marostegui: db-eqiad.php: Promote db1078 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484614 (https://phabricator.wikimedia.org/T213858) [06:56:44] so rebase the first one, merge the others already? [06:56:51] or at least one [06:56:53] ?
[06:57:01] PROBLEM - HHVM jobrunner on mw1336 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [06:57:14] I am going to: merge read only, commit promote db1078 (but not merge) + rebase the read only off [06:57:26] cool [06:57:34] read only on is merged [06:57:39] I will log in to deployment just in case [06:57:44] yep [06:57:48] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Promote db1078 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484614 (https://phabricator.wikimedia.org/T213858) (owner: 10Marostegui) [06:58:08] but not touch anything unless you mysteriously disappear [06:58:15] RECOVERY - HHVM jobrunner on mw1336 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.008 second response time [06:58:17] yeah [06:58:32] I think we are ready [06:58:42] no errors on mw dbs [06:58:52] (03Merged) 10jenkins-bot: db-eqiad.php: Promote db1078 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484614 (https://phabricator.wikimedia.org/T213858) (owner: 10Marostegui) [06:59:06] (03PS2) 10Marostegui: Revert "db-eqiad.php: Set s3 to read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484852 [06:59:21] RECOVERY - MariaDB Slave Lag: s2 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 254.69 seconds [06:59:37] what is an s3 wiki? [07:00:01] jynus: zuwiki [07:00:04] marostegui and jynus: It is that lovely time of the day again! You are hereby commanded to deploy Important database maintenance. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190117T0700). [07:00:05] jynus: ready to start? [07:00:08] enwikinews [07:00:10] yeah [07:00:14] !log Start s3 failover T213858 [07:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:17] T213858: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 [07:00:19] Going read only then [07:00:56] can edit for now [07:01:01] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Set s3 on read-only T213858 (duration: 00m 31s) [07:01:02] read only set [07:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:12] Warning: The database has been locked for maintenance [07:01:16] so we are ok [07:01:17] ok, starting failover [07:01:54] done: https://phabricator.wikimedia.org/P8001 [07:02:03] some exceptions, but that is life [07:02:08] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Set s3 to read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484852 (owner: 10Marostegui) [07:02:12] ^ not merging, just committing [07:02:20] ready for db1078 to be master on db-eqiad? [07:02:22] if it says successful, it is [07:02:25] (03CR) 10jenkins-bot: db-eqiad.php: Set s3 to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484613 (https://phabricator.wikimedia.org/T213858) (owner: 10Marostegui) [07:02:26] go on [07:02:27] (03CR) 10jenkins-bot: db-eqiad.php: Promote db1078 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484614 (https://phabricator.wikimedia.org/T213858) (owner: 10Marostegui) [07:02:42] ok, deploying [07:03:05] jobs are expected to have temporary issues and be retried [07:03:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Switchover s3master eqiad from db1075 to db1078 T213858 (duration: 00m 30s) [07:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:12] removing read only then?
[07:03:13] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Set s3 to read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484852 (owner: 10Marostegui) [07:03:19] let's double check [07:03:26] db1075 looks like a slave to me [07:03:39] indeed [07:03:41] go on [07:03:43] db1078 read only off [07:03:44] going on [07:04:20] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove s3 read only T213858 (duration: 00m 30s) [07:04:20] we are writable [07:04:22] checking [07:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:43] yeah, it works [07:04:49] checking rcs [07:05:12] not a lot of activity at this time on enwikinews [07:05:20] I can edit zuwiki [07:05:35] yeah [07:05:42] let me check another wiki [07:05:45] what is the most active wiki at this time? [07:06:11] testing eswikivoyage [07:06:13] write activity went back up [07:06:30] maybe ja or zh [07:06:44] eswikivoyage works fine [07:06:48] 10Operations, 10netops: Netbox Dies Mysteriously Sometimes - https://phabricator.wikimedia.org/T214008 (10Peachey88) [07:06:57] yeah, I just want to see edits from other people [07:07:02] yeah [07:07:06] I am checking zu rc [07:07:47] yeah, I can see some on chinese projects [07:08:03] just saw some writes to bewiki [07:08:22] also on ja projects other than wikipedia [07:08:40] there is not a lot of activity individually, so maybe not a lot of impact [07:08:57] checking alerts [07:09:16] none relevant that I can see [07:09:22] https://wikistream.wmflabs.org/ this is a good place to watch, though you might be overwhelmed with wikidata edits [07:09:22] did we run puppet? [07:09:28] not yet [07:09:35] I will do it [07:09:49] apergos: the problem is it is the lower 5-10% of active projects [07:09:56] yeah [07:10:24] with lots of them, so the impact is lower, which means more difficult to debug if "are we 100% healthy" [07:10:41] it sure is [07:10:43] if it was enwiki, we would have people connected here immediately to tell us :-) [07:10:54] indeed. or wikidata too [07:11:05] but we care equally about the top or the bottom 10% :-D [07:11:27] I am going to update tendril for the sake of having it right [07:11:37] yeah, I was going to suggest that [07:11:52] then we can go with semi sync and the small details [07:12:10] please give us a heads up if you see something strange [07:12:19] other than some exceptions a few minutes ago [07:12:20] tendril looking good [07:13:07] "only" 300 exceptions, which are edits that failed, human, bot or jobqueue [07:13:49] read only from 07:01 to 07:04, which is almost as fast as a couple of mw deploys can go [07:14:00] errors are low [07:14:35] s3 db writes are at a similar level [07:14:38] as before [07:14:58] edit rate seems not to be affected [07:15:09] but I would like to see some edit rate per wiki/ per section [07:15:23] 10Operations, 10DBA, 10Patch-For-Review: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) [07:15:38] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Set s3 to read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484852 (owner: 10Marostegui) [07:16:46] mediawikiwiki edits work https://www.mediawiki.org/wiki/Special:RecentChanges?hidebots=1&translations=filter&hidecategorization=1&hideWikibase=1&limit=50&days=7&urlversion=2 [07:17:00] marostegui: jouncebot it might cause wikidata dispatching to scream soon [07:17:10] *jynus [07:17:12] Amir1: what?
[07:17:20] the readonly [07:17:27] no read only on wikidata [07:17:30] jynus: I am going to enable GTID on db1075 and then enable semisync and all that [07:17:39] even if it's done already it might cause issues [07:17:40] or wikitech [07:17:54] Amir1: no, I mean wikidata was never set in read only [07:17:56] jynus: yes, I know but the dispatching script writes to clients [07:17:58] neither was wikitech [07:18:00] ok [07:18:11] !log Enable GTID on db1075 - T213858 [07:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:14] T213858: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 [07:18:21] (changes wbc_entity_usage table and some other things) [07:18:29] I see [07:18:46] can't it retry failed stuff? [07:18:56] as the jobqueue tries [07:19:31] couple of edits seen on el wp [07:19:34] (03PS1) 10Marostegui: wmnet: Update s3 alias [dns] - 10https://gerrit.wikimedia.org/r/484860 (https://phabricator.wikimedia.org/T213858) [07:19:50] (03CR) 10Muehlenhoff: [C: 03+1] admin: allow users to be deployed without ssh keys configured [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey) [07:20:10] not saying it should do that now, Amir1 but maybe a task to be filed for the future [07:20:19] jynus: yes which is normal, but there are some alerts if the median goes high [07:20:25] ok [07:20:39] !log Change thread_pool_stall_limit on db1075 and db1078 - T213858 [07:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:42] if you are commenting only for alerting, thanks, that is helpful [07:20:54] yeah, it's basically it [07:20:54] as we would be worried otherwise [07:21:08] also in future cases :D [07:21:17] https://grafana.wikimedia.org/d/000000156/wikidata-dispatch?refresh=1m&orgId=1 [07:21:20] https://www.wikidata.org/wiki/Special:DispatchStats [07:21:21] sorry, it wasn't clear at first to me what you meant ;-) [07:21:37] Yeah, sorry. I just woke up.
It's so not me [07:21:40] it would be nice to have an s3 only rc stream we could look at someday [07:21:48] oh, I did also :-) [07:22:12] apergos: yeah, that is why for example something like edit rate per section [07:22:20] per wiki maybe too much [07:22:29] but per section would be useful to debug issues [07:22:34] it sure would [07:22:36] maybe it exists but it is not plotted [07:23:16] 10Operations, 10MediaWiki-Cache, 10Patch-For-Review, 10Performance-Team (Radar), 10User-Elukey: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (10elukey) [07:23:19] I have not heard of such a tool tbh [07:24:04] huwiktionary seemed to be the most active s3 project [07:25:18] I don't see performance problems either, but that was expected [07:25:47] (03CR) 10Marostegui: [C: 03+2] wmnet: Update s3 alias [dns] - 10https://gerrit.wikimedia.org/r/484860 (https://phabricator.wikimedia.org/T213858) (owner: 10Marostegui) [07:29:15] well el wiki is moving right along, maybe 10 edits since the dbs went back to r/w [07:29:28] at this hour, it's a lot of edits :-) [07:32:20] 10Operations, 10ops-eqiad, 10Analytics, 10DBA: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10Marostegui) [07:32:25] 10Operations, 10DBA, 10Patch-For-Review: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) 05Open→03Resolved a:03Marostegui This was done: Read only ON at: 07:01:00 Read only OFF at: 07:04:20 Total time read only time: 03:20 minutes If you see something... [07:32:58] so aside from the remaining small pending stuff, we should prepare for the later maintenance [07:33:10] I was doing that now :) [07:33:25] I would like to do a cold copy of db1075 [07:33:41] let's coordinate on databases [07:33:43] a binary one?
[07:33:45] sure [07:33:50] 10Operations, 10ops-eqiad, 10Analytics, 10DBA: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10Marostegui) [07:36:09] 10Operations, 10ops-eqiad, 10Analytics, 10DBA: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10Peachey88) [07:38:33] (03PS2) 10DCausse: [cirrus] Enable CirrusSearchCrossClusterSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484720 (https://phabricator.wikimedia.org/T210381) [07:38:35] (03PS29) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [07:38:37] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484864 [07:40:09] (03PS1) 10ArielGlenn: update freemirror.org mirror url for xml/sql dumps [puppet] - 10https://gerrit.wikimedia.org/r/484865 [07:40:34] (03CR) 10Jcrespo: [C: 03+1] db-eqiad.php: Increase weight for db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484864 (owner: 10Marostegui) [07:41:10] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Increase weight for db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484864 (owner: 10Marostegui) [07:41:30] (03CR) 10ArielGlenn: [C: 03+2] update freemirror.org mirror url for xml/sql dumps [puppet] - 10https://gerrit.wikimedia.org/r/484865 (owner: 10ArielGlenn) [07:42:13] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484864 (owner: 10Marostegui) [07:43:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase weight for db1123 (duration: 00m 53s) [07:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:32] anomie: you can restart s3 migration script [07:43:37] 10Operations, 10DBA, 10Patch-For-Review: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) @anomie you can restart s3 migration script [07:45:02] 10Operations, 10ops-eqiad, 10Analytics, 10DBA: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10Marostegui) db1075, s3 primary master, was failed over to db1078 which is in row C. [07:45:05] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484864 (owner: 10Marostegui) [07:48:26] (03PS1) 10Marostegui: db-eqiad.php: Depool db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484872 (https://phabricator.wikimedia.org/T213859) [07:49:13] (03PS2) 10Marostegui: db-eqiad.php: Depool db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484872 (https://phabricator.wikimedia.org/T213859) [07:56:04] 10Operations, 10ops-eqiad, 10Analytics, 10DBA: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10Marostegui) @RobH is this happening today too along with a3 maintenance or is this finally moved to Tue 22nd? 
[08:12:03] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.33 seconds [08:12:09] 10Operations, 10MediaWiki-Cache, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (10jijiki) [08:12:49] PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.57 seconds [08:12:55] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.18 seconds [08:13:17] PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.66 seconds [08:13:25] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.11 seconds [08:13:33] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.62 seconds [08:13:33] PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.80 seconds [08:13:37] PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.45 seconds [08:13:43] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.26 seconds [08:24:55] !log Disabling puppet on rdb1005 and switch redis::misc::master to rdb1006 - T213859 [08:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:58] T213859: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 [08:27:44] (03CR) 10Effie Mouzeli: [C: 03+2] role::eqiad::scb: Switch rdb1006 to redis::misc::master [puppet] - 10https://gerrit.wikimedia.org/r/484572 (https://phabricator.wikimedia.org/T213859) (owner: 10Effie Mouzeli) [08:27:58] (03PS3) 10Effie Mouzeli: role::eqiad::scb: Switch rdb1006 to redis::misc::master [puppet] - 10https://gerrit.wikimedia.org/r/484572 (https://phabricator.wikimedia.org/T213859) [08:28:26] !log Drop table tag_summary from enwiki - T212255 [08:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:31] T212255: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 [08:31:35] !log Deploy schema change on s3 codfw, lag will be generated - T85757 [08:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:38] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [08:32:29] !log stop, upgrade and restart db1075 [08:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:47] !log Restarting nutcracker on scb100* for 484572 - T213859 [08:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:51] T213859: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 [08:36:20] (03PS1) 10Mathew.onipe: wdqs: convert prom exporter script tp py3 [puppet] - 10https://gerrit.wikimedia.org/r/484974 (https://phabricator.wikimedia.org/T213305) [08:36:52] (03CR) 10jerkins-bot: [V: 04-1] wdqs: convert prom exporter script tp py3 [puppet] - 10https://gerrit.wikimedia.org/r/484974 (https://phabricator.wikimedia.org/T213305) (owner: 10Mathew.onipe) [08:37:40] !log installing remaining systemd security updates on stretch [08:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:36] !log Enabling puppet on rdb1005 and switch redis::misc::master 
to rdb1006 - T213859 [08:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:40] T213859: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 [08:44:40] (03CR) 10GTirloni: [C: 03+1] Code formatting and cleanup [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/484848 (owner: 10BryanDavis) [08:45:05] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:47:22] (03CR) 10GTirloni: [C: 03+1] Preserve restart attempt timestamps between runs [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/484849 (https://phabricator.wikimedia.org/T107878) (owner: 10BryanDavis) [08:53:47] PROBLEM - puppet last run on db2084 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [08:56:26] 10Puppet, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: ORES services should bind to ores config files - https://phabricator.wikimedia.org/T210719 (10Ladsgroup) My biggest problem is that it's not documented that config changes in puppet doesn't cause the service to restart.... [08:58:37] 10Puppet, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: ORES services should bind to ores config files - https://phabricator.wikimedia.org/T210719 (10Ladsgroup) 05Open→03Declined In favor of {T213743} [08:58:59] RECOVERY - puppet last run on db2084 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [08:59:19] (03PS3) 10Elukey: Configure analytics1028->41 as Hadoop Analytics test cluster [puppet] - 10https://gerrit.wikimedia.org/r/484374 (https://phabricator.wikimedia.org/T212256) [09:00:39] PROBLEM - DPKG on db1101 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:02:12] !log rolling NIC firmware upgrade cp[1081-1090] - T203194 [09:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:15] T203194: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 [09:04:21] RECOVERY - DPKG on db1101 is OK: All packages OK [09:04:55] RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 58.93 seconds [09:05:11] RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 56.07 seconds [09:05:23] RECOVERY - MariaDB Slave Lag: s2 on db2091 is OK: OK slave_sql_lag Replication lag: 51.20 seconds [09:05:49] RECOVERY - MariaDB Slave Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 46.91 seconds [09:05:55] RECOVERY - MariaDB Slave Lag: s2 on db2041 is OK: OK slave_sql_lag Replication lag: 45.49 seconds [09:06:01] RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 42.24 seconds [09:06:07] RECOVERY - MariaDB Slave Lag: s2 on db2063 is OK: OK slave_sql_lag Replication lag: 41.89 seconds [09:17:16] (03CR) 10Elukey: [C: 03+2] Configure analytics1028->41 as Hadoop Analytics test cluster [puppet] - 10https://gerrit.wikimedia.org/r/484374 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [09:18:51] !log Deploy schema change on db1095:3313 - T85757 [09:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:54] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [09:19:09] 10Operations, 10ORES, 10Scoring-platform-team, 10Release Pipeline (Blubber), 
10Release-Engineering-Team (Backlog): The continuous release pipeline should support more than one service per repo - https://phabricator.wikimedia.org/T210267 (10Ladsgroup) >>! In T210267#4873881, @thcipriani wrote: > I think t... [09:21:46] (03Abandoned) 10Ladsgroup: ores: Notify ores services when the config changes [puppet] - 10https://gerrit.wikimedia.org/r/476850 (https://phabricator.wikimedia.org/T210719) (owner: 10Ladsgroup) [09:24:35] !log power off graphite1003 for later hw maintenance (T213859) [09:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:38] T213859: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 [09:25:36] !log Poweroff dbstore1003 for hw maintenance T213859 [09:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:44] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10MoritzMuehlenhoff) [09:30:59] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10Marostegui) [09:35:21] 10Operations, 10Core Platform Team, 10Performance-Team, 10serviceops, and 3 others: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Mainframe98) This should probably go in Tech News, as the HHVM beta feature was too. It could repurpose from [[ https://meta.wiki... [09:35:58] (03PS2) 10Mathew.onipe: wdqs: convert prom exporter script tp py3 [puppet] - 10https://gerrit.wikimedia.org/r/484974 (https://phabricator.wikimedia.org/T213305) [09:36:42] (03PS4) 10Muehlenhoff: Enable base::service_auto_restart for uwsgi-striker [puppet] - 10https://gerrit.wikimedia.org/r/483114 (https://phabricator.wikimedia.org/T135991) [09:39:58] 10Operations, 10Core Platform Team, 10Performance-Team, 10serviceops, and 3 others: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Joe) Very good point @Mainframe98 - in fact I was planning to write an email to wikitech-l once the beta feature is set up and I... [09:48:49] (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for uwsgi-striker [puppet] - 10https://gerrit.wikimedia.org/r/483114 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:48:55] 10Operations, 10netops: Netbox Dies Mysteriously Sometimes - https://phabricator.wikimedia.org/T214008 (10elukey) @crusnov hi! I think this is the same issue as T212697 [09:51:26] 10Operations, 10Core Platform Team, 10Performance-Team, 10serviceops, and 3 others: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Aklapper) > ` > Please [[phab:|report bugs]] if you see them. > ` To notify Tech News, see #notice. Small nitpick: Please avoid l... [09:54:15] 10Operations, 10Core Platform Team, 10Performance-Team, 10serviceops, and 3 others: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Joe) @Krinkle did mention he saw a couple fatal errors that looked worrisome, so I'd wait for him to comment before backporting t... [09:57:22] 10Operations, 10Core Platform Team, 10Performance-Team, 10serviceops, and 3 others: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Mainframe98) >>! In T213934#4887396, @Aklapper wrote: >> ` >> Please [[phab:|report bugs]] if you see them. >> ` > [...] Small ni... 
[09:58:56] (03PS1) 10Marostegui: db-eqiad.php: Depool hosts for a2 rack maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484987 (https://phabricator.wikimedia.org/T213748) [09:59:29] !log Poweroff dbproxy1001 dbproxy1002 dbproxy1003 for a3 maintenance - T213859 [09:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:32] T213859: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 [10:00:27] 10Operations, 10Core Platform Team, 10Performance-Team, 10serviceops, and 3 others: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Joe) >>! In T213934#4887401, @Mainframe98 wrote: >>>! In T213934#4887396, @Aklapper wrote: >>> ` >>> Please [[phab:|report bugs]]... [10:01:22] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10Marostegui) [10:01:36] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484872 (https://phabricator.wikimedia.org/T213859) (owner: 10Marostegui) [10:02:41] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484872 (https://phabricator.wikimedia.org/T213859) (owner: 10Marostegui) [10:02:53] RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 52.58 seconds [10:03:56] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484872 (https://phabricator.wikimedia.org/T213859) (owner: 10Marostegui) [10:04:14] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1103 - T213859 (duration: 00m 53s) [10:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:58] (03PS1) 10Gilles: Add pycountry to analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/484994 (https://phabricator.wikimedia.org/T209857) [10:06:17] (03CR) 10Alexandros Kosiaris: [C: 04-1] scb: enable statsd_exporter and add matching rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484586 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [10:08:04] (03PS2) 10Marostegui: db-eqiad.php: Depool hosts for a2 rack maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484987 (https://phabricator.wikimedia.org/T213748) [10:08:19] !log installing krb5 security updates on trusty [10:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:07] !log Stop MySQL on db1103:3312 and db1103:3314, also poweroff the server - T213859 [10:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:10] T213859: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 [10:12:09] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10Marostegui) [10:13:19] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10Marostegui) Databases involved are fully ready for this maintenance. They are all off but pc1004 which is not reachable and not powered off per T213859#4883727 but it... 
[10:14:33] 10Operations, 10monitoring, 10Patch-For-Review: Remove Diamond from production - https://phabricator.wikimedia.org/T212231 (10MoritzMuehlenhoff) [10:15:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] "almost there. Minor comment inline, rest LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483222 (owner: 10Dzahn) [10:18:05] (03PS1) 10Elukey: profile::hadoop:monitoring::journalnode: use contain [puppet] - 10https://gerrit.wikimedia.org/r/485000 (https://phabricator.wikimedia.org/T212256) [10:23:26] (03Abandoned) 10Elukey: profile::hadoop:monitoring::journalnode: use contain [puppet] - 10https://gerrit.wikimedia.org/r/485000 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [10:24:15] (03CR) 10Jcrespo: [C: 03+1] db-eqiad.php: Depool hosts for a2 rack maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484987 (https://phabricator.wikimedia.org/T213748) (owner: 10Marostegui) [10:27:30] jouncebot: next [10:27:30] In 1 hour(s) and 32 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190117T1200) [10:29:44] !log installing ruby-loofah security updates [10:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:10] !log T213859 icinga downtime cloudservices1004 for 1 day [10:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:13] T213859: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 [10:30:45] 10Operations, 10monitoring, 10Graphite: Duplicate definitions found in Icinga configuration - https://phabricator.wikimedia.org/T211692 (10Volans) Bump, this is still happening. [10:39:11] !log installing libcaca security updates [10:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:45] (03PS1) 10Joal: Add is_pageview to webrequest turnilo datasource [puppet] - 10https://gerrit.wikimedia.org/r/485002 (https://phabricator.wikimedia.org/T212778) [10:40:54] elukey: if you have a minute --^ [10:41:33] lol, do we really use libcaca? [10:42:06] vgutierrez, moritzm - no shit? [10:42:18] joal: can you modify the commit msg to "turnilo: add is_pageview to webrequest turnilo datasource" ? I'll merge after that :) [10:42:31] doing elukey [10:43:24] (03PS2) 10Joal: turnilo: Add is_pageview to webrequest datasource [puppet] - 10https://gerrit.wikimedia.org/r/485002 (https://phabricator.wikimedia.org/T212778) [10:43:40] elukey: --^ [10:43:59] vgutierrez: yeah, it gets pulled in via SDL, which is used by a number of packages [10:44:08] that makes sense [10:44:10] joal: <3 [10:44:35] (03PS3) 10Elukey: turnilo: add is_pageview to webrequest datasource [puppet] - 10https://gerrit.wikimedia.org/r/485002 (https://phabricator.wikimedia.org/T212778) (owner: 10Joal) [10:44:41] (03CR) 10Elukey: [V: 03+2 C: 03+2] turnilo: add is_pageview to webrequest datasource [puppet] - 10https://gerrit.wikimedia.org/r/485002 (https://phabricator.wikimedia.org/T212778) (owner: 10Joal) [10:44:59] joal: yeah, it's quite childish. it's a fork/improved alternative to libaa .... [10:45:54] moritzm: Having a kid of 4, it actually doesn't bother me anymore - I make jokes with it :) [10:46:15] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 405, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:46:59] yey..
looks like AS6939 is back :) [10:52:27] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool hosts for a2 rack maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484987 (https://phabricator.wikimedia.org/T213748) (owner: 10Marostegui) [10:53:19] (03PS1) 10Muehlenhoff: Remove Diamond from Kafka hosts [puppet] - 10https://gerrit.wikimedia.org/r/485004 (https://phabricator.wikimedia.org/T212231) [10:53:33] (03Merged) 10jenkins-bot: db-eqiad.php: Depool hosts for a2 rack maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484987 (https://phabricator.wikimedia.org/T213748) (owner: 10Marostegui) [10:54:46] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool DBs on A2 rack T213748 (duration: 00m 54s) [10:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:50] T213748: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 [10:54:51] !log Stop MySQL on db1082 db1081 db1080 db1079 db1075 db1074 es1012 es1011 - T213748 [10:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:23] !log Lag will be generated on labs due to maintenance on sanitarium db masters [10:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:06] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10aborrero) >>! In T213859#4883309, @bd808 wrote: > **cloudservices1004** is the hot-spare in the cloudservices100[34] cluster supporting the eqiad1-r region of our Open... [11:04:55] (03CR) 10jenkins-bot: db-eqiad.php: Depool hosts for a2 rack maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484987 (https://phabricator.wikimedia.org/T213748) (owner: 10Marostegui) [11:09:23] !log stop eventlogging on eventlog1002 and eventlogging replication on db1108 as prep step for db1107 maintenance [11:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:31] (03CR) 10Elukey: [C: 03+1] Remove Diamond from Kafka hosts [puppet] - 10https://gerrit.wikimedia.org/r/485004 (https://phabricator.wikimedia.org/T212231) (owner: 10Muehlenhoff) [11:11:38] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (10Vgutierrez) firmware upgrade completed for all the affected systems. [11:12:07] PROBLEM - Hadoop NodeManager on analytics1031 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [11:13:36] this is the testing cluster, it should be silenced [11:13:37] sigh [11:16:52] PROBLEM - MariaDB Slave Lag: s1 on db2048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.29 seconds [11:16:52] PROBLEM - MariaDB Slave Lag: s1 on db2092 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.32 seconds [11:16:53] !log shutdown elastic103[0-5] to prepare for T213859 [11:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:56] T213859: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 [11:17:16] PROBLEM - MariaDB Slave Lag: s1 on db2055 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.99 seconds [11:17:18] PROBLEM - Check systemd state on analytics1028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:17:20] PROBLEM - MariaDB Slave Lag: s1 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.41 seconds [11:17:28] PROBLEM - MariaDB Slave Lag: s1 on db2085 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.81 seconds [11:17:28] PROBLEM - MariaDB Slave Lag: s1 on db2071 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.33 seconds [11:17:30] PROBLEM - MariaDB Slave Lag: s1 on db2072 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.63 seconds [11:17:32] PROBLEM - MariaDB Slave Lag: s1 on db2070 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.45 seconds [11:17:36] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 5 minutes ago with 6 failures. Failed resources (up to 3 shown): Exec[hadoop-namenode-format],Exec[create_hdfs_user_directories],Exec[hdfs_put_mysql-analytics-research-client-pw.txt],Exec[hdfs_put_mysql-analytics-labsdb-client-pw.txt] [11:17:44] PROBLEM - Hadoop NodeManager on analytics1034 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [11:18:06] sorry again [11:20:00] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Service[hadoop-yarn-resourcemanager],Exec[hadoop-namenode-format],Exec[cdh::hadoop::directory /user/druid] [11:25:00] (03PS1) 10Elukey: Fix typos in Hadoop Test cluster configuration [puppet] - 10https://gerrit.wikimedia.org/r/485006 (https://phabricator.wikimedia.org/T212256) [11:25:51] (03CR) 10Elukey: [C: 03+2] Fix typos in Hadoop Test cluster configuration [puppet] - 10https://gerrit.wikimedia.org/r/485006 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [11:29:35] (03PS1) 10Vgutierrez: acme_requests: Handle TCP/HTTPS errors [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/485007 (https://phabricator.wikimedia.org/T209980) [11:29:37] (03PS1) 10Vgutierrez: certcentral: Bump acme to the latest version shipped in stretch-backports [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/485008 (https://phabricator.wikimedia.org/T213820) [11:29:39] (03PS1) 10Vgutierrez: certcentral: Bump josepy to the latest version shipped in stretch-bp [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/485009 (https://phabricator.wikimedia.org/T213820) [11:29:41] (03PS1) 10Vgutierrez: certcentral: Allow specifying authorized hosts and regex in the config [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/485010 (https://phabricator.wikimedia.org/T213301) [11:29:43] (03PS1) 10Vgutierrez: Release 0.8 [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/485011 (https://phabricator.wikimedia.org/T209980) [11:30:53] (03CR) 10jerkins-bot: [V: 04-1] acme_requests: Handle TCP/HTTPS errors [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/485007 (https://phabricator.wikimedia.org/T209980) (owner: 10Vgutierrez) [11:33:18] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] "tests error caused by CI container upgrade" [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/485007 (https://phabricator.wikimedia.org/T209980) (owner: 10Vgutierrez) [11:33:43] (03CR) 10Vgutierrez: [C: 03+2] certcentral: Bump acme to the latest version shipped in stretch-backports [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/485008 (https://phabricator.wikimedia.org/T213820) (owner: 10Vgutierrez) [11:33:52] (03CR) 
10Vgutierrez: [C: 03+2] certcentral: Bump josepy to the latest version shipped in stretch-bp [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/485009 (https://phabricator.wikimedia.org/T213820) (owner: 10Vgutierrez) [11:34:00] (03CR) 10Vgutierrez: [C: 03+2] certcentral: Allow specifying authorized hosts and regex in the config [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/485010 (https://phabricator.wikimedia.org/T213301) (owner: 10Vgutierrez) [11:34:09] (03CR) 10Vgutierrez: [C: 03+2] Release 0.8 [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/485011 (https://phabricator.wikimedia.org/T209980) (owner: 10Vgutierrez) [11:34:18] RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 1.29 seconds [11:34:26] RECOVERY - MariaDB Slave Lag: s7 on db2095 is OK: OK slave_sql_lag Replication lag: 0.17 seconds [11:34:30] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 0.13 seconds [11:34:44] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 0.28 seconds [11:34:47] (03CR) 10jenkins-bot: acme_requests: Handle TCP/HTTPS errors [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/485007 (https://phabricator.wikimedia.org/T209980) (owner: 10Vgutierrez) [11:34:52] RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [11:35:00] RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [11:35:04] RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [11:35:06] RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 0.37 seconds [11:35:06] RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [11:35:40] (03CR) 10jenkins-bot: certcentral: Bump acme to the latest version shipped in stretch-backports [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/485008 (https://phabricator.wikimedia.org/T213820) (owner: 10Vgutierrez) [11:35:43] (03CR) 10jenkins-bot: certcentral: Bump josepy to the latest version shipped in stretch-bp [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/485009 (https://phabricator.wikimedia.org/T213820) (owner: 10Vgutierrez) [11:35:50] (03CR) 10jenkins-bot: Release 0.8 [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/485011 (https://phabricator.wikimedia.org/T209980) (owner: 10Vgutierrez) [11:35:56] (03CR) 10jenkins-bot: certcentral: Allow specifying authorized hosts and regex in the config [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/485010 (https://phabricator.wikimedia.org/T213301) (owner: 10Vgutierrez) [11:36:30] PROBLEM - MariaDB Slave Lag: s1 on db2062 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 498.62 seconds [11:36:30] (03PS1) 10Elukey: network::constants: add Hadoop testing masters [puppet] - 10https://gerrit.wikimedia.org/r/485012 (https://phabricator.wikimedia.org/T212256) [11:36:45] !log mvolz@deploy1001 scap-helm zotero upgrade production -f zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: codfw] [11:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:47] !log mvolz@deploy1001 scap-helm zotero cluster codfw completed [11:36:47] !log mvolz@deploy1001 scap-helm zotero finished [11:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:49] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:52] PROBLEM - ElasticSearch health check for shards on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 1748 threshold =0.15 breach: status: yellow, number_of_nodes: 29, unassigned_shards: 1708, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3219, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 4, active_shar [11:38:52] ber: 81.9704331645, active_shards: 7929, initializing_shards: 36, number_of_data_nodes: 29, delayed_unassigned_shards: 0 [11:38:59] (03CR) 10Elukey: [C: 03+2] network::constants: add Hadoop testing masters [puppet] - 10https://gerrit.wikimedia.org/r/485012 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [11:39:35] onimisionipe: --^ [11:39:36] health check is expected [11:39:41] ah good :) [11:39:42] (03PS1) 10Vgutierrez: debian: Add release 0.8 to changelog [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/485014 (https://phabricator.wikimedia.org/T209980) [11:39:44] hello dcausse ! [11:39:49] elukey: hey! [11:40:03] Oops... [11:40:08] elukey: that's me [11:40:41] onimisionipe: np! I just pinged you as FYI, I saw the !log before :) [11:43:03] !log Poweroff db1082 db1081 db1080 db1079 db1075 db1074 es1012 es1011 - T213748 [11:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:06] T213748: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 [11:43:43] 10Operations, 10ops-eqiad, 10Analytics, 10DBA, 10Patch-For-Review: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10Marostegui) All the systems owned by the DBAs are now off. [11:44:04] 10Operations, 10hardware-requests: Two test hosts for SREs - https://phabricator.wikimedia.org/T214024 (10MoritzMuehlenhoff) [11:51:07] (03PS1) 10Ladsgroup: Add 'yue' to langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485015 (https://phabricator.wikimedia.org/T211530) [11:53:37] (03CR) 10Ladsgroup: "Please take a look and tell me if having two entries pointing to the same language is not okay. I will merge and deploy this tomorrow." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485015 (https://phabricator.wikimedia.org/T211530) (owner: 10Ladsgroup) [11:55:00] !log mvolz@deploy1001 scap-helm zotero upgrade staging -f zotero-values-staging.yaml stable/zotero [namespace: zotero, clusters: staging] [11:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:06] !log T209527 copy nfsd-ldap between jessie-wikimedia and stretch-wikimedia in reprepro. 
It will require a rebuild though bc updated build-deps/deps [11:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:09] T209527: Set up scratch and maps NFS services on cloudstore1008/9 - https://phabricator.wikimedia.org/T209527 [11:55:26] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.44 seconds [11:55:40] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.67 seconds [11:55:46] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.66 seconds [11:55:56] PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.71 seconds [11:55:58] PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.35 seconds [11:56:02] PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.35 seconds [11:56:04] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.68 seconds [11:56:11] !log mvolz@deploy1001 scap-helm zotero upgrade staging -f zotero-values-staging.yaml --version=0.0.1 stable/zotero [namespace: zotero, clusters: staging] [11:56:12] !log mvolz@deploy1001 scap-helm zotero cluster staging completed [11:56:12] !log mvolz@deploy1001 scap-helm zotero finished [11:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:26] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.76 seconds [11:56:34] PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 322.23 seconds [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190117T1200). [12:00:04] dcausse, matthiasmullie, and addshore: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:15] check [12:00:17] o/ [12:00:42] o/ [12:01:17] I suppose I can SWAT :) [12:01:58] o/ [12:02:00] dcausse: nice :) [12:02:09] (I'm around, if you need me) [12:02:58] thanks! [12:03:10] sigh... only code changes... 
[12:04:02] !log mvolz@deploy1001 scap-helm zotero upgrade production -f zotero-values-eqiad.yaml stable/zotero [namespace: zotero, clusters: eqiad] [12:04:03] !log mvolz@deploy1001 scap-helm zotero cluster eqiad completed [12:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:03] !log mvolz@deploy1001 scap-helm zotero finished [12:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:24] RECOVERY - ElasticSearch health check for shards on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 29, unassigned_shards: 1417, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3219, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 85 [12:08:24] ve_shards: 8227, initializing_shards: 29, number_of_data_nodes: 29, delayed_unassigned_shards: 0 [12:08:36] !log stop mariadb and shutdown db1107 to ease rack a3 maintenance [12:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:01] elukey: wasn't db1107 on rack 2? [12:09:14] !log mvolz@deploy1001 scap-helm zotero upgrade production -f zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: codfw] [12:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:15] !log mvolz@deploy1001 scap-helm zotero cluster codfw completed [12:09:16] !log mvolz@deploy1001 scap-helm zotero finished [12:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:37] _joe_: have there been any issues with logs making their way to logstash from mediawiki in the last 48 hours? [12:09:39] marostegui: yes sigh, I am amending the SAL, typo [12:09:51] elukey: Just asking because I got confused many times XD [12:09:57] RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 25.56 seconds [12:10:00] <_joe_> addshore: not that I'm aware of [12:10:04] RECOVERY - MariaDB Slave Lag: s7 on db2095 is OK: OK slave_sql_lag Replication lag: 10.18 seconds [12:10:10] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [12:10:13] <_joe_> addshore: but I'm going to lunch right now, someone else can take a look [12:10:24] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 0.21 seconds [12:10:28] _joe_: ack, who would be good? :) [12:10:34] RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 0.53 seconds [12:10:40] RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [12:10:44] RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 0.49 seconds [12:10:46] RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [12:10:50] RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [12:10:51] <_joe_> addshore: anyone in SRE really [12:11:07] <_joe_> a ticket might help your cause :P [12:11:15] _joe_: I'll go write one! 
[12:11:32] PROBLEM - haproxy failover on dbproxy1004 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [12:11:38] !log poweroff ms-be1019 / ms-be1044 / ms-be1045 before A2 maint - T213748 [12:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:41] T213748: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 [12:11:50] PROBLEM - haproxy failover on dbproxy1009 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [12:11:54] addshore: yes please a task to wikimedia-logstash would be great, I'll take a look [12:11:57] the proxy is db1107 going down [12:12:02] RECOVERY - MariaDB Slave Lag: s1 on db2092 is OK: OK slave_sql_lag Replication lag: 57.56 seconds [12:12:04] RECOVERY - MariaDB Slave Lag: s1 on db2048 is OK: OK slave_sql_lag Replication lag: 58.66 seconds [12:12:16] RECOVERY - MariaDB Slave Lag: s1 on db2062 is OK: OK slave_sql_lag Replication lag: 54.61 seconds [12:12:34] RECOVERY - MariaDB Slave Lag: s1 on db2055 is OK: OK slave_sql_lag Replication lag: 44.77 seconds [12:12:36] RECOVERY - MariaDB Slave Lag: s1 on db2088 is OK: OK slave_sql_lag Replication lag: 44.56 seconds [12:12:48] RECOVERY - MariaDB Slave Lag: s1 on db2085 is OK: OK slave_sql_lag Replication lag: 39.90 seconds [12:12:50] RECOVERY - MariaDB Slave Lag: s1 on db2071 is OK: OK slave_sql_lag Replication lag: 38.31 seconds [12:12:52] RECOVERY - MariaDB Slave Lag: s1 on db2072 is OK: OK slave_sql_lag Replication lag: 38.51 seconds [12:12:56] RECOVERY - MariaDB Slave Lag: s1 on db2070 is OK: OK slave_sql_lag Replication lag: 35.59 seconds [12:17:36] !log poweroff rdb1005.eqiad.wmnet before A3 maint - T213859 [12:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:39] T213859: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 [12:19:01] !log killing migrateActors.php --wiki=ptwiki on mwmaint, was using outdated db config T188327 [12:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:04] T188327: Deploy refactored actor storage - https://phabricator.wikimedia.org/T188327 [12:20:02] PROBLEM - Host rdb1005 is DOWN: PING CRITICAL - Packet loss = 100% [12:22:20] jijiki: you may want to downtime or ack that? [12:22:28] (03PS1) 10Vgutierrez: CI: Run tests against py{35,36,37} with min and latest deps [software/certcentral] - 10https://gerrit.wikimedia.org/r/485017 (https://phabricator.wikimedia.org/T213820) [12:23:10] 10Operations, 10Wikimedia-Logstash, 10User-Addshore: Investigate missing WikibaseQualityConstraints logs in logstash. - https://phabricator.wikimedia.org/T214031 (10Addshore) [12:23:12] godog: filed as https://phabricator.wikimedia.org/T214031 [12:23:15] im slightly confused :) [12:23:26] but i can see some logs on mwlog1001, but they are not on logstash [12:23:55] addshore: ok thanks! I'll take a look later today [12:23:58] thanks! 
[12:25:05] !log poweroff restbase1010 / restbase1011 before A3 maint - T213859 [12:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:08] T213859: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 [12:25:28] matthiasmullie: CI failed, I've re-sent a +2 :( [12:25:53] yeah thanks :) [12:27:24] jynus: I had a scheduled downtime [12:27:53] let me check, maybe I did something wrong [12:28:03] (03CR) 10jerkins-bot: [V: 04-1] CI: Run tests against py{35,36,37} with min and latest deps [software/certcentral] - 10https://gerrit.wikimedia.org/r/485017 (https://phabricator.wikimedia.org/T213820) (owner: 10Vgutierrez) [12:28:12] jijiki: maybe it failed [12:28:18] I was mentioning it because it alerted [12:28:27] yes thank you very much jynus [12:28:30] PROBLEM - Host rdb1005 is DOWN: PING CRITICAL - Packet loss = 100% [12:28:48] likely there will be a logstash udp loss alert coming up, known/expected [12:29:53] !log dcausse@deploy1001 Synchronized php-1.33.0-wmf.12/extensions/CirrusSearch/: Hack around cross cluster search bug (duration: 01m 00s) [12:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:08] hmm I can see the attached downtime [12:30:14] what am I missing [12:30:24] jynus: that's part of a3 maintenance (rdb1005) [12:31:33] jijiki: let me see [12:32:26] oh [12:32:32] I think I missed the ping [12:32:41] I see [12:32:44] indeed [12:32:47] !log akosiaris@deploy1001 Started deploy [citoid/deploy@269c9c7]: (no justification provided) [12:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:50] services are down, but not "host" [12:33:00] doesn't the script do everything? [12:33:19] jijiki: no big deal anyway [12:33:34] !log akosiaris@deploy1001 Finished deploy [citoid/deploy@269c9c7]: (no justification provided) (duration: 00m 48s) [12:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:05] !log shutting down relforge1001 for PDU swap - T213859 [12:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:08] T213859: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 [12:37:42] onimisionipe: ^^^ [12:38:31] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 4 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10Addshore) So there was another flood in the past 48 hours. This can be seen on the WBQC dashboard @ https://grafana.wikimedia.or... [12:39:29] dcausse: you were swatting? [12:39:33] (03PS2) 10Vgutierrez: CI: Run tests with minimum and latest dependencies [software/certcentral] - 10https://gerrit.wikimedia.org/r/485017 (https://phabricator.wikimedia.org/T213820) [12:39:48] addshore: I'm still swatting [12:40:13] 4 patches that need to run CI :( [12:40:24] counting yours [12:40:31] dcausse: mind if I also hit +2 on https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/484979/ which will need to run CI ? [12:40:34] !log dcausse@deploy1001 Synchronized php-1.33.0-wmf.13/extensions/CirrusSearch/: Hack around cross cluster search bug (duration: 00m 59s) [12:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:46] addshore: no problem [12:40:56] done, thanks! :) it shouldn't get in the way [12:41:02] but wikibase CI is long!
[12:41:13] !log poweroff kubernetes1001 - T213859 [12:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:16] T213859: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 [12:41:40] (03CR) 10jerkins-bot: [V: 04-1] CI: Run tests with minimum and latest dependencies [software/certcentral] - 10https://gerrit.wikimedia.org/r/485017 (https://phabricator.wikimedia.org/T213820) (owner: 10Vgutierrez) [12:41:45] !log imported nfsd-ldap_1.2+deb9u1 in stretch-wikimedia (T209527) [12:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:54] T209527: Set up scratch and maps NFS services on cloudstore1008/9 - https://phabricator.wikimedia.org/T209527 [12:42:23] matthiasmullie: live on mwdebug1002 [12:44:45] (03CR) 10Arturo Borrero Gonzalez: apt: repository: trust also the source repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483140 (owner: 10Arturo Borrero Gonzalez) [12:45:13] dcausse: perfect, it works [12:45:21] matthiasmullie: great, shipping [12:46:56] !log dcausse@deploy1001 Synchronized php-1.33.0-wmf.13/extensions/UploadWizard/: T214007: Don't reuse existing input object (duration: 00m 53s) [12:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:59] T214007: Clicking 'Add a caption in another language' removes the input box for the previous caption - https://phabricator.wikimedia.org/T214007 [12:47:17] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484720 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [12:47:59] matthiasmullie: it's live [12:48:23] (03Merged) 10jenkins-bot: [cirrus] Enable CirrusSearchCrossClusterSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484720 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [12:48:32] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([kubernetes1001.eqiad.wmnet]) [12:48:38] (03CR) 10jenkins-bot: [cirrus] Enable CirrusSearchCrossClusterSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484720 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [12:48:39] dcausse: okay, thanks! [12:49:20] yw! [12:50:20] mmmm that pybal ipvs check, kubernetes1001 have been powered off, do i need to do something else? [12:51:40] PROBLEM - Host ganeti1007 is DOWN: PING CRITICAL - Packet loss = 100% [12:52:30] (03CR) 10Vgutierrez: [C: 04-1] "pip version of cryptography 1.7.1 lacks https://github.com/pyca/cryptography/commit/6e7ea2e73e3baf31541c9533dc621d8913152848 so it's faili" [software/certcentral] - 10https://gerrit.wikimedia.org/r/485017 (https://phabricator.wikimedia.org/T213820) (owner: 10Vgutierrez) [12:53:41] !log dcausse@deploy1001 Synchronized wmf-config/CirrusSearch-production.php: T210381: [cirrus] Enable CirrusSearchCrossClusterSearch (duration: 00m 51s) [12:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:47] T210381: Update mw-config to use the psi&omega elastic clusters - https://phabricator.wikimedia.org/T210381 [12:55:01] addshore: I'm done [12:55:07] dcausse: amazing! [12:55:32] addshore: is your patch ready, can I help ? 
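(The "live on mwdebug1002 ... it works ... shipping" exchange above is the usual SWAT verification loop. A rough sketch; the X-Wikimedia-Debug header value and the exact scap subcommand are assumptions, since the log only shows the resulting "Synchronized ..." lines.)
    # stage the merged change on the debug-only appserver
    ssh mwdebug1002.eqiad.wmnet 'scap pull'
    # exercise the change there by pinning the request to that backend
    curl -s -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' \
        'https://commons.wikimedia.org/wiki/Special:UploadWizard' >/dev/null
    # once verified, sync to the whole fleet from deploy1001
    scap sync-file php-1.33.0-wmf.13/extensions/UploadWizard/ "T214007: Don't reuse existing input object"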
[12:55:40] dcausse: I will do it :) [12:55:49] addshore: ok, thanks :) [12:57:18] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([kubernetes1001.eqiad.wmnet]) [12:58:17] !log addshore@deploy1001 Synchronized php-1.33.0-wmf.13/extensions/Wikibase/view/resources/jquery/wikibase/jquery.wikibase.badgeselector.js: T213998 Fix js type error when adding badges to items (duration: 00m 53s) [12:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:21] T213998: TypeError: this.options.badges.map is not a function - https://phabricator.wikimedia.org/T213998 [12:58:27] dcausse: I think that is everything dones then? [12:58:49] !log fsero@puppetmaster1001 conftool action : set/pooled=no; selector: name=kubernetes1001.eqiad.wmnet [12:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:55] addshore: yes [12:59:00] !log swat done! [12:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190117T1300) [13:03:20] :10 [13:03:27] (03PS1) 10GTirloni: Rebuild for Stretch and add .gitreview [debs/nfsd-ldap] - 10https://gerrit.wikimedia.org/r/485026 (https://phabricator.wikimedia.org/T209527) [13:04:53] (03CR) 10GTirloni: [C: 03+2] Rebuild for Stretch and add .gitreview [debs/nfsd-ldap] - 10https://gerrit.wikimedia.org/r/485026 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [13:05:24] PROBLEM - eventlogging_sync processes on db1108 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [13:05:58] PROBLEM - Check status of defined EventLogging jobs on eventlog1002 is CRITICAL: CRITICAL: Stopped EventLogging jobs: eventlogging-consumer@mysql-m4-master-00 eventlogging-consumer@mysql-eventbus [13:06:11] ^ that is due to the maintenance on the rack [13:08:22] RECOVERY - Juniper alarms on asw-a-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [13:08:30] RECOVERY - Juniper alarms on asw2-a-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [13:18:09] PROBLEM - Host ps1-a3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [13:33:15] PROBLEM - Host db1103.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:34:07] PROBLEM - Host dbproxy1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:34:09] PROBLEM - Juniper alarms on asw-a-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [13:34:13] PROBLEM - Host analytics1056.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:34:13] PROBLEM - Juniper alarms on asw2-a-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [13:34:39] PROBLEM - Host restbase1010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:34:51] PROBLEM - Host relforge1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:35:29] PROBLEM - Host analytics1055.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:35:29] PROBLEM - Host dbproxy1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:35:29] PROBLEM - Host graphite1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:35:33] PROBLEM - Host 
dbproxy1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:35:35] rack a3 maintenance [13:35:43] PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.29 seconds [13:35:47] PROBLEM - Host analytics1053.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:35:47] PROBLEM - Host analytics1059.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:35:47] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.89 seconds [13:35:49] PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.09 seconds [13:35:53] PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.47 seconds [13:35:59] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.30 seconds [13:35:59] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.11 seconds [13:36:19] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.77 seconds [13:36:21] PROBLEM - Host kubernetes1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:36:23] PROBLEM - Host restbase1011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:36:29] PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.32 seconds [13:36:41] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.84 seconds [13:37:05] PROBLEM - Host analytics1052.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:37:05] PROBLEM - Host analytics1057.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:37:13] ACKNOWLEDGEMENT - Host rdb1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:37:13] ACKNOWLEDGEMENT - Host relforge1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:37:13] ACKNOWLEDGEMENT - Host restbase1010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:37:13] ACKNOWLEDGEMENT - Host restbase1011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:37:13] PROBLEM - Host analytics1060.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:37:13] PROBLEM - Host analytics1054.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:38:07] PROBLEM - Host ganeti1007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:38:07] ACKNOWLEDGEMENT - Host analytics1052.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:38:07] ACKNOWLEDGEMENT - Host analytics1053.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:38:07] ACKNOWLEDGEMENT - Host analytics1054.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:38:07] ACKNOWLEDGEMENT - Host analytics1055.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:38:08] ACKNOWLEDGEMENT - Host analytics1056.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:38:08] ACKNOWLEDGEMENT - Host analytics1057.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:38:09] ACKNOWLEDGEMENT - Host analytics1059.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:38:09] ACKNOWLEDGEMENT - Host analytics1060.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:38:22] (03PS4) 10Gehel: maps: migrate maps1003 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/483798 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) 
[13:38:28] !log starting upgrade to stretch for maps1003 - T198622 [13:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:31] T198622: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 [13:39:53] ACKNOWLEDGEMENT - Host cp1008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:39:53] ACKNOWLEDGEMENT - Host db1103.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:40:43] ACKNOWLEDGEMENT - Host dbproxy1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:40:43] ACKNOWLEDGEMENT - Host dbproxy1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:40:43] ACKNOWLEDGEMENT - Host dbproxy1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:41:38] ACKNOWLEDGEMENT - Host elastic1032.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:41:38] ACKNOWLEDGEMENT - Host elastic1033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:41:38] ACKNOWLEDGEMENT - Host elastic1034.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:41:38] ACKNOWLEDGEMENT - Host elastic1035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:42:10] (03CR) 10Gehel: [C: 03+2] maps: migrate maps1003 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/483798 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [13:42:18] ACKNOWLEDGEMENT - Host ganeti1007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:42:18] ACKNOWLEDGEMENT - Host graphite1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:42:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] admin: allow users to be deployed without ssh keys configured (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey) [13:43:31] 10Operations, 10Pybal, 10Traffic: inconsistencies between pybal configuration and IPVS status - https://phabricator.wikimedia.org/T214041 (10Vgutierrez) [13:44:03] ACKNOWLEDGEMENT - Host kubernetes1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T213859 [13:44:05] (03CR) 10Nuria: "Funny, I tested this change locally and it still did not made that dimension appear. Now I can see it does." [puppet] - 10https://gerrit.wikimedia.org/r/485002 (https://phabricator.wikimedia.org/T212778) (owner: 10Joal) [13:46:06] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: dc=.*,service=.*,cluster=kubernetes,name=kubernetes1001.eqiad.wmnet [13:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:57] mistake btw ^, not needed [13:55:38] (03CR) 10Nikerabbit: "It's probably easiest just to try and see whether this breaks anything. If it doesn't, I think it's fine." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485015 (https://phabricator.wikimedia.org/T211530) (owner: 10Ladsgroup) [13:59:11] PROBLEM - Check systemd state on maps1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
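(The "conftool action" SAL entries above, at 12:58 and 13:46, are written automatically by confctl; the hands-on equivalent looks roughly like the sketch below. The selector string is copied from the log, while the kubectl steps are an assumption about what "uncordoned kubernetes1001" later in the afternoon refers to.)
    # take kubernetes1001 out of LVS rotation ahead of the rack work
    sudo confctl select 'name=kubernetes1001.eqiad.wmnet' set/pooled=no
    kubectl cordon kubernetes1001.eqiad.wmnet      # stop scheduling new pods there
    # and the reverse once the host is back
    sudo confctl select 'name=kubernetes1001.eqiad.wmnet' set/pooled=yes
    kubectl uncordon kubernetes1001.eqiad.wmnet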
[13:59:16] (03PS1) 10Alexandros Kosiaris: zotero: Set correct port in conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/485029 [14:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190117T1400) [14:01:11] !log pooling maps1004 (first time after stretch upgrade) - T198622 [14:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:15] T198622: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 [14:02:10] (03CR) 10Fsero: [C: 03+2] zotero: Set correct port in conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/485029 (owner: 10Alexandros Kosiaris) [14:03:34] !log running ipvsadm -D -t 10.2.2.29:1968 in lvs1006 - T214041 [14:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:37] T214041: inconsistencies between pybal configuration and IPVS status - https://phabricator.wikimedia.org/T214041 [14:04:21] (03CR) 10Alexandros Kosiaris: "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/485029 (owner: 10Alexandros Kosiaris) [14:04:25] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal [14:04:29] lovely [14:04:55] !log running ipvsadm -D -t 10.2.2.29:1968 in lvs1016 - T214041 [14:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:08] 10Operations, 10ops-codfw, 10decommission, 10monitoring: Decom graphite2001 - https://phabricator.wikimedia.org/T200209 (10fgiunchedi) I'm taking graphite2001 now to do some tests for prometheus v2 upgrade in {T187987} [14:05:55] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal [14:07:54] (03PS1) 10Arturo Borrero Gonzalez: aptly: add required ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/485030 (https://phabricator.wikimedia.org/T213421) [14:08:24] 10Operations, 10Pybal, 10Traffic: inconsistencies between pybal configuration and IPVS status - https://phabricator.wikimedia.org/T214041 (10Vgutierrez) After removing a service in pybal, a restart is not enough to get rid of the service at IPVS level, it should be removed manually with `ipvsadm -D -t ip:por... [14:12:55] PROBLEM - Host ms-be1019 is DOWN: PING CRITICAL - Packet loss = 100% [14:13:01] PROBLEM - Host ms-be1044 is DOWN: PING CRITICAL - Packet loss = 100% [14:13:11] PROBLEM - Host ms-be1045 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:04] godog: expected? [14:15:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptly: add required ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/485030 (https://phabricator.wikimedia.org/T213421) (owner: 10Arturo Borrero Gonzalez) [14:16:48] paravoid: yeah, I'll extend the downtime [14:18:36] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts: ` ['maps1003.eqiad.wmn... 
[14:19:29] !log Drop empty frimpressions database from m2 - T213973 [14:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:35] T213973: Remove frimpressions db from prod mysql - https://phabricator.wikimedia.org/T213973 [14:20:58] (03PS8) 10Wangql: Modifying configuration about Chinese Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482261 (https://phabricator.wikimedia.org/T212919) [14:24:41] PROBLEM - Host restbase1010 is DOWN: PING CRITICAL - Packet loss = 100% [14:24:45] PROBLEM - Host restbase1011 is DOWN: PING CRITICAL - Packet loss = 100% [14:24:54] !log Restarting migrateActors.php on s3 [14:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:07] same expired downtime for restbase hosts, I'll extend [14:26:07] (03CR) 10Wangql: [C: 03+1] "> Patch Set 8: Verified+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482261 (https://phabricator.wikimedia.org/T212919) (owner: 10Wangql) [14:26:24] (03CR) 10Wangql: [C: 03+1] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482261 (https://phabricator.wikimedia.org/T212919) (owner: 10Wangql) [14:26:39] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Nuria) 05Open→03Resolved [14:27:00] (03CR) 10Wangql: [C: 03+1] "> Patch Set 8: Verified+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482261 (https://phabricator.wikimedia.org/T212919) (owner: 10Wangql) [14:28:23] (03PS1) 10Arturo Borrero Gonzalez: toolforge: use new instance for aptly server [puppet] - 10https://gerrit.wikimedia.org/r/485035 (https://phabricator.wikimedia.org/T213421) [14:29:07] (03CR) 10jerkins-bot: [V: 04-1] toolforge: use new instance for aptly server [puppet] - 10https://gerrit.wikimedia.org/r/485035 (https://phabricator.wikimedia.org/T213421) (owner: 10Arturo Borrero Gonzalez) [14:30:57] 10Operations, 10Pybal, 10Traffic: inconsistencies between pybal configuration and IPVS status - https://phabricator.wikimedia.org/T214041 (10Vgutierrez) p:05Triage→03Normal [14:31:21] RECOVERY - Juniper alarms on asw-a-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [14:31:51] PROBLEM - Backup of s5 in codfw on db1115 is CRITICAL: Backup for s5 at codfw taken more than 8 days ago: Most recent backup 2019-01-09 14:08:59 [14:32:15] ^ I will check that [14:32:19] this is known and not an issue [14:32:32] I did a copy manually [14:32:48] ah, ok, you want me to downtime the alert? 
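(The T214041 cleanup logged at 14:03/14:04 above is worth spelling out: once a service is removed from pybal's configuration, restarting pybal does not remove the old virtual service from the kernel IPVS table, so it has to be deleted by hand. The VIP and port are the ones from the log; the list step is just how one would confirm the discrepancy first.)
    sudo ipvsadm -L -n                 # list the kernel's current virtual services, numeric output
    sudo ipvsadm -D -t 10.2.2.29:1968  # drop the stale TCP service pybal no longer manages
    # the "PyBal IPVS diff check" in Icinga recovers once the two views agree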
[14:32:58] I was going to do it [14:33:01] (03PS1) 10Elukey: role::prometheus::analytics: add Hadoop test cluster metrics [puppet] - 10https://gerrit.wikimedia.org/r/485036 (https://phabricator.wikimedia.org/T212256) [14:33:06] I'll let you do it then :) [14:33:45] PROBLEM - Host relforge1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:33:47] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 49 threshold =0.15 breach: status: red, number_of_nodes: 1, unassigned_shards: 49, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 51, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: [14:33:47] rds: 51, initializing_shards: 0, number_of_data_nodes: 1, delayed_unassigned_shards: 0 [14:33:57] RECOVERY - Host analytics1056.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [14:34:53] (03CR) 10星耀晨曦: "@Wangql Note:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482261 (https://phabricator.wikimedia.org/T212919) (owner: 10Wangql) [14:35:13] RECOVERY - Host analytics1055.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [14:35:29] RECOVERY - Host analytics1053.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms [14:35:29] RECOVERY - Host analytics1059.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [14:35:39] RECOVERY - Host elastic1031 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [14:35:49] RECOVERY - Host elastic1030 is UP: PING WARNING - Packet loss = 44%, RTA = 0.31 ms [14:36:05] RECOVERY - Host kubernetes1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [14:36:22] !log rolling out update for debdeploy 0.0.99.6-1 -> 0.0.99.7-1 T207845 [14:36:23] RECOVERY - Host restbase1011.mgmt is UP: PING OK - Packet loss = 16%, RTA = 222.82 ms [14:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:28] T207845: debdeploy: show help message if invoked with no arguments - https://phabricator.wikimedia.org/T207845 [14:36:41] RECOVERY - Host cloudservices1004 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [14:36:57] RECOVERY - Host analytics1057.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [14:36:57] RECOVERY - Host analytics1052.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.20 ms [14:37:03] RECOVERY - Host analytics1060.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [14:37:03] RECOVERY - Host analytics1054.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [14:37:03] RECOVERY - Host elastic1030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [14:38:12] (03PS9) 10Wangql: Modifying configuration about Chinese Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482261 (https://phabricator.wikimedia.org/T212919) [14:38:19] RECOVERY - Host elastic1031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [14:38:39] RECOVERY - Host ganeti1007.mgmt is UP: PING WARNING - Packet loss = 61%, RTA = 1.29 ms [14:38:47] RECOVERY - Host cloudservices1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms [14:39:01] RECOVERY - Host prometheus1003.mgmt is UP: PING WARNING - Packet loss = 37%, RTA = 1.21 ms [14:39:21] RECOVERY - Host cp1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 6.12 ms [14:39:21] RECOVERY - Host dbproxy1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [14:39:40] will increase downtime for the health checker [14:39:55] RECOVERY - Host relforge1001.mgmt is UP: PING OK - Packet loss = 0%, 
RTA = 0.82 ms [14:40:07] RECOVERY - Host prometheus1003 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [14:40:17] RECOVERY - Host restbase1010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [14:40:29] RECOVERY - Host graphite1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [14:40:31] RECOVERY - Host restbase1016.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [14:40:33] RECOVERY - Host dbproxy1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [14:40:53] RECOVERY - Host dbstore1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms [14:41:55] PROBLEM - IPMI Sensor Status on prometheus1003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [14:43:35] RECOVERY - Host relforge1001 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [14:44:07] RECOVERY - Host db1103.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [14:44:33] 10Operations, 10Wikimedia-Logstash, 10User-Addshore: Investigate missing WikibaseQualityConstraints logs in logstash. - https://phabricator.wikimedia.org/T214031 (10fgiunchedi) First suspect I checked was the move to new logging infra (on the mw in practice means moving to syslogging to localhost) and there... [14:45:47] RECOVERY - Host dbproxy1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms [14:46:51] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 82, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 103, in [14:46:51] : 0, number_of_data_nodes: 2, delayed_unassigned_shards: 0 [14:49:47] !log rebooting darmstadtium (docker registry) to enable SSBD-enabled qemu [14:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:49] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps1003.eqiad.wmnet'] ` and were **ALL** successful. 
[14:52:23] RECOVERY - Host ganeti1007 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [14:52:23] (03CR) 10星耀晨曦: "About the whitelist: " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482261 (https://phabricator.wikimedia.org/T212919) (owner: 10Wangql) [14:52:35] RECOVERY - Host rdb1005 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [14:54:21] RECOVERY - Host restbase1011 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [14:54:27] RECOVERY - Host restbase1010 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [14:58:17] (03CR) 10Wangql: "> Patch Set 9: Verified+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482261 (https://phabricator.wikimedia.org/T212919) (owner: 10Wangql) [15:03:59] PROBLEM - IPMI Sensor Status on cloudservices1004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] [15:06:35] RECOVERY - Host ps1-a3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.67 ms [15:09:05] PROBLEM - ps1-a3-eqiad-infeed-load-tower-B-phase-Z on ps1-a3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call [15:09:39] PROBLEM - ps1-a3-eqiad-infeed-load-tower-B-phase-Y on ps1-a3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call [15:09:45] PROBLEM - ps1-a3-eqiad-infeed-load-tower-A-phase-Z on ps1-a3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call [15:09:53] PROBLEM - ps1-a3-eqiad-infeed-load-tower-A-phase-X on ps1-a3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call [15:09:53] PROBLEM - ps1-a3-eqiad-infeed-load-tower-A-phase-Y on ps1-a3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call [15:09:59] PROBLEM - ps1-a3-eqiad-infeed-load-tower-B-phase-X on ps1-a3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call [15:12:32] !log rebooting archiva1001 (archiva.wikimedia.org) to enable SSBD-enabled qemu [15:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:41] (03CR) 10Elukey: [C: 03+2] role::prometheus::analytics: add Hadoop test cluster metrics [puppet] - 10https://gerrit.wikimedia.org/r/485036 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [15:17:24] (03PS3) 10Fsero: docker_registry_ha: package installation missing [puppet] - 10https://gerrit.wikimedia.org/r/484510 [15:18:59] (03CR) 10Giuseppe Lavagetto: [C: 03+1] docker_registry_ha: package installation missing [puppet] - 10https://gerrit.wikimedia.org/r/484510 (owner: 10Fsero) [15:19:24] (03CR) 10Fsero: [C: 03+2] "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/484510 (owner: 10Fsero) [15:26:59] PROBLEM - IPMI Sensor Status on elastic1031 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [15:28:04] (03CR) 10Bstorm: "I keep thinking this data was all intended to end up in a dashboard." 
[software/tools-manifest] - 10https://gerrit.wikimedia.org/r/484849 (https://phabricator.wikimedia.org/T107878) (owner: 10BryanDavis) [15:28:53] (03PS1) 10Tulsi Bhagat: Enable transwiki user group on ne.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485043 [15:29:51] 10Operations, 10ORES, 10Scoring-platform-team, 10Performance: Stress test ORES/kubernetes (above 4.5k scores/second) - https://phabricator.wikimedia.org/T214054 (10Halfak) p:05Triage→03Low [15:30:28] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review, 10Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249 (10Halfak) 05Stalled→03Declined I made: {T214054} as a subtask of that epic so we can continue our work there. [15:30:41] PROBLEM - configured eth on analytics1056 is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:30:47] PROBLEM - Disk space on Hadoop worker on analytics1056 is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:30:47] PROBLEM - Hadoop DataNode on analytics1056 is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:30:47] PROBLEM - Check size of conntrack table on analytics1056 is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:30:49] PROBLEM - Host analytics1054 is DOWN: PING CRITICAL - Packet loss = 100% [15:30:53] PROBLEM - Check systemd state on analytics1056 is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:31:13] PROBLEM - YARN NodeManager Node-State on analytics1056 is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:31:13] PROBLEM - SSH on analytics1056 is CRITICAL: connect to address 10.64.5.19 and port 22: Connection refused [15:31:13] PROBLEM - Disk space on analytics1056 is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:31:13] PROBLEM - Check whether ferm is active by checking the default input chain on analytics1056 is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:31:25] PROBLEM - dhclient process on analytics1056 is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:31:31] PROBLEM - Hadoop NodeManager on analytics1056 is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:31:31] PROBLEM - DPKG on analytics1056 is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:32:16] checking [15:32:31] PROBLEM - puppet last run on analytics1056 is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:33:55] PROBLEM - IPMI Sensor Status on elastic1030 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [15:35:20] (03PS2) 10Volans: documentation: fine-tune generated documentation [software/spicerack] - 10https://gerrit.wikimedia.org/r/484330 [15:35:37] RECOVERY - IPMI Sensor Status on analytics1057 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [15:35:49] 10Operations, 10Wikimedia-Logstash, 10User-Addshore: Investigate missing WikibaseQualityConstraints logs in logstash. - https://phabricator.wikimedia.org/T214031 (10fgiunchedi) I can't find anything obviously wrong, the pipeline looks like this with protocols in parenthesis: mw -> (syslog on localhost) -> rs... 
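(Following godog's pipeline description above — mw -> syslog on localhost -> rsyslog -> ... — one quick check for this kind of "on mwlog1001 but not in logstash" gap is to compare counts at both ends for the same window. A rough sketch: the flat-file path assumes the usual /srv/mw-log/<Channel>.log layout, and the Elasticsearch host, index and field names are placeholders, not verified cluster details.)
    # flat-file side, on the mwlog host
    ssh mwlog1001.eqiad.wmnet 'wc -l /srv/mw-log/WikibaseQualityConstraints.log'
    # logstash side: count documents for the channel in today's index
    curl -s "http://${ES_HOST}:9200/logstash-2019.01.17/_count" \
        -H 'Content-Type: application/json' \
        -d '{"query":{"match":{"channel":"WikibaseQualityConstraints"}}}'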
[15:36:10] (03CR) 10Volans: "rebased" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/484330 (owner: 10Volans) [15:37:45] (03PS2) 10Tulsi Bhagat: Enable transwiki user group on ne.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485043 (https://phabricator.wikimedia.org/T214036) [15:37:47] PROBLEM - Check the NTP synchronisation status of timesyncd on analytics1056 is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:38:42] addshore: task updated, though not with a whole lot of findings, sorry :( have to context switch to other duties for now [15:39:19] PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.80 seconds [15:39:29] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.11 seconds [15:39:29] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.12 seconds [15:39:35] PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.67 seconds [15:39:59] PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.79 seconds [15:40:03] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.61 seconds [15:40:13] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:40:13] PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.62 seconds [15:40:15] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.40 seconds [15:40:37] PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:40:55] PROBLEM - Host analytics1056 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:23] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:41:51] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:42:01] RECOVERY - Host analytics1054 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [15:42:21] RECOVERY - IPMI Sensor Status on prometheus1003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [15:42:37] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:42:41] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:43:42] (03CR) 10Filippo Giunchedi: "Looks good! 
I'm ok to merge this as is, however we should make sure do to another conversion just before migrating, in case we update the " [puppet] - 10https://gerrit.wikimedia.org/r/484793 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [15:44:35] PROBLEM - Hadoop NodeManager on analytics1054 is CRITICAL: connect to address 10.64.5.17 port 5666: Connection refused [15:44:43] PROBLEM - Hadoop DataNode on analytics1054 is CRITICAL: connect to address 10.64.5.17 port 5666: Connection refused [15:44:43] PROBLEM - dhclient process on analytics1054 is CRITICAL: connect to address 10.64.5.17 port 5666: Connection refused [15:44:43] PROBLEM - Disk space on analytics1054 is CRITICAL: connect to address 10.64.5.17 port 5666: Connection refused [15:44:55] PROBLEM - Check whether ferm is active by checking the default input chain on analytics1054 is CRITICAL: connect to address 10.64.5.17 port 5666: Connection refused [15:45:09] PROBLEM - SSH on analytics1054 is CRITICAL: connect to address 10.64.5.17 and port 22: Connection refused [15:45:15] PROBLEM - configured eth on analytics1054 is CRITICAL: connect to address 10.64.5.17 port 5666: Connection refused [15:45:21] PROBLEM - DPKG on analytics1054 is CRITICAL: connect to address 10.64.5.17 port 5666: Connection refused [15:45:21] PROBLEM - Check size of conntrack table on analytics1054 is CRITICAL: connect to address 10.64.5.17 port 5666: Connection refused [15:45:23] PROBLEM - puppet last run on analytics1054 is CRITICAL: connect to address 10.64.5.17 port 5666: Connection refused [15:45:29] PROBLEM - Disk space on Hadoop worker on analytics1054 is CRITICAL: connect to address 10.64.5.17 port 5666: Connection refused [15:45:41] PROBLEM - Check systemd state on analytics1054 is CRITICAL: connect to address 10.64.5.17 port 5666: Connection refused [15:45:47] (03PS3) 10Giuseppe Lavagetto: profile::services_proxy: simple local proxying for remote services [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717) [15:46:15] RECOVERY - Juniper alarms on asw2-a-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:46:17] RECOVERY - Host analytics1056 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [15:46:21] PROBLEM - configured eth on analytics1056 is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:46:29] PROBLEM - Check size of conntrack table on analytics1056 is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:46:29] PROBLEM - Disk space on Hadoop worker on analytics1056 is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:46:29] PROBLEM - Hadoop DataNode on analytics1056 is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:46:33] PROBLEM - Check systemd state on analytics1056 is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:46:37] (03CR) 10jerkins-bot: [V: 04-1] profile::services_proxy: simple local proxying for remote services [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717) (owner: 10Giuseppe Lavagetto) [15:46:57] PROBLEM - Check whether ferm is active by checking the default input chain on analytics1056 is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:46:57] PROBLEM - Disk space on analytics1056 is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:46:57] PROBLEM - YARN NodeManager Node-State on analytics1056 
is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:47:57] PROBLEM - puppet last run on analytics1056 is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:48:08] (03CR) 10Bstorm: "So I have a possibly stupid question. The manifest file is created by something else, no? This variable never gets written out (does it?" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/484849 (https://phabricator.wikimedia.org/T107878) (owner: 10BryanDavis) [15:48:33] RECOVERY - IPMI Sensor Status on elastic1035 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [15:49:03] RECOVERY - IPMI Sensor Status on analytics1053 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [15:49:13] PROBLEM - YARN NodeManager Node-State on analytics1054 is CRITICAL: connect to address 10.64.5.17 port 5666: Connection refused [15:49:29] PROBLEM - MegaRAID on analytics1056 is CRITICAL: connect to address 10.64.5.19 port 5666: Connection refused [15:50:11] RECOVERY - IPMI Sensor Status on graphite1003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [15:50:43] RECOVERY - IPMI Sensor Status on analytics1059 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [15:50:47] RECOVERY - IPMI Sensor Status on db1103 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [15:50:57] RECOVERY - IPMI Sensor Status on elastic1033 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [15:51:11] RECOVERY - IPMI Sensor Status on analytics1052 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [15:51:47] RECOVERY - IPMI Sensor Status on elastic1034 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [15:53:16] (03PS1) 10Vgutierrez: pybal: check for discrepancies in the configured services [puppet] - 10https://gerrit.wikimedia.org/r/485044 (https://phabricator.wikimedia.org/T203194) [15:53:44] (03CR) 10jerkins-bot: [V: 04-1] pybal: check for discrepancies in the configured services [puppet] - 10https://gerrit.wikimedia.org/r/485044 (https://phabricator.wikimedia.org/T203194) (owner: 10Vgutierrez) [15:54:22] (03PS2) 10Vgutierrez: pybal: check for discrepancies in the configured services [puppet] - 10https://gerrit.wikimedia.org/r/485044 (https://phabricator.wikimedia.org/T214041) [15:54:27] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:54:51] (03CR) 10jerkins-bot: [V: 04-1] pybal: check for discrepancies in the configured services [puppet] - 10https://gerrit.wikimedia.org/r/485044 (https://phabricator.wikimedia.org/T214041) (owner: 10Vgutierrez) [15:55:19] RECOVERY - IPMI Sensor Status on dbproxy1003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [15:55:33] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:55:35] (03PS3) 10Vgutierrez: pybal: check for discrepancies in the configured services [puppet] - 10https://gerrit.wikimedia.org/r/485044 (https://phabricator.wikimedia.org/T214041) [15:55:37] PROBLEM - tilerator on maps1001 is CRITICAL: connect to address 10.64.0.79 and port 6534: Connection refused [15:55:51] gehel: ^ [15:56:15] PROBLEM - tilerator on maps1002 is CRITICAL: connect to address 10.64.16.42 and port 6534: Connection refused [15:56:25] RECOVERY - IPMI Sensor Status on dbproxy1002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [15:57:05] RECOVERY - IPMI Sensor Status on elastic1031 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [15:57:09] (03PS4) 10Vgutierrez: pybal: check for discrepancies in the configured services [puppet] - 10https://gerrit.wikimedia.org/r/485044 (https://phabricator.wikimedia.org/T214041) [15:57:22] not good [15:57:35] RECOVERY - IPMI Sensor Status on ganeti1007 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [15:58:33] RECOVERY - IPMI Sensor Status on relforge1001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [15:59:29] godog: find with me :) I don't care too much, but I imagine ops will if this issue is a real issue / still happening etc :) [15:59:35] so no rush from my side :D [15:59:36] onimisionipe: let's restart tilerator [15:59:55] gehel: on it [16:00:00] thanks! [16:00:01] RECOVERY - IPMI Sensor Status on rdb1005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [16:00:17] PROBLEM - Check systemd state on maps1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:01:07] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:01:09] RECOVERY - IPMI Sensor Status on analytics1055 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [16:01:21] RECOVERY - Check systemd state on maps1002 is OK: OK - running: The system is fully operational [16:01:23] RECOVERY - Check systemd state on maps1001 is OK: OK - running: The system is fully operational [16:01:47] RECOVERY - IPMI Sensor Status on dbproxy1001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [16:01:49] RECOVERY - tilerator on maps1002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.032 second response time [16:02:15] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:02:17] RECOVERY - tilerator on maps1001 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.033 second response time [16:02:57] RECOVERY - IPMI Sensor Status on restbase1011 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [16:04:03] RECOVERY - IPMI Sensor Status on elastic1030 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [16:04:15] RECOVERY - IPMI Sensor Status on elastic1032 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [16:04:19] RECOVERY - IPMI Sensor Status on cloudservices1004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [16:04:21] (03PS1) 10Bstorm: toolforge: move webservicemonitor to the cronrunner [puppet] - 10https://gerrit.wikimedia.org/r/485046 [16:04:51] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:04:57] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:05:11] RECOVERY - IPMI Sensor Status on kubernetes1001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [16:05:11] RECOVERY - IPMI Sensor Status on analytics1060 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [16:06:03] RECOVERY - IPMI Sensor Status on restbase1010 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [16:06:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge: move webservicemonitor to the cronrunner [puppet] - 10https://gerrit.wikimedia.org/r/485046 (owner: 10Bstorm) [16:06:48] (03CR) 10Volans: [C: 03+1] "Looks sane to me, I didn't verify that the prometheus data mangling is indeed correct though." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/485044 (https://phabricator.wikimedia.org/T214041) (owner: 10Vgutierrez) [16:07:03] (03CR) 10Bstorm: [C: 03+2] toolforge: move webservicemonitor to the cronrunner [puppet] - 10https://gerrit.wikimedia.org/r/485046 (owner: 10Bstorm) [16:08:05] Fsero: can I merge your docker patch thingy? [16:08:18] yes sorry bstorm_ [16:08:25] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:08:27] no worries just want to make sure! [16:08:30] (03CR) 10BryanDavis: "> So I have a possibly stupid question. The manifest file is created" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/484849 (https://phabricator.wikimedia.org/T107878) (owner: 10BryanDavis) [16:08:35] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:09:17] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 646.85 seconds [16:11:57] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 610.98 seconds [16:11:58] addshore: ack, thanks for the detailed task! definitely appreciate it, let us know if you see sth like that again [16:12:01] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:12:09] godog: will do! 
:) [16:15:26] (03PS8) 10Ottomata: Add kafka-dev chart for local development [deployment-charts] - 10https://gerrit.wikimedia.org/r/484498 (https://phabricator.wikimedia.org/T211247) [16:15:43] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:16:00] !log rebooting tureis (failoid node in codfw) to enable SSBD-enabled qemu [16:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:35] RECOVERY - dhclient process on analytics1056 is OK: PROCS OK: 0 processes with command name dhclient [16:17:37] RECOVERY - SSH on analytics1056 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) [16:17:37] RECOVERY - Disk space on analytics1056 is OK: DISK OK [16:17:37] RECOVERY - Check whether ferm is active by checking the default input chain on analytics1056 is OK: OK ferm input default policy is set [16:17:39] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 8 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (10Addshore) [16:18:01] RECOVERY - DPKG on analytics1056 is OK: All packages OK [16:18:05] RECOVERY - configured eth on analytics1056 is OK: OK - interfaces up [16:18:07] RECOVERY - Hadoop NodeManager on analytics1056 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:18:09] RECOVERY - YARN NodeManager Node-State on analytics1056 is OK: OK: YARN NodeManager analytics1056.eqiad.wmnet:8041 Node-State: RUNNING [16:18:15] RECOVERY - Disk space on Hadoop worker on analytics1056 is OK: DISK OK [16:18:17] RECOVERY - Check size of conntrack table on analytics1056 is OK: OK: nf_conntrack is 0 % full [16:18:17] RECOVERY - Hadoop DataNode on analytics1056 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [16:18:25] RECOVERY - Check systemd state on analytics1056 is OK: OK - running: The system is fully operational [16:18:31] !log ps1-a2-eqiad removing redundant power from side A to replace blown fuse [16:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:01] !log rebooting roentgenium (failoid node in eqiad) to enable SSBD-enabled qemu [16:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:24] (03PS13) 10Ottomata: [WIP] Helm chart for eventgate-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) [16:21:25] PROBLEM - Host an-worker1079 is DOWN: PING CRITICAL - Packet loss = 100% [16:21:41] PROBLEM - Host an-worker1078 is DOWN: PING CRITICAL - Packet loss = 100% [16:21:57] RECOVERY - IPMI Sensor Status on analytics1056 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [16:22:05] PROBLEM - Host kafka1023 is DOWN: PING CRITICAL - Packet loss = 100% [16:22:13] PROBLEM - Host kafka1013 is DOWN: PING CRITICAL - Packet loss = 100% [16:23:07] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:23:37] RECOVERY - Host an-worker1078 is UP: PING WARNING - Packet loss = 54%, RTA = 0.27 ms [16:23:39] RECOVERY - Host an-worker1079 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [16:24:07] RECOVERY - Host kafka1023 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [16:24:13] PROBLEM - 
Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:24:35] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:25:27] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:25:57] PROBLEM - Host kafka1012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:25:59] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [16:26:11] PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:26:13] RECOVERY - Host kafka1013 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [16:27:04] elukey: and dbstore1002 mysql just died [16:27:07] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [16:27:18] marostegui: /me cries in a corner [16:27:20] !log fsero@puppetmaster1001 conftool action : set/pooled=yes; selector: name=kubernetes1001.eqiad.wmnet [16:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:41] PROBLEM - MariaDB Slave IO: s1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [16:27:45] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [16:27:47] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [16:27:53] PROBLEM - MariaDB Slave IO: m2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [16:27:53] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [16:28:01] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [16:28:03] PROBLEM - MariaDB Slave SQL: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [16:28:03] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:28:07] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [16:28:15] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [16:28:15] !log uncordoned kubernetes1001 [16:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:19] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:28:21] PROBLEM - MariaDB Slave IO: x1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [16:28:23] PROBLEM - MariaDB Slave IO: s8 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [16:28:27] PROBLEM - MariaDB Slave IO: s6 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [16:28:27] PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [16:28:27] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [16:28:29] PROBLEM - MariaDB Slave SQL: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [16:28:31] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [16:28:35] PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [16:28:35] PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [16:28:37] PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [16:28:37] PROBLEM - MariaDB Slave IO: m3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [16:28:39] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:28:45] PROBLEM - MariaDB Slave IO: s4 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [16:28:49] PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [16:29:54] (03PS1) 10Bstorm: toolforge: fix the services role in case we need to rollback [puppet] - 10https://gerrit.wikimedia.org/r/485050 [16:30:20] 10Operations, 10monitoring: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) [16:30:59] (03PS1) 10Elukey: Remove /var/lib/hadoop/g from analytics1056's Hadoop conf [puppet] - 10https://gerrit.wikimedia.org/r/485051 (https://phabricator.wikimedia.org/T214057) [16:31:55] (03CR) 10Bstorm: [C: 03+2] toolforge: fix the services role in case we need to rollback [puppet] - 10https://gerrit.wikimedia.org/r/485050 (owner: 10Bstorm) [16:32:28] 10Operations, 10monitoring: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) Converting beta prometheus worked well (sans puppetization) and I'll be testing a conversion on production data from codfw on graphite2001 (spare host) [16:34:39] (03CR) 10Elukey: [C: 03+2] Remove /var/lib/hadoop/g from analytics1056's Hadoop conf [puppet] - 10https://gerrit.wikimedia.org/r/485051 (https://phabricator.wikimedia.org/T214057) (owner: 10Elukey) [16:34:46] (03PS2) 10Elukey: Remove /var/lib/hadoop/g from analytics1056's Hadoop conf [puppet] - 10https://gerrit.wikimedia.org/r/485051 (https://phabricator.wikimedia.org/T214057) [16:34:48] (03CR) 10Elukey: [V: 03+2 C: 03+2] Remove /var/lib/hadoop/g from analytics1056's Hadoop conf [puppet] - 10https://gerrit.wikimedia.org/r/485051 (https://phabricator.wikimedia.org/T214057) (owner: 10Elukey) [16:35:33] 10Operations, 10Cassandra, 10Dependency-Tracking, 10Wikibase-Quality, and 6 others: Store WikibaseQualityConstraint check data in persistent storage instead of in the cache - https://phabricator.wikimedia.org/T204024 (10Addshore) > will we store data only for the latest revision or not (implies 
different s... [16:37:15] RECOVERY - Disk space on analytics1054 is OK: DISK OK [16:37:17] RECOVERY - Check whether ferm is active by checking the default input chain on analytics1054 is OK: OK ferm input default policy is set [16:37:17] RECOVERY - MegaRAID on analytics1054 is OK: OK: optimal, 12 logical, 13 physical, WriteBack policy [16:37:21] RECOVERY - Check systemd state on analytics1054 is OK: OK - running: The system is fully operational [16:37:31] RECOVERY - Hadoop NodeManager on analytics1054 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:37:41] RECOVERY - YARN NodeManager Node-State on analytics1054 is OK: OK: YARN NodeManager analytics1054.eqiad.wmnet:8041 Node-State: RUNNING [16:37:49] RECOVERY - dhclient process on analytics1054 is OK: PROCS OK: 0 processes with command name dhclient [16:37:51] RECOVERY - SSH on analytics1054 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) [16:37:59] RECOVERY - configured eth on analytics1054 is OK: OK - interfaces up [16:38:01] ACKNOWLEDGEMENT - eventlogging_sync processes on db1108 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh Marostegui rack a2 maintenance - The acknowledgement expires at: 2019-01-18 16:37:45. [16:38:01] ACKNOWLEDGEMENT - haproxy failover on dbproxy1004 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui rack a2 maintenance - The acknowledgement expires at: 2019-01-18 16:37:45. [16:38:01] ACKNOWLEDGEMENT - haproxy failover on dbproxy1009 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui rack a2 maintenance - The acknowledgement expires at: 2019-01-18 16:37:45. [16:38:07] RECOVERY - Hadoop DataNode on analytics1054 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [16:38:09] RECOVERY - Check size of conntrack table on analytics1054 is OK: OK: nf_conntrack is 0 % full [16:38:09] RECOVERY - DPKG on analytics1054 is OK: All packages OK [16:38:15] RECOVERY - Check the NTP synchronisation status of timesyncd on analytics1056 is OK: OK: synced at Thu 2019-01-17 16:38:14 UTC. [16:38:16] 10Operations, 10monitoring: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) [16:38:16] 10Operations, 10Cassandra, 10Dependency-Tracking, 10Wikibase-Quality, and 7 others: Store WikibaseQualityConstraint check data in persistent storage instead of in the cache - https://phabricator.wikimedia.org/T204024 (10Addshore) [16:38:21] RECOVERY - Disk space on Hadoop worker on analytics1054 is OK: DISK OK [16:39:31] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [16:41:26] (03CR) 10Bstorm: "> Patch Set 1:" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/484849 (https://phabricator.wikimedia.org/T107878) (owner: 10BryanDavis) [16:41:44] !log dcausse@deploy1001 Started deploy [search/mjolnir/deploy@42414ca]: add support for multi-instances setup [16:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:04] Prod clear for an MW-land deployment? UBN backport for some VE code. 
[16:42:25] RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:43:49] (03CR) 10GTirloni: [C: 03+2] wmcs::nfs::misc - Remove unused /srv/* exports [puppet] - 10https://gerrit.wikimedia.org/r/485053 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [16:43:59] (03PS2) 10GTirloni: wmcs::nfs::misc - Remove unused /srv/* exports [puppet] - 10https://gerrit.wikimedia.org/r/485053 (https://phabricator.wikimedia.org/T209527) [16:45:01] RECOVERY - ps1-a3-eqiad-infeed-load-tower-B-phase-X on ps1-a3-eqiad is OK: SNMP OK - ps1-a3-eqiad-infeed-load-tower-B-phase-X 900 [16:45:05] RECOVERY - ps1-a3-eqiad-infeed-load-tower-A-phase-X on ps1-a3-eqiad is OK: SNMP OK - ps1-a3-eqiad-infeed-load-tower-A-phase-X 850 [16:45:07] RECOVERY - ps1-a3-eqiad-infeed-load-tower-A-phase-Y on ps1-a3-eqiad is OK: SNMP OK - ps1-a3-eqiad-infeed-load-tower-A-phase-Y 663 [16:45:07] RECOVERY - ps1-a3-eqiad-infeed-load-tower-B-phase-Z on ps1-a3-eqiad is OK: SNMP OK - ps1-a3-eqiad-infeed-load-tower-B-phase-Z 900 [16:45:14] !log updating ps1-a3-eqiad's SNMP communities to the new ones [16:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:17] PROBLEM - Host kafka1012 is DOWN: PING CRITICAL - Packet loss = 100% [16:45:18] i.e. ^^ is me [16:45:18] 10Operations, 10netops: Netbox Dies Mysteriously Sometimes - https://phabricator.wikimedia.org/T214008 (10crusnov) >>! In T214008#4887389, @elukey wrote: > @crusnov hi! I think this is the same issue as T212697 Ah this is undoubtedly correct. [16:45:19] RECOVERY - ps1-a3-eqiad-infeed-load-tower-B-phase-Y on ps1-a3-eqiad is OK: SNMP OK - ps1-a3-eqiad-infeed-load-tower-B-phase-Y 600 [16:45:25] RECOVERY - ps1-a3-eqiad-infeed-load-tower-A-phase-Z on ps1-a3-eqiad is OK: SNMP OK - ps1-a3-eqiad-infeed-load-tower-A-phase-Z 788 [16:46:12] 10Operations, 10monitoring: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) There was an open question re: having Prometheus 2 package co-installable with Prometheus 1, I think it is simpler to keep the package name the same and thus i... [16:46:26] 10Operations, 10netops: Netbox Dies Mysteriously Sometimes - https://phabricator.wikimedia.org/T214008 (10elukey) [16:46:28] 10Operations: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (10elukey) [16:46:43] !log dcausse@deploy1001 Finished deploy [search/mjolnir/deploy@42414ca]: add support for multi-instances setup (duration: 04m 59s) [16:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:49] PROBLEM - Juniper alarms on asw-a-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:46:50] 10Operations: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (10crusnov) Ah hah you beat me to merging it. 
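For context on the ps1-a3-eqiad SNMP community update logged above: after the host running the checks is switched to the new community string, the quickest sanity check is a manual query against a standard OID. A minimal sketch, assuming the net-snmp client tools are available and using a placeholder community string (the real one lives in the private configuration):
    # Query the standard sysDescr OID; a reply means the PDU accepts the new community,
    # while a timeout means the infeed-load checks above would go CRITICAL again.
    snmpget -v2c -c NEW_COMMUNITY ps1-a3-eqiad 1.3.6.1.2.1.1.1.0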
[16:47:59] RECOVERY - IPMI Sensor Status on analytics1054 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [16:49:15] RECOVERY - Juniper alarms on asw-a-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:50:02] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Jan-Mar 2019), 10Documentation: Add Srishti to analytics-privatedata-users - https://phabricator.wikimedia.org/T213780 (10CDanis) Hi Srishti, Just wanted to check in and see if you needed any assistance? Please let me know :) [16:50:28] 10Operations, 10hardware-requests: Two test hosts for SREs - https://phabricator.wikimedia.org/T214024 (10CDanis) p:05Triage→03Normal [16:51:09] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Performance-Team, 10Software-Licensing: New MongoDB version is not DFSG-compatible, dropped by Debian - https://phabricator.wikimedia.org/T213996 (10CDanis) p:05Triage→03Normal [16:51:37] 10Operations, 10Wikimedia-Logstash, 10User-Addshore: Investigate missing WikibaseQualityConstraints logs in logstash. - https://phabricator.wikimedia.org/T214031 (10CDanis) p:05Triage→03Normal [16:53:59] 10Operations, 10monitoring: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) [16:54:45] PROBLEM - MariaDB Slave IO: s5 on db1124 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1082.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1082.eqiad.wmnet (111 Connection refused) [16:54:55] PROBLEM - MariaDB Slave IO: s2 on db1125 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1074.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1074.eqiad.wmnet (111 Connection refused) [16:55:09] PROBLEM - MariaDB Slave Lag: s7 on db1125 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 21443.20 seconds [16:55:11] those are downtimes that expired [16:55:11] PROBLEM - MariaDB Slave Lag: s2 on db1125 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 21423.20 seconds [16:55:16] I am silencing those again [16:55:17] ack, thanks [16:56:02] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.13/extensions/VisualEditor/modules/ve-mw/: T213922: Revert 48db45df7602 for wmf.13 (duration: 00m 51s) [16:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:04] T213922: VE removes all new lines when editing a template - https://phabricator.wikimedia.org/T213922 [16:56:12] 10Operations, 10netops, 10Patch-For-Review: IGMP snooping breaks IPv6 ND on Junos 14.1X53-D46 - https://phabricator.wikimedia.org/T201039 (10ayounsi) p:05Normal→03Low Discussed it with Brandon, it's still something we want to fix but is now low priority. We will probably have to wait for the next DC fail... [16:57:12] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.12/extensions/VisualEditor/modules/ve-mw/: T213922: Revert 48db45df7602 for wmf.12 (duration: 00m 52s) [16:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:18] 10Operations, 10monitoring: Convert prometheus-labs-targets to use nova API instead of wikitech's api.php - https://phabricator.wikimedia.org/T214058 (10fgiunchedi) p:05Triage→03Normal [17:00:04] godog and _joe_: I seem to be stuck in Groundhog week. Sigh. 
Time for (yet another) Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190117T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. [17:00:55] RECOVERY - Host ms-be1045 is UP: PING WARNING - Packet loss = 44%, RTA = 0.23 ms [17:01:01] RECOVERY - IPMI Sensor Status on ms-be1045 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [17:01:03] RECOVERY - Host ms-be1044 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [17:01:05] (03PS14) 10Ottomata: Helm chart for eventgate-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) [17:01:31] RECOVERY - Host ms-be1019 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [17:02:05] 10Operations, 10Wikimedia-Mailing-lists: request of a new mailing list - https://phabricator.wikimedia.org/T214059 (10Giaccai) [17:02:46] (03PS2) 10Arturo Borrero Gonzalez: toolforge: use new instance for aptly server [puppet] - 10https://gerrit.wikimedia.org/r/485035 (https://phabricator.wikimedia.org/T213421) [17:03:31] (03CR) 10jerkins-bot: [V: 04-1] toolforge: use new instance for aptly server [puppet] - 10https://gerrit.wikimedia.org/r/485035 (https://phabricator.wikimedia.org/T213421) (owner: 10Arturo Borrero Gonzalez) [17:03:38] (03PS1) 10Elukey: Remove host specific overrides for analytics1029 [puppet] - 10https://gerrit.wikimedia.org/r/485061 (https://phabricator.wikimedia.org/T212256) [17:04:09] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 62.07% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [17:04:13] (03CR) 10Elukey: [C: 03+2] Remove host specific overrides for analytics1029 [puppet] - 10https://gerrit.wikimedia.org/r/485061 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [17:04:54] so kafka1012 was down and up, 1022 is recovering [17:04:56] I am checking [17:05:21] (03PS1) 10Andrew Bogott: cloudvirt1025: move from Jessie to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/485062 [17:05:29] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 63.33% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [17:06:29] RECOVERY - IPMI Sensor Status on ms-be1044 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [17:07:28] (03PS1) 10Mforns: Adapt saltrotate and EventLoggingSanitization params in data_purge.pp [puppet] - 10https://gerrit.wikimedia.org/r/485063 (https://phabricator.wikimedia.org/T212014) [17:07:34] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1025: move from Jessie to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/485062 (owner: 10Andrew Bogott) [17:07:44] (03PS3) 10Arturo Borrero Gonzalez: toolforge: use new instance for aptly server [puppet] - 10https://gerrit.wikimedia.org/r/485035 (https://phabricator.wikimedia.org/T213421) [17:08:17] (03CR) 10Mforns: [C: 04-1] "This needs to be merged together with a refinery-source/refinery deployment." 
[puppet] - 10https://gerrit.wikimedia.org/r/485063 (https://phabricator.wikimedia.org/T212014) (owner: 10Mforns) [17:08:32] (03PS4) 10Arturo Borrero Gonzalez: toolforge: use new instance for aptly server [puppet] - 10https://gerrit.wikimedia.org/r/485035 (https://phabricator.wikimedia.org/T213421) [17:09:17] RECOVERY - Host kafka1012 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [17:09:32] 10Operations, 10monitoring, 10Goal, 10Patch-For-Review: Upgrade production prometheus-node-exporter to >= 0.16 - https://phabricator.wikimedia.org/T213708 (10colewhite) Changeset [1] contains a current snapshot of the converted rules files. These will be needed for prometheus server to not only maintain c... [17:11:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: use new instance for aptly server [puppet] - 10https://gerrit.wikimedia.org/r/485035 (https://phabricator.wikimedia.org/T213421) (owner: 10Arturo Borrero Gonzalez) [17:12:03] RECOVERY - MariaDB Slave SQL: m2 on dbstore1002 is OK: OK slave_sql_state not a slave [17:12:09] RECOVERY - MariaDB Slave IO: m3 on dbstore1002 is OK: OK slave_io_state not a slave [17:13:38] Hi Pchelolo :d [17:13:39] :D [17:14:13] Pchelolo: how would you feel about me bumping it to 5% today then? :) [17:14:37] addshore: feel great! [17:14:41] whoo [17:15:13] 1% is 0.15 jobs/s, so even 10% would not make any difference for the queue [17:15:21] RECOVERY - Host kafka1012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [17:15:23] !log dcausse@deploy1001 Started deploy [search/mjolnir/deploy@85aec7a]: fix multi-instances support [17:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:35] Pchelolo: great, in that case i'll try to do 5% today and then 10% tomorrow eu time [17:15:51] RECOVERY - MariaDB Slave IO: s5 on db1124 is OK: OK slave_io_state Slave_IO_Running: Yes [17:17:15] RECOVERY - MariaDB Slave IO: s2 on db1125 is OK: OK slave_io_state Slave_IO_Running: Yes [17:18:48] 10Operations, 10monitoring, 10Goal, 10Patch-For-Review: Upgrade production prometheus-node-exporter to >= 0.16 - https://phabricator.wikimedia.org/T213708 (10fgiunchedi) I also just realized that the compatibility rules are in Prometheus v2 format, though we'll need them in v1 format as well to decouple th... [17:18:53] RECOVERY - MariaDB Slave Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 49.49 seconds [17:18:57] RECOVERY - MariaDB Slave Lag: s2 on db2091 is OK: OK slave_sql_lag Replication lag: 33.72 seconds [17:19:05] !log dcausse@deploy1001 Finished deploy [search/mjolnir/deploy@85aec7a]: fix multi-instances support (duration: 03m 42s) [17:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:07] RECOVERY - MariaDB Slave Lag: s2 on db2041 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [17:19:19] RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 0.31 seconds [17:19:21] RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 0.23 seconds [17:19:25] RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 0.28 seconds [17:19:37] RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [17:21:26] (03CR) 10GTirloni: [C: 03+1] "would this help in this case? 
https://gerrit.wikimedia.org/r/c/operations/software/tools-manifest/+/479181" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/484849 (https://phabricator.wikimedia.org/T107878) (owner: 10BryanDavis) [17:23:34] elukey: db1107 is now up, you want me to start mysql? [17:23:44] (03PS1) 10Volans: puppet: add is_disabled() method [software/spicerack] - 10https://gerrit.wikimedia.org/r/485066 [17:24:51] (03CR) 10Volans: "The need of this method comes from I05862750f2832a5fee91d64bcfdbd5d0e3c6b6a0" [software/spicerack] - 10https://gerrit.wikimedia.org/r/485066 (owner: 10Volans) [17:25:07] !log restarting mjolnir services on all elastic* nodes [17:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:09] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [17:26:09] marostegui: yes please [17:26:13] ok [17:26:37] elukey: done [17:26:49] <# [17:26:50] <4 [17:26:53] aaarghhhh [17:26:55] XDD [17:26:55] <3 :D [17:27:29] PROBLEM - Nginx local proxy to videoscaler on mw1309 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.009 second response time [17:27:37] PROBLEM - IPMI Sensor Status on kafka1012 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] [17:28:13] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [17:28:37] RECOVERY - Nginx local proxy to videoscaler on mw1309 is OK: HTTP OK: HTTP/1.1 200 OK - 288 bytes in 0.009 second response time [17:29:07] RECOVERY - IPMI Sensor Status on db1082 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [17:31:54] (03CR) 10Ottomata: Remove externalIP settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/484670 (owner: 10Alexandros Kosiaris) [17:32:10] (03CR) 10Vgutierrez: [C: 03+2] pybal: check for discrepancies in the configured services [puppet] - 10https://gerrit.wikimedia.org/r/485044 (https://phabricator.wikimedia.org/T214041) (owner: 10Vgutierrez) [17:32:47] (03PS5) 10Vgutierrez: pybal: check for discrepancies in the configured services [puppet] - 10https://gerrit.wikimedia.org/r/485044 (https://phabricator.wikimedia.org/T214041) [17:33:18] jouncebot: now [17:33:18] For the next 0 hour(s) and 26 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190117T1700) [17:33:41] Pchelolo: as there is nothing in puppet swat I might use the next 30 mins to do the bump to 5% then [17:33:43] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:33:49] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:33:49] RECOVERY - MariaDB Slave IO: m2 on dbstore1002 is OK: OK slave_io_state not a slave [17:33:53] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:33:53] RECOVERY - MariaDB Slave IO: s8 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:34:03] RECOVERY - MariaDB Slave SQL: s8 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:34:03] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:34:07] RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK 
slave_sql_state not a slave [17:34:19] RECOVERY - MariaDB Slave IO: s5 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:34:19] RECOVERY - MariaDB Slave SQL: s6 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:34:19] RECOVERY - MariaDB Slave IO: s7 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:34:19] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:34:33] RECOVERY - MariaDB Slave IO: s3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:34:33] (03PS2) 10Addshore: wikidata: post edit constraint jobs on 5% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484625 (https://phabricator.wikimedia.org/T204031) [17:34:35] RECOVERY - MariaDB Slave IO: s4 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:34:43] RECOVERY - MariaDB Slave IO: s1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:34:43] RECOVERY - MariaDB Slave IO: x1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:34:49] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:34:59] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:35:28] (03CR) 10Addshore: [C: 03+2] wikidata: post edit constraint jobs on 5% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484625 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [17:36:35] (03Merged) 10jenkins-bot: wikidata: post edit constraint jobs on 5% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484625 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [17:36:47] RECOVERY - MariaDB Slave IO: s2 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:37:59] (03CR) 10Bstorm: "> Patch Set 1:" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/484849 (https://phabricator.wikimedia.org/T107878) (owner: 10BryanDavis) [17:38:02] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: ConstraintsCheckJobs on wikidatawiki (5% of edits) T204031 (duration: 00m 52s) [17:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:09] T204031: Deploy regular running of wikidata constraint checks using the job queue - https://phabricator.wikimedia.org/T204031 [17:38:24] (03CR) 10jenkins-bot: wikidata: post edit constraint jobs on 5% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484625 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [17:38:35] RECOVERY - MariaDB Slave IO: s6 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:41:43] 10Operations, 10monitoring: Convert prometheus-labs-targets to use nova API instead of wikitech's api.php - https://phabricator.wikimedia.org/T214058 (10fgiunchedi) [17:43:59] (03PS1) 10Elukey: check_hdfs_active_namenode: find cluster name in the config [puppet/cdh] - 10https://gerrit.wikimedia.org/r/485070 (https://phabricator.wikimedia.org/T212256) [17:47:05] (03CR) 10Elukey: [V: 03+2 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14367/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/485070 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [17:48:29] PROBLEM - PyBal IPVS diff check on lvs2005 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:860:ed1a::3:fe:53]) [17:48:49] (03PS1) 10Elukey: Update module cdh to its latest sha [puppet] - 10https://gerrit.wikimedia.org/r/485071 [17:49:11] 
(03CR) 10Elukey: [V: 03+2 C: 03+2] Update module cdh to its latest sha [puppet] - 10https://gerrit.wikimedia.org/r/485071 (owner: 10Elukey) [17:49:25] PROBLEM - PyBal IPVS diff check on lvs2002 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:860:ed1a::3:fe:53]) [17:50:09] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.51 seconds [17:50:11] PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.33 seconds [17:50:13] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.40 seconds [17:50:29] PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.31 seconds [17:50:47] PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.77 seconds [17:50:47] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.79 seconds [17:50:49] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.91 seconds [17:50:53] PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.14 seconds [17:52:42] (03PS15) 10Ottomata: Helm chart for eventgate-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) [17:52:54] !log re-enable eventlogging mysql clients and db1108's el replication after db1107 maintenance [17:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:13] RECOVERY - eventlogging_sync processes on db1108 is OK: PROCS OK: 1 process with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [17:55:03] RECOVERY - Check status of defined EventLogging jobs on eventlog1002 is OK: OK: All defined EventLogging jobs are runnning. [17:55:12] (03PS1) 10Mathew.onipe: maps: change cassandra version [puppet] - 10https://gerrit.wikimedia.org/r/485072 (https://phabricator.wikimedia.org/T198622) [17:58:51] (03CR) 10Ottomata: [C: 03+1] check_hdfs_active_namenode: find cluster name in the config [puppet/cdh] - 10https://gerrit.wikimedia.org/r/485070 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: Time to snap out of that daydream and deploy Services – Graphoid / Parsoid / Citoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190117T1800). [18:00:29] 10Operations, 10Wikimedia-Mailing-lists: request of a new mailing list - https://phabricator.wikimedia.org/T214059 (10MarcoAurelio) a:05Giaccai→03None [18:00:41] no parsoid deploy today [18:03:11] (03Abandoned) 10Ottomata: Add round robin DNS records for Kafka clusters [dns] - 10https://gerrit.wikimedia.org/r/484509 (https://phabricator.wikimedia.org/T213561) (owner: 10Ottomata) [18:04:37] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: Broken disk on analytics1056 - https://phabricator.wikimedia.org/T214057 (10mforns) [18:04:59] PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.29:1968]) [18:05:51] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Performance-Team, 10Software-Licensing: New MongoDB version is not DFSG-compatible, dropped by Debian - https://phabricator.wikimedia.org/T213996 (10mforns) We do not use MongoDB in EventLogging production. Thanks for the heads up. Removing EventLogg... 
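A quick note on the constraint-job bump deployed above: the figure quoted earlier, 0.15 jobs/s at a 1% sample, implies roughly 15 eligible Wikidata edits per second, so the new 5% sampling works out to about 0.75 ConstraintsCheckJobs per second and the planned 10% to about 1.5 per second, which is why the change is considered negligible for the job queue.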
[18:06:03] 10Operations, 10Performance-Team, 10Software-Licensing: New MongoDB version is not DFSG-compatible, dropped by Debian - https://phabricator.wikimedia.org/T213996 (10mforns) [18:09:33] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.29:1968]) [18:10:38] (03PS1) 10Andrew Bogott: cloudvirt1025: rename network ids for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/485074 [18:11:05] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 198.64 seconds [18:11:15] (03PS2) 10Andrew Bogott: cloudvirt1025: rename network ids for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/485074 [18:11:56] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1025: rename network ids for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/485074 (owner: 10Andrew Bogott) [18:12:45] !log bmansurov@deploy1001 Started deploy [recommendation-api/deploy@5ba7582]: Update to I25c97ed81f763a0c8fe56466cce50219ba707ea0 [18:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:02] 10Operations, 10Analytics, 10Performance-Team, 10Software-Licensing: New MongoDB version is not DFSG-compatible, dropped by Debian - https://phabricator.wikimedia.org/T213996 (10mforns) [18:18:14] 10Operations, 10Wikimedia-Mailing-lists: request of a new mailing list - https://phabricator.wikimedia.org/T214059 (10CDanis) p:05Triage→03Normal a:03CDanis [18:18:19] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (10ayounsi) a:05ayounsi→03faidon Discussed it with Brandon and we think that option 3 is the best path forward. Over to @faidon for thoughts/review. [18:18:22] !log bmansurov@deploy1001 Finished deploy [recommendation-api/deploy@5ba7582]: Update to I25c97ed81f763a0c8fe56466cce50219ba707ea0 (duration: 05m 36s) [18:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:27] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Jan-Mar 2019), 10Documentation: Add Srishti to analytics-privatedata-users - https://phabricator.wikimedia.org/T213780 (10srishakatux) @CDanis Thanks for checking in! I've signed the L3 document. Here is a copy of my public key explicitly generated... 
[18:19:52] !log running ipvsadm -D -t 10.2.1.29:1968 in lvs2006 - T214041 [18:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:55] T214041: inconsistencies between pybal configuration and IPVS status - https://phabricator.wikimedia.org/T214041 [18:21:07] (03PS2) 10Gehel: maps: change cassandra version [puppet] - 10https://gerrit.wikimedia.org/r/485072 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [18:21:57] (03CR) 10Gehel: [C: 03+2] maps: change cassandra version [puppet] - 10https://gerrit.wikimedia.org/r/485072 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [18:22:20] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal [18:22:51] !log running ipvsadm -D -t 10.2.1.29:1968 in lvs2003 - T214041 [18:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:28] RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal [18:28:54] RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 16.28 seconds [18:29:00] RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 0.29 seconds [18:29:04] RECOVERY - MariaDB Slave Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:29:06] RECOVERY - MariaDB Slave Lag: s2 on db2091 is OK: OK slave_sql_lag Replication lag: 0.23 seconds [18:29:36] RECOVERY - MariaDB Slave Lag: s2 on db2041 is OK: OK slave_sql_lag Replication lag: 0.39 seconds [18:29:40] RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:29:46] RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:38:40] RECOVERY - MariaDB Slave Lag: s2 on db2063 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:40:22] (03PS1) 10GTirloni: labstore::device_backup - Expose systemd OnCalendar syntax [puppet] - 10https://gerrit.wikimedia.org/r/485079 (https://phabricator.wikimedia.org/T209527) [18:40:36] RECOVERY - MariaDB Slave Lag: s6 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 192.76 seconds [18:42:10] 10Operations, 10Wikimedia-Mailing-lists: request of a new mailing list - https://phabricator.wikimedia.org/T214059 (10CDanis) a:05CDanis→03Giaccai WIKI-BNCF list is now live. Info: https://lists.wikimedia.org/mailman/listinfo/wiki-bncf Admin link to change settings: https://lists.wikimedia.org/mailman/ad... 
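For context on the ipvsadm commands logged above: the "Services in IPVS but unknown to PyBal" alert means the kernel's IPVS table still contains a virtual service that PyBal no longer manages, and the fix is to delete that stale entry by hand on the affected load balancer. A minimal sketch of the procedure, assuming root on the lvs host (the VIP and port come straight from the alert):
    # List the virtual services currently programmed into IPVS, numerically.
    ipvsadm -L -n
    # Delete the stale TCP virtual service that PyBal no longer knows about;
    # -D removes the whole virtual service, including its real-server entries.
    ipvsadm -D -t 10.2.1.29:1968
The opposite discrepancy seen earlier on lvs2002/lvs2005 (a service PyBal expects but IPVS lacks) is a different failure mode and is not something a manual delete would address.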
[18:42:31] !log ppchelko@deploy1001 Started deploy [restbase/deploy@f24d681]: Update recommendation api endpoints [18:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:56] (03CR) 10Bstorm: "> Patch Set 1:" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/484849 (https://phabricator.wikimedia.org/T107878) (owner: 10BryanDavis) [18:43:29] (03CR) 10Bstorm: [C: 03+1] Preserve restart attempt timestamps between runs [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/484849 (https://phabricator.wikimedia.org/T107878) (owner: 10BryanDavis) [18:43:32] (03PS2) 10GTirloni: labstore::device_backup - Expose systemd OnCalendar syntax [puppet] - 10https://gerrit.wikimedia.org/r/485079 (https://phabricator.wikimedia.org/T209527) [18:48:31] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Jan-Mar 2019), 10Documentation: Add Srishti to analytics-privatedata-users - https://phabricator.wikimedia.org/T213780 (10CDanis) a:05srishakatux→03CDanis [18:48:49] 10Operations, 10Pybal, 10Traffic: prometheus metrics apparently are missing some ipvs entries - https://phabricator.wikimedia.org/T214072 (10Vgutierrez) [18:48:55] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Jan-Mar 2019), 10Documentation: Add Srishti to analytics-privatedata-users - https://phabricator.wikimedia.org/T213780 (10CDanis) [18:49:02] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2002 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:860:ed1a::3:fe:53]) Vgutierrez T214072 [18:49:02] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2005 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:860:ed1a::3:fe:53]) Vgutierrez T214072 [18:50:09] 10Operations, 10Analytics, 10Performance-Team, 10Software-Licensing: New MongoDB version is not DFSG-compatible, dropped by Debian - https://phabricator.wikimedia.org/T213996 (10MaxSem) @mforns, so you don't need `python-pymongo` installed in `eventlogging::dependencies`? [18:50:54] 10Operations, 10Analytics, 10Performance-Team, 10Software-Licensing: New MongoDB version is not DFSG-compatible, dropped by Debian - https://phabricator.wikimedia.org/T213996 (10Ottomata) nope! def not. must be some super legacy thang. 
[18:52:36] (03PS1) 10Ottomata: Remove unused python-pymongo from eventlogging::dependencies [puppet] - 10https://gerrit.wikimedia.org/r/485080 (https://phabricator.wikimedia.org/T213996) [18:52:51] 10Operations, 10Discovery-Search, 10Maps: Fix maps puppet to make sure apt-get update runs after configuration change - https://phabricator.wikimedia.org/T214073 (10Mathew.onipe) [18:52:58] (03PS2) 10Ottomata: Remove unused python-pymongo from eventlogging::dependencies [puppet] - 10https://gerrit.wikimedia.org/r/485080 (https://phabricator.wikimedia.org/T213996) [18:53:04] (03PS1) 10Ayounsi: Move mr1-esams interco links to 91.198.174.0/24 [dns] - 10https://gerrit.wikimedia.org/r/485081 (https://phabricator.wikimedia.org/T211254) [18:54:50] 10Operations, 10Pybal, 10Traffic, 10monitoring: prometheus metrics apparently are missing some ipvs entries - https://phabricator.wikimedia.org/T214072 (10CDanis) p:05Triage→03Normal [18:55:37] (03CR) 10Bstorm: [C: 03+2] Preserve restart attempt timestamps between runs [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/484849 (https://phabricator.wikimedia.org/T107878) (owner: 10BryanDavis) [18:55:52] (03Abandoned) 10GTirloni: Limit manifest starts (max 10) [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/479181 (https://phabricator.wikimedia.org/T107878) (owner: 10GTirloni) [18:56:30] (03CR) 10Ottomata: [C: 03+2] Remove unused python-pymongo from eventlogging::dependencies [puppet] - 10https://gerrit.wikimedia.org/r/485080 (https://phabricator.wikimedia.org/T213996) (owner: 10Ottomata) [18:56:37] (03CR) 10Mobrovac: services: add missing 'mediawiki/services' prefix to git cloning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [18:57:17] (03CR) 10Bstorm: [C: 03+2] Code formatting and cleanup [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/484848 (owner: 10BryanDavis) [19:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Morning SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190117T1900). [19:00:04] Zoranzoki21 and stephanebisson: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:15] Hi [19:00:18] Here \o [19:00:32] I can SWAT [19:00:41] Oh tnx [19:01:08] Zoranzoki21: I'm just going to do all of yours at once if that's OK? 
[19:01:14] Ok [19:01:47] (03CR) 10Catrope: [C: 03+2] Update groupOverrides for srwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482609 (https://phabricator.wikimedia.org/T213055) (owner: 10Zoranzoki21) [19:01:50] (03PS7) 10Catrope: Update groupOverrides for srwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482609 (https://phabricator.wikimedia.org/T213055) (owner: 10Zoranzoki21) [19:01:55] (03CR) 10Catrope: Update groupOverrides for srwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482609 (https://phabricator.wikimedia.org/T213055) (owner: 10Zoranzoki21) [19:01:59] (03CR) 10Catrope: [C: 03+2] Update groupOverrides for srwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482609 (https://phabricator.wikimedia.org/T213055) (owner: 10Zoranzoki21) [19:02:02] (03PS1) 10Jforrester: [DNM] TestCommons: Enable federation of Wikidata items and properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485082 [19:02:19] (03PS3) 10Catrope: Update groupOverrides for srwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484187 (https://phabricator.wikimedia.org/T213679) (owner: 10Zoranzoki21) [19:02:21] Tell me when its ready for testing [19:02:38] (03CR) 10Catrope: [C: 03+2] Update groupOverrides for srwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484187 (https://phabricator.wikimedia.org/T213679) (owner: 10Zoranzoki21) [19:02:53] (03PS3) 10Catrope: Update groupOverrides for srwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484195 (https://phabricator.wikimedia.org/T213684) (owner: 10Zoranzoki21) [19:02:57] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@f24d681]: Update recommendation api endpoints (duration: 20m 26s) [19:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:04] (03CR) 10Catrope: [C: 03+2] Update groupOverrides for srwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484195 (https://phabricator.wikimedia.org/T213684) (owner: 10Zoranzoki21) [19:03:15] (03PS2) 10Catrope: Update groupOverrides for srwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484500 (https://phabricator.wikimedia.org/T213824) (owner: 10Zoranzoki21) [19:03:21] (03Merged) 10jenkins-bot: Update groupOverrides for srwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482609 (https://phabricator.wikimedia.org/T213055) (owner: 10Zoranzoki21) [19:03:27] (03CR) 10Catrope: [C: 03+2] Update groupOverrides for srwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484500 (https://phabricator.wikimedia.org/T213824) (owner: 10Zoranzoki21) [19:03:43] (03PS2) 10Catrope: Update groupOverrides for srwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484501 (https://phabricator.wikimedia.org/T213828) (owner: 10Zoranzoki21) [19:03:48] (03Merged) 10jenkins-bot: Update groupOverrides for srwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484187 (https://phabricator.wikimedia.org/T213679) (owner: 10Zoranzoki21) [19:03:52] (03CR) 10Catrope: [C: 03+2] Update groupOverrides for srwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484501 (https://phabricator.wikimedia.org/T213828) (owner: 10Zoranzoki21) [19:04:30] (03Merged) 10jenkins-bot: Update groupOverrides for srwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484195 (https://phabricator.wikimedia.org/T213684) (owner: 10Zoranzoki21) [19:04:36] (03Merged) 10jenkins-bot: Update groupOverrides for srwikibooks 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/484500 (https://phabricator.wikimedia.org/T213824) (owner: 10Zoranzoki21) [19:04:40] 10Operations, 10Recommendation-API, 10Release-Engineering-Team, 10Research, and 2 others: Recommendation API improvements - https://phabricator.wikimedia.org/T213222 (10bmansurov) [19:05:09] (03Merged) 10jenkins-bot: Update groupOverrides for srwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484501 (https://phabricator.wikimedia.org/T213828) (owner: 10Zoranzoki21) [19:05:41] (03PS2) 10Jforrester: [DNM] TestCommons: Enable federation of Wikidata items and properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485082 (https://phabricator.wikimedia.org/T214075) [19:06:02] (03CR) 10Jforrester: [C: 04-2] "Prefixing needs resolving first." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485082 (https://phabricator.wikimedia.org/T214075) (owner: 10Jforrester) [19:06:40] Zoranzoki21: OK they're all on mwdebug1002, please test [19:06:56] Ok, please wait [19:07:19] Zoranzoki21: In case you don't already know about it, Special:Listgrouprights is useful for this kind of stuff [19:07:52] RoanKattouw: I know, testing on all projects [19:09:24] (03CR) 10jenkins-bot: Update groupOverrides for srwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482609 (https://phabricator.wikimedia.org/T213055) (owner: 10Zoranzoki21) [19:09:26] (03CR) 10jenkins-bot: Update groupOverrides for srwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484187 (https://phabricator.wikimedia.org/T213679) (owner: 10Zoranzoki21) [19:09:28] (03CR) 10jenkins-bot: Update groupOverrides for srwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484195 (https://phabricator.wikimedia.org/T213684) (owner: 10Zoranzoki21) [19:09:30] (03CR) 10jenkins-bot: Update groupOverrides for srwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484500 (https://phabricator.wikimedia.org/T213824) (owner: 10Zoranzoki21) [19:09:32] (03CR) 10jenkins-bot: Update groupOverrides for srwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484501 (https://phabricator.wikimedia.org/T213828) (owner: 10Zoranzoki21) [19:09:40] RoanKattouw: LGTM everywhere [19:11:28] Everything is ok [19:11:43] On each project [19:13:37] OK, syncing [19:13:47] Tnx [19:13:51] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (10ayounsi) a:03ayounsi In addition to the above DNS change, the following needs to change on the routers: `lang=diff,name=cr1/2-esams - shrink /28 to /29 [edit routing-options aggre...
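For context on the testing step above: besides clicking through Special:ListGroupRights on each wiki, the same group-rights data can be fetched from the API, which makes it easy to compare before and after the config sync. A minimal sketch, using srwikisource as the example:
    # Dump the configured user groups and their rights for sr.wikisource.org; the rights
    # touched by these patches per the linked tasks (movefile, suppressredirect,
    # move-subpages, move-categorypages) should sit under the expected groups once live.
    curl -s 'https://sr.wikisource.org/w/api.php?action=query&meta=siteinfo&siprop=usergroups&format=json'
    # Before the sync, the same request can be pinned to the staging host the patches were
    # pulled onto; the X-Wikimedia-Debug header is what the browser extension sets, with a
    # value along the lines of 'backend=mwdebug1002.eqiad.wmnet' (exact syntax may vary).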
[19:13:53] 10Operations, 10Wikimedia-Mailing-lists: Create discourse-test mailing list - https://phabricator.wikimedia.org/T214077 (10Tgr) [19:14:11] 10Operations, 10Discourse, 10Wikimedia-Mailing-lists: Create discourse-test mailing list - https://phabricator.wikimedia.org/T214077 (10Tgr) [19:16:31] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Update groupOverrides for Serbian wikis (T213055, T213059, T213063, T213065, T213679, T213680, T213681, T213682, T213684, T213685, T213686, T213687, T213824, T213825, T213826, T213827, T213828, T213829, T213830, T213832) (duration: 00m 53s) [19:16:53] Checking without mwdebug [19:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:25] T213824: Enable moving categories only for bots, autopatrollers, patrollers, rollbackers, administrators and bureaucrats on srwikibooks - https://phabricator.wikimedia.org/T213824 [19:17:25] T213684: Disable moving categories for users without bot, autopatrol, patrol, rollback, sysop and bureaucrat rights on srwikiquote - https://phabricator.wikimedia.org/T213684 [19:17:26] T213826: Enabled suppressredirect for patrollers and rollbackers on srwikibooks - https://phabricator.wikimedia.org/T213826 [19:17:26] T213829: Enable moving files for patrollers and rollbackers on srwiktionary - https://phabricator.wikimedia.org/T213829 [19:17:27] T213827: Enable move-subpages for bots, autopatrollers, patrollers, rollbackers, administrators and bureaucrats on srwikibooks - https://phabricator.wikimedia.org/T213827 [19:17:27] T213828: Enable moving categories only for bots, autopatrollers, patrollers, rollbackers, administrators and bureaucrats on srwiktionary - https://phabricator.wikimedia.org/T213828 [19:17:27] T213063: Enable suppressredirect on srwikisource - https://phabricator.wikimedia.org/T213063 [19:17:28] T213686: Enable suppressredirect on srwikiquote - https://phabricator.wikimedia.org/T213686 [19:17:28] T213685: Enable moving files for patrollers and rollbackers on srwikiquote - https://phabricator.wikimedia.org/T213685 [19:17:29] T213679: Disable moving categories for users without bot, autopatrol, patrol, rollback, sysop and bureaucrat rights on srwikinews - https://phabricator.wikimedia.org/T213679 [19:17:29] T213830: Enable suppressredirect for patrollers and rollbackers on srwikinews - https://phabricator.wikimedia.org/T213830 [19:17:29] T213682: Enable move-subpages on srwikinews - https://phabricator.wikimedia.org/T213682 [19:17:30] T213065: Enable move-subpages on srwikisource - https://phabricator.wikimedia.org/T213065 [19:17:30] T213680: Enable moving files for patrollers and rollbackers on srwikinews - https://phabricator.wikimedia.org/T213680 [19:17:30] T213687: Enable move-subpages on srwikiquote - https://phabricator.wikimedia.org/T213687 [19:17:31] T213832: Enable move-subpages for bots, autopatrollers, patrollers, rollbackers, administrators and bureaucrats on srwiktionary - https://phabricator.wikimedia.org/T213832 [19:17:31] T213681: Enable suppressredirect on srwikinews - https://phabricator.wikimedia.org/T213681 [19:17:33] T213055: Disable moving categories for users without bot, autopatrol, patrol, rollback, sysop and bureaucrat rights on srwikisource - https://phabricator.wikimedia.org/T213055 [19:17:33] T213059: Enable moving files for patrollers and rollbackers on srwikisource - https://phabricator.wikimedia.org/T213059 [19:17:34] T213825: Enable moving files for patrollers and rollbackers on srwikibooks - 
https://phabricator.wikimedia.org/T213825 [19:17:54] stephanebisson: Yours is live on mwdebug1002, please test [19:17:59] 10Operations, 10ops-esams, 10netops: set up cr3-esams - https://phabricator.wikimedia.org/T174616 (10ayounsi) [19:18:02] 10Operations, 10Traffic, 10netops: Configure interface damping on primary links - https://phabricator.wikimedia.org/T196432 (10ayounsi) [19:18:03] Ok works everywhere thanks RoanKattouw [19:18:04] 10Operations, 10ops-ulsfo, 10Traffic, 10netops, 10Patch-For-Review: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552 (10ayounsi) [19:20:08] RoanKattouw: on it [19:23:50] RoanKattouw: I get an error page when I try to load the testwiki main page from that test server. [19:24:20] RoanKattouw: nvm, it's back, and it works as expected [19:24:42] 10Operations, 10netops: Outbound BGP graceful shutdown - https://phabricator.wikimedia.org/T211728 (10ayounsi) [19:25:45] 10Operations, 10netops: Outbound BGP graceful shutdown - https://phabricator.wikimedia.org/T211728 (10ayounsi) p:05Normal→03Low a:05ayounsi→03faidon Over to @faidon for review/feedback. [19:29:25] !log catrope@deploy1001 Synchronized php-1.33.0-wmf.13/extensions/GrowthExperiments/: Make welcome survey C unescapable (T213958) (duration: 00m 52s) [19:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:27] T213958: Welcome Survey: disable implicit escape of Variation C - https://phabricator.wikimedia.org/T213958 [19:32:52] And that's SWAT done [19:47:35] (03CR) 10Dzahn: [C: 03+1] "no difference anywhere: https://puppet-compiler.wmflabs.org/compiler1002/14369/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483798 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [19:47:52] onimisionipe: ^ merging it, ok? i see no difference in compiler at all [19:48:14] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.86 seconds [19:48:22] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.46 seconds [19:48:22] PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.60 seconds [19:48:28] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.71 seconds [19:48:48] PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.28 seconds [19:49:00] PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.25 seconds [19:49:06] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.27 seconds [19:49:12] PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.61 seconds [19:49:31] mutante: its merged already [19:49:58] onimisionipe: oh.. i did not reload my tab :p [19:50:00] ok, cool [19:50:11] 10Operations, 10Analytics, 10DBA, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) [19:50:43] 10Operations, 10Analytics, 10DBA, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) @Ottomata thanks! I've updated the task description and ping the groups you mentioned. 
[19:51:02] mutante: no p [19:51:10] 10Operations, 10Analytics, 10DBA, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) [19:52:15] 10Operations, 10Analytics, 10DBA, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) @Banyek and @Dzahn I'd appreciate your input on this task. Thank you. [19:57:54] 10Operations, 10Discovery-Search, 10Maps: Fix maps puppet to make sure apt-get update runs after configuration change - https://phabricator.wikimedia.org/T214073 (10CDanis) p:05Triage→03Normal [20:00:04] marxarelli: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Americas version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190117T2000). [20:00:48] 10Operations, 10Discourse, 10Wikimedia-Mailing-lists: Create discourse-test mailing list - https://phabricator.wikimedia.org/T214077 (10CDanis) It looks like we've been down this road before...? I found T126547, T124690, and https://lists.wikimedia.org/mailman/listinfo/discourse Is that existing list suffi... [20:01:03] 10Operations, 10Analytics, 10DBA, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Ottomata) > We should have a clear separation of concerns and while the hadoop cluster is in charge of computing the data the t... [20:04:23] (03PS1) 10Dduvall: all wikis to 1.33.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485085 [20:04:25] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.33.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485085 (owner: 10Dduvall) [20:05:33] (03Merged) 10jenkins-bot: all wikis to 1.33.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485085 (owner: 10Dduvall) [20:06:09] here comes the choo choo [20:06:42] 10Operations, 10Analytics, 10DBA, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Dzahn) How to install the importer scripts is what i started once in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/476... [20:06:45] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.13 [20:07:48] dduvall@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
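For context on the train promotion logged above: the "all wikis to 1.33.0-wmf.13" patch only touches wikiversions.json in mediawiki-config, which maps every wiki's database name to the MediaWiki branch it runs, and "rebuilt and synchronized wikiversions files" is scap compiling that JSON and pushing it to the app servers. A rough sketch of a post-sync sanity check on the deploy host (the path is the conventional staging checkout and worth double-checking):
    # No wiki should still be pinned to the previous branch after the promotion.
    grep -c 'php-1.33.0-wmf.12' /srv/mediawiki-staging/wikiversions.json   # expect 0
    grep -c 'php-1.33.0-wmf.13' /srv/mediawiki-staging/wikiversions.json   # expect one line per wiki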
[20:10:44] PROBLEM - Apache HTTP on mw1267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:11:46] RECOVERY - Apache HTTP on mw1267 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.030 second response time [20:13:31] 10Operations, 10ops-eqiad: cloudstore100{8,9} - Upgrade to 10GbE - https://phabricator.wikimedia.org/T214079 (10GTirloni) [20:15:36] PROBLEM - HHVM rendering on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:38] (03CR) 10jenkins-bot: all wikis to 1.33.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485085 (owner: 10Dduvall) [20:15:48] PROBLEM - Nginx local proxy to apache on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:50] PROBLEM - Apache HTTP on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:42] RECOVERY - HHVM rendering on mw1322 is OK: HTTP OK: HTTP/1.1 200 OK - 77301 bytes in 0.132 second response time [20:16:52] RECOVERY - Nginx local proxy to apache on mw1322 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.042 second response time [20:16:54] RECOVERY - Apache HTTP on mw1322 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.033 second response time [20:21:36] (03CR) 10Dzahn: "doing that lead to a "deploy/deploy" path. so i am not sure anymore what the fix is :/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [20:21:43] 10Operations, 10Analytics, 10DBA, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Nuria) >It has been abandoned after Analytics said to not use stat hosts and use Hadoop instead. To clarify: stats machines sho... [20:23:14] RECOVERY - MariaDB Slave Lag: s2 on db1125 is OK: OK slave_sql_lag Replication lag: 23.78 seconds [20:24:27] 10Operations, 10Analytics, 10DBA, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Ottomata) > to have a daemon on the mysql hosts To clarify, it is unlikely these scripts would run on the mysql servers themsel... [20:31:21] 10Operations, 10Analytics, 10DBA, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Nuria) I think one telling use case the ilustrates why we want to decouple data loading from hadoop is a rollback. Say that yo... [20:37:14] 10Operations, 10Analytics, 10Performance-Team, 10Patch-For-Review, 10Software-Licensing: New MongoDB version is not DFSG-compatible, dropped by Debian - https://phabricator.wikimedia.org/T213996 (10Krinkle) [20:37:17] 10Operations, 10Analytics, 10DBA, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) Rollback is already taken care of the in the script level. We'll have different versions of the data in MySQL and ca... 
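For context on the rollback point in the T213566 thread above: keeping "different versions of the data in MySQL" is typically done by loading each import into its own versioned table and only repointing the live name once the new data has been validated, which turns a rollback into a metadata operation instead of a re-import. A minimal sketch of that pattern (the table names and dated suffixes are purely illustrative, not the actual recommendation-api schema):
    # Illustrative only: the fresh snapshot was loaded into a dated table, and the swap is
    # applied atomically so readers never see a half-loaded or missing table.
    mysql -e "RENAME TABLE article_recommendation TO article_recommendation_20190110,
                           article_recommendation_20190117 TO article_recommendation;"
    # Rolling back is the same statement with the two names reversed; old snapshots are
    # dropped later when capacity becomes a concern.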
[20:37:53] 10Operations, 10Performance-Team, 10Patch-For-Review, 10Software-Licensing: New MongoDB version is not DFSG-compatible, dropped by Debian - https://phabricator.wikimedia.org/T213996 (10Nuria) [20:39:52] 10Operations, 10Performance-Team, 10Software-Licensing: New MongoDB version is not DFSG-compatible, dropped by Debian - https://phabricator.wikimedia.org/T213996 (10Krinkle) [20:43:34] (03PS1) 10Gehel: elasticsearch: allow cumin to connect to elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/485092 [20:44:10] (03PS7) 10Dzahn: geoip::maxmind: add data types, rm deprecated validate_string [puppet] - 10https://gerrit.wikimedia.org/r/483222 [20:45:00] 10Operations, 10ops-eqiad: cloudstore100{8,9} - Upgrade to 10GbE - https://phabricator.wikimedia.org/T214079 (10RobH) cloudstore1008 is in a5 cloudstore1009 and its array are in a6 @cmjohnson attempted to address the 10G options during the racking and setup on T193655#4264714. So, it seems in prior conversat... [20:46:02] 10Operations, 10Analytics, 10DBA, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Nuria) @bmansurov how do handle deleting data in your storage when you have reached capacity or when that dataset is bad? There... [20:47:54] (03CR) 10Dzahn: geoip::maxmind: add data types, rm deprecated validate_string (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483222 (owner: 10Dzahn) [20:49:22] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10RobH) Update from IRC sync with @Cmjohnson: - Verified with each service owner that all servers were depooled and powered off - I ran power cables across to rack A4 f... [20:52:01] 10Operations, 10Analytics, 10DBA, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) @Nuria > how do handle deleting data in your storage when you have reached capacity or when that dataset is bad? T... 
[20:52:58] RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 24.24 seconds [20:53:02] RECOVERY - MariaDB Slave Lag: s7 on db2095 is OK: OK slave_sql_lag Replication lag: 9.28 seconds [20:53:06] RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 0.31 seconds [20:53:10] RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 0.26 seconds [20:53:16] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [20:53:23] (03PS1) 10Dzahn: jenkins: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485094 [20:53:26] RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 0.27 seconds [20:53:40] RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 0.43 seconds [20:53:46] RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [20:53:50] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [20:53:54] (03CR) 10jerkins-bot: [V: 04-1] jenkins: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485094 (owner: 10Dzahn) [20:54:54] (03PS1) 10Catrope: labs: Configure $wgGEHelpPanelSearchForeignAPI for beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485095 (https://phabricator.wikimedia.org/T214083) [20:55:36] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10RobH) 05Open→03Resolved a:03RobH So, the three failed hosts followups or existing tasks (all were existing): analytics1054 T213038 analytics1056 T214057 restbas... [20:56:02] 10Operations, 10ops-eqiad, 10Analytics: Rack A2's hosts alarm for PSU broken - https://phabricator.wikimedia.org/T212861 (10RobH) [20:56:07] 10Operations, 10ops-eqiad, 10Analytics, 10DBA, 10Patch-For-Review: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10RobH) 05Open→03Resolved a:03RobH Synced up with Chris via IRC: All systems were able to come back up within a2 without incident. The spare PDU is... 
[20:56:49] (03PS1) 10Dzahn: contint: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485096 [20:57:23] (03PS1) 10Catrope: labs: Remove $wgGEHelpPanelSearchDevMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485097 (https://phabricator.wikimedia.org/T214083) [20:57:49] (03PS2) 10Catrope: labs: Remove $wgGEHelpPanelSearchDevMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485097 (https://phabricator.wikimedia.org/T214083) [20:58:28] (03PS1) 10Dzahn: ifft: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485098 [21:01:37] (03PS1) 10Dzahn: package_builder: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485099 [21:06:03] (03PS1) 10Dzahn: releases: add data types to parameters, move vars to params [puppet] - 10https://gerrit.wikimedia.org/r/485100 [21:06:31] (03CR) 10jerkins-bot: [V: 04-1] releases: add data types to parameters, move vars to params [puppet] - 10https://gerrit.wikimedia.org/r/485100 (owner: 10Dzahn) [21:09:01] (03PS1) 10Dzahn: wikistats: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485101 [21:09:28] (03CR) 10jerkins-bot: [V: 04-1] wikistats: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485101 (owner: 10Dzahn) [21:25:08] (03PS1) 10Dzahn: mediawiki: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485104 [21:25:55] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485104 (owner: 10Dzahn) [21:26:54] (03CR) 10Mobrovac: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [21:27:31] (03CR) 10Mobrovac: "Latest PCC - https://puppet-compiler.wmflabs.org/compiler1002/14368/" [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [21:29:52] RECOVERY - MariaDB Slave Lag: s7 on db1125 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [21:30:52] (03CR) 10Sbisson: [C: 03+2] labs: Configure $wgGEHelpPanelSearchForeignAPI for beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485095 (https://phabricator.wikimedia.org/T214083) (owner: 10Catrope) [21:30:58] (03CR) 10Sbisson: [C: 03+2] labs: Remove $wgGEHelpPanelSearchDevMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485097 (https://phabricator.wikimedia.org/T214083) (owner: 10Catrope) [21:31:00] (03PS1) 10Dzahn: webperf: add data types, split statsd host/port params [puppet] - 10https://gerrit.wikimedia.org/r/485106 [21:31:28] (03CR) 10jerkins-bot: [V: 04-1] webperf: add data types, split statsd host/port params [puppet] - 10https://gerrit.wikimedia.org/r/485106 (owner: 10Dzahn) [21:31:58] 10Operations, 10Discourse, 10Wikimedia-Mailing-lists: Create discourse-test mailing list - https://phabricator.wikimedia.org/T214077 (10Qgil) Yes, sorry, I wasn't aware of that list. Yay! 
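For context on the string of "add data types to parameters" patches above: these swap the old stdlib validate_* calls for Puppet 4 typed parameters, so a bad value is rejected at catalog compilation with a clear error instead of failing somewhere inside the manifest. A minimal before/after sketch (the class and parameter names are made up for illustration, not taken from the actual modules):
    # Before: untyped parameters, validated in the body with the deprecated stdlib function.
    class profile::example (
      $server,
      $port = 8080,
    ) {
      validate_string($server)
    }

    # After: the same constraints expressed as data types on the parameters themselves.
    class profile::example (
      Stdlib::Host $server,
      Stdlib::Port $port = 8080,
    ) {
      # no validate_string() call needed any more
    }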
[21:32:00] (Merged) jenkins-bot: labs: Configure $wgGEHelpPanelSearchForeignAPI for beta wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/485095 (https://phabricator.wikimedia.org/T214083) (owner: Catrope)
[21:33:39] (CR) jenkins-bot: labs: Configure $wgGEHelpPanelSearchForeignAPI for beta wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/485095 (https://phabricator.wikimedia.org/T214083) (owner: Catrope)
[21:36:08] Operations, Discourse, Wikimedia-Mailing-lists: Create discourse-test mailing list - https://phabricator.wikimedia.org/T214077 (Qgil)
[21:36:10] Operations, Cloud-Services, Discourse, Wikimedia-Mailing-lists: Create temporary test mailman mailing list to test synchronization with https://discourse.wmflabs.org/ - https://phabricator.wikimedia.org/T126547 (Qgil)
[21:43:38] (PS1) Dzahn: helm: add data types to parameters [puppet] - https://gerrit.wikimedia.org/r/485110
[21:51:48] (PS1) Sbisson: Enable the Welcome survey on viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/485112 (https://phabricator.wikimedia.org/T213356)
[21:56:03] (PS1) BryanDavis: Fix "AttributeError: 'NoneType' object has no attribute 'get'" [software/tools-manifest] - https://gerrit.wikimedia.org/r/485113
[21:56:49] (CR) BryanDavis: [C: +2] Fix "AttributeError: 'NoneType' object has no attribute 'get'" [software/tools-manifest] - https://gerrit.wikimedia.org/r/485113 (owner: BryanDavis)
[21:57:05] Operations, ops-eqiad, RESTBase, RESTBase-Cassandra, and 3 others: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 (mobrovac) So now we should be able to get `restbase1016` back into the cluster. Since we need to re-bootstrap the instances in, we can either: 1. carefully...
[21:57:17] (Merged) jenkins-bot: Fix "AttributeError: 'NoneType' object has no attribute 'get'" [software/tools-manifest] - https://gerrit.wikimedia.org/r/485113 (owner: BryanDavis)
[22:02:40] Operations, netops: Investigate network issues in codfw that caused 503 errors - https://phabricator.wikimedia.org/T209145 (ayounsi) Open→Resolved My guess based on those clues, is that this link flap caused at least some traffic from eqiad to codfw to be blackholed. Most likely the time protocol...
[22:03:40] Operations, netops: OSPF metrics - https://phabricator.wikimedia.org/T200277 (ayounsi) p:Normal→Low Low priority, over to @faidon for feedbacks.
[22:03:51] Operations, netops: OSPF metrics - https://phabricator.wikimedia.org/T200277 (ayounsi) a:ayounsi→faidon
[22:05:32] (CR) Hashar: [C: +1] Use python2 as basepython [software] - https://gerrit.wikimedia.org/r/484806 (owner: Thcipriani)
[22:18:29] Operations, monitoring, Goal, Patch-For-Review: Upgrade production prometheus-node-exporter to >= 0.16 - https://phabricator.wikimedia.org/T213708 (colewhite) After deploying the rules and node-exporter v0.17 to deployment-prometheus02, it appears they are not for backwards compatibility, but for...
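BryanDavis's tools-manifest fix above targets the usual cause of "'NoneType' object has no attribute 'get'": a lookup that can return None being chained straight into .get(). The actual patch isn't quoted in this log, so the snippet below is only a generic sketch of that failure mode and the common guard, with hypothetical names (load_manifest, the "webservice" key), not the tools-manifest code itself.

```python
# Generic sketch of the "'NoneType' object has no attribute 'get'" failure and a guard.
# All names here are hypothetical illustrations, not the actual tools-manifest code.
from typing import Optional


def load_manifest(tool: str) -> Optional[dict]:
    """Pretend loader: returns the tool's manifest, or None if it has none."""
    manifests = {"toolA": {"webservice": "python3.5"}}
    return manifests.get(tool)


def webservice_type_buggy(tool: str) -> Optional[str]:
    # Raises AttributeError when load_manifest() returns None.
    return load_manifest(tool).get("webservice")


def webservice_type_fixed(tool: str) -> Optional[str]:
    # Guard against a missing manifest before chaining .get().
    manifest = load_manifest(tool) or {}
    return manifest.get("webservice")


if __name__ == "__main__":
    print(webservice_type_fixed("toolA"))    # python3.5
    print(webservice_type_fixed("no-such"))  # None
    try:
        webservice_type_buggy("no-such")
    except AttributeError as err:
        print(err)  # 'NoneType' object has no attribute 'get'
```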
[22:27:07] (PS1) Catrope: labs: Actually enable the help panel on Catalan beta [mediawiki-config] - https://gerrit.wikimedia.org/r/485118
[22:27:17] (CR) Catrope: [C: +2] labs: Actually enable the help panel on Catalan beta [mediawiki-config] - https://gerrit.wikimedia.org/r/485118 (owner: Catrope)
[22:28:30] (Merged) jenkins-bot: labs: Actually enable the help panel on Catalan beta [mediawiki-config] - https://gerrit.wikimedia.org/r/485118 (owner: Catrope)
[22:31:13] Operations, Wikimedia-Logstash, User-fgiunchedi, User-herron: Increase utilization of application logging pipeline (FY2018-2019 Q3 TEC6) - https://phabricator.wikimedia.org/T213157 (Cmjohnson)
[22:34:13] (CR) jenkins-bot: labs: Actually enable the help panel on Catalan beta [mediawiki-config] - https://gerrit.wikimedia.org/r/485118 (owner: Catrope)
[22:36:21] (CR) Smalyshev: [C: +1] "lgtm though I'm not super-familiar with the current script." [puppet] - https://gerrit.wikimedia.org/r/484974 (https://phabricator.wikimedia.org/T213305) (owner: Mathew.onipe)
[22:36:23] (CR) Volans: [C: -1] "To my understanding the cumin masters are already part of the domain networks." [puppet] - https://gerrit.wikimedia.org/r/485092 (owner: Gehel)
[22:47:15] (PS2) Krinkle: webperf: add data types, split statsd host/port params [puppet] - https://gerrit.wikimedia.org/r/485106 (owner: Dzahn)
[22:58:41] Operations, ops-eqiad: cloudstore100{8,9} - Upgrade to 10GbE - https://phabricator.wikimedia.org/T214079 (GTirloni) It seems the requirement to be side-by-side might have been because we were planning to use DRBD and needed a direct host-to-host connection but we're not pursuing DRBD anymore. In discuss...
[22:59:03] Operations, monitoring, Goal, Patch-For-Review: Upgrade production prometheus-node-exporter to >= 0.16 - https://phabricator.wikimedia.org/T213708 (colewhite) Tested command line flags for prometheus-node-exporter v0.17 ` --collector.diskstats.ignored-devices=^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+...
[23:00:24] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@6b344ca]: Update mobileapps to 258d76b page summary changes, 2nd try
[23:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:02:28] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@6b344ca]: Update mobileapps to 258d76b page summary changes, 2nd try (duration: 02m 03s)
[23:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:03:41] Operations, Analytics, Research, Article-Recommendation, User-Marostegui: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (Marostegui) I don't think writing from Hadoop directly to M2 master is a good idea. But it is not really my call....
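The --collector.diskstats.ignored-devices flag quoted above takes an anchored device-name regex, and the pasted value is cut off. As a quick way to see what such a pattern skips, here is a small sanity check; the regex below approximates the flag from the task and is not necessarily the exact production value.

```python
# Sanity-check which block devices a diskstats ignore-pattern would skip.
# The pattern approximates the flag quoted above; the exact value is in T213708.
import re

IGNORED = re.compile(r"^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$")

devices = ["sda", "sda1", "nvme0n1", "nvme0n1p1", "loop0", "md0"]
for dev in devices:
    status = "ignored" if IGNORED.match(dev) else "collected"
    print(f"{dev}: {status}")
```

Because the pattern is anchored and ends in \d+$, whole-disk devices such as sda and nvme0n1 are still collected while their partitions (sda1, nvme0n1p1) are skipped.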
[23:13:58] Operations, Wikimedia-Mailing-lists: request of a new mailing list WIKI-BNCF - https://phabricator.wikimedia.org/T214059 (Aklapper)
[23:16:06] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.69 seconds
[23:16:06] PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.22 seconds
[23:16:16] PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.96 seconds
[23:16:18] PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.29 seconds
[23:16:28] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.49 seconds
[23:16:36] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.02 seconds
[23:16:50] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.50 seconds
[23:16:54] PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.34 seconds
[23:16:58] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 316.27 seconds
[23:22:46] (CR) Krinkle: [C: -1] "webperf001 fails to compile - https://puppet-compiler.wmflabs.org/compiler1002/14370/" [puppet] - https://gerrit.wikimedia.org/r/485106 (owner: Dzahn)
[23:25:20] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 255.01 seconds
[23:29:40] RECOVERY - MariaDB Slave Lag: s8 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 181.45 seconds
[23:49:10] RECOVERY - MariaDB Slave Lag: s2 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 289.49 seconds
[23:49:48] (PS3) Dzahn: webperf: add data types, split statsd host/port params [puppet] - https://gerrit.wikimedia.org/r/485106
[23:53:26] (PS1) MaxSem: eventlogging: Remove all mentions of MongoDB [puppet] - https://gerrit.wikimedia.org/r/485127
[23:56:31] (PS4) Dzahn: webperf: add data types, split statsd host/port params [puppet] - https://gerrit.wikimedia.org/r/485106