[00:01:23] !next
[00:01:52] oh, silly me the time change means swat was an hour ago .. i'm the only one in it so will deploy now
[00:03:06] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[00:31:05] (PS3) CRusnov: cables: detect duplicate cable names, and blank cable names [software/netbox-reports] - https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007)
[00:32:35] (PS1) BBlack: ATS: Support X-W-D for /w/api.php as well [puppet] - https://gerrit.wikimedia.org/r/550774 (https://phabricator.wikimedia.org/T237687)
[00:32:41] (CR) CRusnov: cables: detect duplicate cable names, and blank cable names (4 comments) [software/netbox-reports] - https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) (owner: CRusnov)
[00:33:57] (CR) CRusnov: [V: +2 C: +2] "I'm going to self-merge as this is an ongoing alert noise and i have tested it extensively while debugging the 500s."
[software/netbox-deploy] - https://gerrit.wikimedia.org/r/550741 (https://phabricator.wikimedia.org/T224946) (owner: CRusnov)
[00:36:48] !log ebernhardson@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/CirrusSearch/includes/BuildDocument/BuildDocument.php: T237849: Restore CirrusSearchBuildDocumentParse hook (duration: 00m 54s)
[00:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:54] T237849: Commons search seems to have stopped indexing statements since 30 October 2019 - https://phabricator.wikimedia.org/T237849
[00:38:25] !log crusnov@deploy1001 Started deploy [netbox/deploy@56df4a5]: deploy netbox for script update
[00:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:09] !log crusnov@deploy1001 Finished deploy [netbox/deploy@56df4a5]: deploy netbox for script update (duration: 00m 44s)
[00:39:11] !log crusnov@deploy1001 Started deploy [netbox/deploy@56df4a5]: deploy netbox for script update
[00:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:40:00] !log crusnov@deploy1001 Finished deploy [netbox/deploy@56df4a5]: deploy netbox for script update (duration: 00m 49s)
[00:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:07] !log T237849 Start CirrusSearch forceSearchIndex.php commonswiki 2019-10-20T00:00:00 - 2019-11-14T01:00:00 pushing into jobqueue
[00:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:27] (PS1) Paladox: Gerrit: Increase defaultThreadPoolSize to 2 [puppet] - https://gerrit.wikimedia.org/r/550776
[01:02:15] (PS2) Paladox: Gerrit: Increase defaultThreadPoolSize to 2 [puppet] - https://gerrit.wikimedia.org/r/550776
[02:02:38] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[02:05:10] PROBLEM - Host cp1077 is DOWN: PING CRITICAL - Packet loss = 100%
[02:12:04] PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={cp2001:9536,cp2004:9536,cp2006:9536,cp2007:9536,cp2010:9536,cp2012:9536,cp2013:9536,cp2016:9536,cp2019:9536,cp2023:9536} site=codfw tunnel={cp1077_v4,cp1077_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[02:12:28] PROBLEM - Aggregate IPsec Tunnel Status esams on icinga1001 is CRITICAL: instance={cp3058:9536,cp3060:9536,cp3062:9536,cp3064:9536} site=esams tunnel={cp1077_v4,cp1077_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[02:19:47] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[03:17:10] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1006.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[03:20:58] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1006.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1006.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:21:08] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1006.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1006.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:21:52] PROBLEM - PyBal IPVS diff check on lvs1015
is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1006.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[03:26:14] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:27:36] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[03:27:48] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:28:36] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[03:33:58] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[03:48:22] Operations, ops-eqiad, Traffic: cp1077 is unreachable - https://phabricator.wikimedia.org/T238289 (Vgutierrez)
[03:49:25] !log depooling cp1077 - T238289
[03:49:28] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1077.eqiad.wmnet
[03:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:49:31] T238289: cp1077 is unreachable - https://phabricator.wikimedia.org/T238289
[03:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:49:52] !log power cycling cp1077 - T238289
[03:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:50:25] (PS1) Volans: admin: temporarily revoke tgr's keys [puppet] - https://gerrit.wikimedia.org/r/550782
[03:53:50] (CR) CDanis: [C: +1] admin: temporarily revoke tgr's keys [puppet] - https://gerrit.wikimedia.org/r/550782 (owner: Volans)
[03:54:06] (CR) Volans: [C: +2] admin: temporarily revoke tgr's keys [puppet] - https://gerrit.wikimedia.org/r/550782 (owner: Volans)
[03:56:54] RECOVERY - Host
cp1077 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[03:57:50] RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[03:58:22] RECOVERY - Aggregate IPsec Tunnel Status esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[04:05:52] Operations, Wikimedia-Mailing-lists, Privacy, Security, User-Josve05a: Stop storing Mailman passwords in plain text - https://phabricator.wikimedia.org/T181803 (Apap04) Can this be locked to #security or #acl_security_team members only? It's not smart to keep such issue available for everyone...
[04:07:28] Operations, ops-eqiad, Traffic: cp1077 is unreachable - https://phabricator.wikimedia.org/T238289 (Vgutierrez) Nothing on the logs as well, this looks awfully familiar to T237348 and T238032
[04:19:07] Operations, Wikimedia-Mailing-lists, Privacy, Security, User-Josve05a: Stop storing Mailman passwords in plain text - https://phabricator.wikimedia.org/T181803 (Bawolff) It's kind of obvious when mailman keeps sending people monthly password reminders
[04:42:48] (CR) KartikMistry: [C: +2] Update cxserver to 2019-11-13-111130-production [deployment-charts] - https://gerrit.wikimedia.org/r/550663 (https://phabricator.wikimedia.org/T237379) (owner: KartikMistry)
[04:46:35] (PS2) KartikMistry: Update cxserver to 2019-11-13-111130-production [deployment-charts] - https://gerrit.wikimedia.org/r/550663 (https://phabricator.wikimedia.org/T237379)
[04:49:38] !log kartik@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' .
[04:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:51:33] !log kartik@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' .
[04:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:55:23] (PS1) Volans: Revert "admin: temporarily revoke tgr's keys" [puppet] - https://gerrit.wikimedia.org/r/550787
[04:56:18] !log kartik@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' .
[04:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:58:23] (CR) Volans: [C: +2] Revert "admin: temporarily revoke tgr's keys" [puppet] - https://gerrit.wikimedia.org/r/550787 (owner: Volans)
[05:01:11] !log Updated cxserver to 2019-11-13-111130-production tag (T237379, T235748, T236906)
[05:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:19] T237379: Add szywiki to cxserver - https://phabricator.wikimedia.org/T237379
[05:01:19] T236906: Sentences don't get highlighted in some articles - https://phabricator.wikimedia.org/T236906
[05:01:19] T235748: Add mnwwiki to cxserver - https://phabricator.wikimedia.org/T235748
[05:05:39] Is the s1 read only happening at 05:00 UTC as originally announced, or at 06:00 UTC as implied by the email that Manuel Arostegui just sent out?
[05:06:34] AntiComposite: At 06:00 UTC, when we first scheduled we didn't realise EU would have its change from CEST to CET :(
[05:07:24] !log Start pre-failover steps T234800
[05:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:07:29] T234800: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T234800
[05:07:52] Hey, at least you get to blame DST. Plenty of people on the tech lists mess up UTC even without it!
[05:08:11] lol
[05:08:23] AntiComposite: hahaha
[05:08:33] !log Repooling cp1077 - T238289
[05:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:38] T238289: cp1077 is unreachable - https://phabricator.wikimedia.org/T238289
[05:09:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1083 with weight 0 T234800', diff saved to https://phabricator.wikimedia.org/P9625 and previous config saved to /var/cache/conftool/dbconfig/20191114-050940-marostegui.json
[05:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:36] !log Move replicas from db1067 to db1083 T234800
[05:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:41] T234800: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T234800
[05:23:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1130 after schema change', diff saved to https://phabricator.wikimedia.org/P9626 and previous config saved to /var/cache/conftool/dbconfig/20191114-052303-marostegui.json
[05:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110 for schema change', diff saved to https://phabricator.wikimedia.org/P9627 and previous config saved to /var/cache/conftool/dbconfig/20191114-052400-marostegui.json
[05:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:59] (PS2) Marostegui: mariadb: Promote db1083 to s1 master [puppet] - https://gerrit.wikimedia.org/r/550587 (https://phabricator.wikimedia.org/T234800)
[05:28:03] (CR) Marostegui: mariadb: Promote db1083 to s1 master [puppet] - https://gerrit.wikimedia.org/r/550587 (https://phabricator.wikimedia.org/T234800) (owner: Marostegui)
[05:29:30] (CR) Marostegui: wmnet: Point s1-master to db1083 [dns] - https://gerrit.wikimedia.org/r/550588
(https://phabricator.wikimedia.org/T234800) (owner: Marostegui)
[05:29:33] (PS2) Marostegui: wmnet: Point s1-master to db1083 [dns] - https://gerrit.wikimedia.org/r/550588 (https://phabricator.wikimedia.org/T234800)
[05:32:19] (CR) Marostegui: [C: +2] mariadb: Promote db1083 to s1 master [puppet] - https://gerrit.wikimedia.org/r/550587 (https://phabricator.wikimedia.org/T234800) (owner: Marostegui)
[05:33:42] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[05:34:32] !log Compress db2089:3316 - T235599
[05:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:38] T235599: Recompress special slaves across eqiad and codfw - https://phabricator.wikimedia.org/T235599
[05:51:15] !log stopping db1114 replication
[05:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:19] (CR) Volans: [C: +1] "LGTM, compiler results here:" [puppet] - https://gerrit.wikimedia.org/r/549847 (https://phabricator.wikimedia.org/T237197) (owner: Alexandros Kosiaris)
[06:00:04] marostegui and jynus: My dear minions, it's time we take the moon! Just kidding. Time for s1 database master failover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191114T0600).
[06:00:07] jynus: ready?
[06:00:10] yes
[06:00:14] \o/
[06:00:16] !log Starting s1 failover from db1067 to db1083 - T234800
[06:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:21] T234800: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T234800
[06:00:28] !log marostegui@cumin2001 dbctl commit (dc=all): 'Set s1 as read-only for maintenance T234800', diff saved to https://phabricator.wikimedia.org/P9628 and previous config saved to /var/cache/conftool/dbconfig/20191114-060026-marostegui.json
[06:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:43] The Wikipedia database is temporarily in read-only mode
[06:00:45] yep
[06:00:48] proceeding
[06:00:51] * volans around
[06:01:36] topology done
[06:01:39] !log marostegui@cumin2001 dbctl commit (dc=all): 'Promote db1083 to s1 master and remove read-only from s1 T234800', diff saved to https://phabricator.wikimedia.org/P9629 and previous config saved to /var/cache/conftool/dbconfig/20191114-060138-marostegui.json
[06:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:43] RO removed
[06:01:55] I can edit
[06:01:58] same
[06:02:02] let's monitor
[06:02:09] no errors so far
[06:02:25] I can see recentchanges moving
[06:02:28] a few exceptions
[06:02:51] tendril looks good
[06:03:24] everything looks fine I think
[06:03:41] rows written showed a dip during read only but recovered
[06:03:55] rows read seems a bit high
[06:04:08] but not abnormally high
[06:04:17] maybe retries?
[06:04:37] (CR) Marostegui: [C: +2] wmnet: Point s1-master to db1083 [dns] - https://gerrit.wikimedia.org/r/550588 (https://phabricator.wikimedia.org/T234800) (owner: Marostegui)
[06:04:37] it recovered
[06:05:02] could or could not be related, there are previous spikes like it before
[06:05:15] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/522567 (owner: Dzahn)
[06:05:23] let me check the job queue
[06:05:58] exceptions finished
[06:06:32] probably unrelated, but to check later "Failed to load data blob from tt:13359121: Bad data in text row 13359121"
[06:06:41] cool
[06:07:17] we have a few weird enwiki searches
[06:07:30] like which ones
[06:08:45] (PS1) Marostegui: db1134: Set it as candidate master for s1 [puppet] - https://gerrit.wikimedia.org/r/550793 (https://phabricator.wikimedia.org/T234800)
[06:09:00] https://logstash.wikimedia.org/goto/6e992648a13c0805a261b1cc2f2f58f9
[06:09:19] not weird, just repeating
[06:10:10] hehe
[06:10:12] I wonder if that exception has been reported
[06:10:21] PHP Notice: Array to string conversion
[06:10:27] someone might have exams soon!
[06:10:39] (CR) Marostegui: [C: +2] db1134: Set it as candidate master for s1 [puppet] - https://gerrit.wikimedia.org/r/550793 (https://phabricator.wikimedia.org/T234800) (owner: Marostegui)
[06:12:15] Operations, DBA, Patch-For-Review: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T234800 (Marostegui)
[06:12:28] yes, reported already https://phabricator.wikimedia.org/T237559
[06:14:30] RECOVERY - WDQS high update lag on wdqs1005 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.135e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[06:16:49] !log Stop replication on db1067
[06:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:37] Operations, DBA, Patch-For-Review: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T234800 (Marostegui) This was done.
Read only start: 06:00:28 Read only stop: 06:01:39 Total read only time: 01:11 minutes
[06:41:01] Operations, DBA, Patch-For-Review: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T234800 (Marostegui) Open→Resolved
[06:41:05] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[06:53:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1067', diff saved to https://phabricator.wikimedia.org/P9630 and previous config saved to /var/cache/conftool/dbconfig/20191114-065309-marostegui.json
[06:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:19] !log installing intel-microcode updates
[07:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:52] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[07:31:47] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[07:42:22] Operations, Community-Tech, serviceops, wikidiff2, Patch-For-Review: Deploy new version of wikidiff2 package - https://phabricator.wikimedia.org/T140443 (jijiki)
[07:42:51] Operations, MediaWiki-REST-API, serviceops, wikidiff2, and 2 others: Deploy latest version of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (jijiki)
[07:43:52] Operations, Community-Tech, wikidiff2, Patch-For-Review: Deploy new version of wikidiff2 package - https://phabricator.wikimedia.org/T140443 (jijiki)
[07:44:58] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[07:54:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1103:3312 T235599', diff saved to https://phabricator.wikimedia.org/P9631 and previous config saved to /var/cache/conftool/dbconfig/20191114-075449-marostegui.json
[07:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:55] T235599: Recompress special slaves across eqiad and codfw - https://phabricator.wikimedia.org/T235599
[08:00:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool a non partitioned slave db1089 on special groups for s1 T223151', diff saved to https://phabricator.wikimedia.org/P9632 and previous config saved to /var/cache/conftool/dbconfig/20191114-080038-marostegui.json
[08:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:44] T223151: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151
[08:18:15] (PS1) Marostegui: install_server: Reimage db213[3-5] [puppet] - https://gerrit.wikimedia.org/r/550801 (https://phabricator.wikimedia.org/T238183)
[08:20:39] (CR) Marostegui: [C: +2] install_server: Reimage db213[3-5] [puppet] - https://gerrit.wikimedia.org/r/550801 (https://phabricator.wikimedia.org/T238183) (owner: Marostegui)
[08:37:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1110 after schema change', diff saved to https://phabricator.wikimedia.org/P9633 and previous config saved to /var/cache/conftool/dbconfig/20191114-083729-marostegui.json
[08:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[08:38:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[08:38:52] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[08:38:52] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1082 for schema change', diff saved to https://phabricator.wikimedia.org/P9634 and previous config saved to /var/cache/conftool/dbconfig/20191114-084043-marostegui.json
[08:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:40:56] (CR) Filippo Giunchedi: "See inline, LGTM overall" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: Herron)
[08:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:30] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[08:41:30] !log Deploy schema change with replication on db1082, this will generate lag on s5 labs - T233135 T234066
[08:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:36] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[08:41:37] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066
[08:42:22] !log Remove ar_comment from triggers on db1124:3315 - T234704
[08:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:26] T234704: Remove ar_comment from sanitarium triggers - https://phabricator.wikimedia.org/T234704
[08:48:10] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:50:06] (CR)
Filippo Giunchedi: logstash: introduce logstash 7 and openjdk-11 support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: Herron)
[08:55:33] (PS1) Filippo Giunchedi: logstash: introduce restbase-specific index [puppet] - https://gerrit.wikimedia.org/r/550806 (https://phabricator.wikimedia.org/T238196)
[08:57:27] onimisionipe gehel I'm seeing a bunch of UNKNOWNs for "old jvm gc check" for elastic hosts, known/expected and/or can be silenced ?
[09:03:08] * onimisionipe checking
[09:04:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[09:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:54] !log Compare wikidatawiki.pagelinks between db1124:3318 and labsdb1010 - T233986
[09:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:00] (CR) Filippo Giunchedi: "PTAL, bandaid meant to be temporary until we can standardize on a common schema (CC'd restbase folks, although this change isn't impacting" [puppet] - https://gerrit.wikimedia.org/r/550806 (https://phabricator.wikimedia.org/T238196) (owner: Filippo Giunchedi)
[09:10:32] godog: expected. geh.el already downtimed some. I've downtimed the others
[09:10:33] thanks
[09:15:56] onimisionipe: sweet, thank you!
[09:19:21] (PS1) Marostegui: mariadb: Place db2133 into m2 [puppet] - https://gerrit.wikimedia.org/r/550810 (https://phabricator.wikimedia.org/T238183)
[09:20:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Full weight to db1089 on special groups for s1 T223151', diff saved to https://phabricator.wikimedia.org/P9635 and previous config saved to /var/cache/conftool/dbconfig/20191114-092006-marostegui.json
[09:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:13] (CR) Ema: [C: +2] ATS: Support X-W-D for /w/api.php as well [puppet] - https://gerrit.wikimedia.org/r/550774 (https://phabricator.wikimedia.org/T237687) (owner: BBlack)
[09:20:13] T223151: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151
[09:21:16] (CR) jerkins-bot: [V: -1] mariadb: Place db2133 into m2 [puppet] - https://gerrit.wikimedia.org/r/550810 (https://phabricator.wikimedia.org/T238183) (owner: Marostegui)
[09:22:19] (CR) Marostegui: [V: +2 C: +2] "Known issue that will be tackled with the misc refactoring" [puppet] - https://gerrit.wikimedia.org/r/550810 (https://phabricator.wikimedia.org/T238183) (owner: Marostegui)
[09:24:02] !log Stop mysql on db2067 to clone db2133 - T238183
[09:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:07] T238183: Productionize db213[2-5} - https://phabricator.wikimedia.org/T238183
[09:24:07] onimisionipe, godog: thanks! I was in interview
[09:24:44] onimisionipe: those alerts are about the new servers, it looks like there is an issue with those metrics not published to prometheus.
[09:25:44] !log installing ghostscript updates on thumbor1001
[09:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:46] I've checked on elastic1053 and the metrics seem to be exposed by the local prometheus exporter
[09:26:54] not sure what's going wrong
[09:28:58] gehel: I will check them later.
[09:28:59] (PS1) Ema: cache: reimage cp3058 as text_ats [puppet] - https://gerrit.wikimedia.org/r/550811 (https://phabricator.wikimedia.org/T227432)
[09:29:08] I guess it's not super critical for now
[09:29:10] onimisionipe: thanks!
[09:30:01] (CR) Vgutierrez: [C: +1] cache: reimage cp3058 as text_ats [puppet] - https://gerrit.wikimedia.org/r/550811 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[09:31:20] (PS2) Ema: cache: reimage cp3058 as text_ats [puppet] - https://gerrit.wikimedia.org/r/550811 (https://phabricator.wikimedia.org/T227432)
[09:31:41] !log Compare wikidatawiki.pagelinks between labsdb1011 and labsdb1010 - T233986
[09:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:00] !log depool cp3058 and reimage as text_ats T227432
[09:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:05] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432
[09:36:19] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3058.esams.wmnet'] ` The log can be found in `/var/log/wm...
[09:36:45] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Vgutierrez)
[09:37:35] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Vgutierrez)
[09:37:37] Operations, ops-eqiad, Traffic: cp1077 is unreachable - https://phabricator.wikimedia.org/T238289 (Vgutierrez)
[09:37:40] Operations, ops-esams, Traffic: cp3065 crashed - https://phabricator.wikimedia.org/T238032 (Vgutierrez)
[09:37:43] Operations, ops-esams, Traffic: cp3057 is unreachable - https://phabricator.wikimedia.org/T237348 (Vgutierrez)
[09:38:04] Operations, Analytics, Analytics-Kanban, SRE-Access-Requests: Add system user analytics-privatedata to the anaytics-privatedata-users group - https://phabricator.wikimedia.org/T238306 (elukey)
[09:38:34] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Vgutierrez) p:Triage→High
[09:41:49] PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[09:42:13] marostegui, jynus ^^ is that expected?
[09:43:13] PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={cp2001:9536,cp2004:9536,cp2006:9536,cp2007:9536,cp2010:9536,cp2012:9536,cp2013:9536,cp2016:9536,cp2019:9536,cp2023:9536} site=codfw tunnel={cp3058_v4,cp3058_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[09:43:15] not by me
[09:43:54] ^^ that's triggered by ema re-imaging cp3058
[09:44:05] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1079:9536,cp1081:9536,cp1083:9536,cp1085:9536,cp1087:9536,cp1089:9536} site=eqiad tunnel={cp3058_v4,cp3058_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[09:47:01] this ema person is really noisy :D
[09:47:07] always breaking things
[09:47:09] :D
[09:47:14] vgutierrez: yep, expected
[09:47:17] marostegui: <3
[09:50:04] (PS1) Elukey: admin: add analytics-privatedata system user [puppet] - https://gerrit.wikimedia.org/r/550814 (https://phabricator.wikimedia.org/T238306)
[09:55:32] !log depool wdqs (public) eqiad - high lag - T238229
[09:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:38] T238229: WDQS is having high update lag for the last week - https://phabricator.wikimedia.org/T238229
[09:56:30] Operations, Traffic: ats-tls shows spikes on H/2 recv settings bad param errors - https://phabricator.wikimedia.org/T238307 (Vgutierrez)
[09:57:11] !log ema@cumin1001 START - Cookbook sre.hosts.downtime
[09:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:29] Operations, Traffic: ats-tls shows spikes on H/2 recv settings bad param errors - https://phabricator.wikimedia.org/T238307 (Vgutierrez) from cp5007: ` vgutierrez@cp5007:~$ journalctl -u trafficserver-tls --since="7days ago" |grep "settings bad param" |cut -f1-2 -d' ' |uniq -c 21 Nov 07 55
Nov...
[09:58:31] Operations, Traffic: ats-tls shows spikes on H/2 recv settings bad param errors - https://phabricator.wikimedia.org/T238307 (Vgutierrez) It looks like our ATS build is missing https://github.com/apache/trafficserver/pull/5636
[09:59:17] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:47] elukey: not only that, I also forget merging patches before reimaging things
[09:59:58] (CR) Ema: [C: +2] cache: reimage cp3058 as text_ats [puppet] - https://gerrit.wikimedia.org/r/550811 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[10:00:00] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[10:02:06] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3058.esams.wmnet'] ` The log can be found in `/var/log/wm...
[10:03:32] !log Run deleteEqualMessages.php --delete for cswiki and viwiki [10:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:38] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer [10:05:40] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={cp2001:9536,cp2004:9536,cp2006:9536,cp2007:9536,cp2010:9536,cp2012:9536,cp2013:9536,cp2016:9536,cp2019:9536,cp2023:9536} site=codfw tunnel={cp3058_v4,cp3058_v6} Ema reimaging 3058 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [10:05:40] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1079:9536,cp1081:9536,cp1083:9536,cp1085:9536,cp1087:9536} site=eqiad tunnel={cp3058_v4,cp3058_v6} Ema reimaging 3058 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [10:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:05] (03PS1) 10Vgutierrez: Release 8.0.5-wm11 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/550815 (https://phabricator.wikimedia.org/T238307) [10:06:33] !log copying journal from wdqs1007 to wdqs1005 - T238232 [10:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:38] T238232: blazegraph journal on wdqs1005 is oversized - https://phabricator.wikimedia.org/T238232 [10:09:34] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:17:53] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.5-wm11 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/550815 (https://phabricator.wikimedia.org/T238307) (owner: 10Vgutierrez) [10:17:58] lovely [10:18:28] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10MoritzMuehlenhoff) I tried to narrow this down a bit, but no real luck: - These haven't seen a microcode update yet (and the previous microcode update round dates back quite a while). - All of th... [10:19:49] 10Operations, 10Traffic: debmonitor TLS termination - https://phabricator.wikimedia.org/T238200 (10ema) 05Open→03Resolved a:03ema TLS termination configured on port 7443: ` $ curl -v https://debmonitor.wikimedia.org:7443/login/ --resolve debmonitor.wikimedia.org:7443:10.64.32.62 2>&1 | grep '< HTTP' < HT... [10:20:57] !log netbox1001 bandaid/symlink /srv/deployment/netbox/deploy/src/netbox/project-static to 'static' [10:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:15] (03CR) 10Ema: [C: 03+1] Release 8.0.5-wm11 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/550815 (https://phabricator.wikimedia.org/T238307) (owner: 10Vgutierrez) [10:21:51] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] Release 8.0.5-wm11 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/550815 (https://phabricator.wikimedia.org/T238307) (owner: 10Vgutierrez) [10:22:44] (03CR) 10Jbond: logstash: introduce logstash 7 and openjdk-11 support (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [10:23:12] (03CR) 10Jbond: logstash: introduce logstash 7 and openjdk-11 support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [10:23:28] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [10:23:31] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:20] (03PS3) 10Elukey: role::dumps::distribution::server: add kerberos [puppet] - 10https://gerrit.wikimedia.org/r/550466 (https://phabricator.wikimedia.org/T234229) [10:24:23] (03PS1) 10Elukey: role::dumps::distribution::server: add analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/550816 (https://phabricator.wikimedia.org/T234229) [10:24:32] RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [10:25:32] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:27] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10MoritzMuehlenhoff) cp1077 might also be a totally different issue than cp3* (which are from a the same model/generation/ordering batch ; in kern.log on cp1077 there's two oopses from Nov 5, it's no... [10:28:44] RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [10:32:54] (03PS1) 10Effie Mouzeli: mediawiki: Remove HHVM references and includes [puppet] - 10https://gerrit.wikimedia.org/r/550818 (https://phabricator.wikimedia.org/T229792) [10:36:06] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:36:11] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3058.esams.wmnet'] ` and were **ALL** successful. 
[10:39:51] (03CR) 10Muehlenhoff: mediawiki: Remove HHVM references and includes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/550818 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [10:43:03] !log pool cp3058 with ATS backend T227432 [10:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:09] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [10:44:00] !log uploaded trafficserver-8.0.5-1wm11 to apt.wikimedia.org (stretch) - T238307 [10:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:05] T238307: ats-tls shows spikes on H/2 recv settings bad param errors - https://phabricator.wikimedia.org/T238307 [10:48:32] !log Rolling restart of ats-tls/ats-backend to upgrade to 8.0.5-1wm11 - T238307 [10:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:21] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (10Gehel) [10:59:37] RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191114T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. 
[11:06:05] o/ [11:08:49] (03CR) 10Jbond: "taken a first pass" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/549222 (owner: 10Cwhite) [11:13:59] (03PS1) 10Ema: cache: reimage cp3060 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/550822 (https://phabricator.wikimedia.org/T227432) [11:26:43] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [11:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:05] PROBLEM - WDQS high update lag on wdqs1007 is CRITICAL: 4763 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:28:56] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Peachey88) [11:32:12] 10Operations, 10ops-eqiad, 10Traffic: cp1077 is unreachable - https://phabricator.wikimedia.org/T238289 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez Tracking the issue on the parent task: T238305 [11:32:14] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Vgutierrez) [11:32:27] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Vgutierrez) [11:32:29] 10Operations, 10ops-esams, 10Traffic: cp3065 crashed - https://phabricator.wikimedia.org/T238032 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez Tracking the issue on the parent task: T238305 [11:39:46] 10Operations, 10User-jbond: Replicated ticket registry - https://phabricator.wikimedia.org/T233933 (10jbond) [11:39:50] 10Operations, 10Patch-For-Review, 10User-jbond: Cross data center setup for CAS - https://phabricator.wikimedia.org/T233931 (10jbond) [11:40:07] 10Operations, 10Traffic, 10Patch-For-Review: ats-tls shows spikes on H/2 recv settings bad param errors - https://phabricator.wikimedia.org/T238307 (10ema) 
p:05Triage→03Normal [11:40:09] 10Operations, 10User-jbond: Replicated ticket registry - https://phabricator.wikimedia.org/T233933 (10jbond) Looking at the [[ https://apereo.github.io/cas/6.0.x/installation/Ticket-Registry-Replication-Encryption.html | list of supported ticketing registries ]] we have the following options * Hazelcast *... [11:41:18] (03PS2) 10Jbond: build.gradle: add memcached support to cas blob [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/550682 (https://phabricator.wikimedia.org/T233931) [11:41:26] (03PS3) 10Jbond: apereo_cas: add ability to configure basic memcached support [puppet] - 10https://gerrit.wikimedia.org/r/550695 (https://phabricator.wikimedia.org/T233931) [11:43:09] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [11:48:33] PROBLEM - WDQS high update lag on wdqs1007 is CRITICAL: 3603 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:50:59] (03PS3) 10Arturo Borrero Gonzalez: toolforge: new k8s: add wmcs-k8s-get-cert.sh script [puppet] - 10https://gerrit.wikimedia.org/r/550673 (https://phabricator.wikimedia.org/T215553) [11:55:14] !log mobrovac@deploy1001 Started deploy [restbase/deploy@58cf5ae] (dev-cluster): Fix /metrics/mediarequests/top/ indentation [11:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:04] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@58cf5ae] (dev-cluster): Fix /metrics/mediarequests/top/ indentation (duration: 02m 50s) [11:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:39] (03PS1) 10Ema: ATS: further increase log_buffer_size and max_line_size [puppet] - 10https://gerrit.wikimedia.org/r/550825 (https://phabricator.wikimedia.org/T237608) [12:01:29] 
(03CR) 10Vgutierrez: [C: 03+1] ATS: further increase log_buffer_size and max_line_size [puppet] - 10https://gerrit.wikimedia.org/r/550825 (https://phabricator.wikimedia.org/T237608) (owner: 10Ema) [12:09:31] !log mobrovac@deploy1001 Started deploy [restbase/deploy@58cf5ae]: Fix /metrics/mediarequests/top/ indentation [12:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:09] RECOVERY - WDQS high update lag on wdqs1007 is OK: (C)3600 ge (W)1200 ge 1046 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [12:14:56] (03PS4) 10Ema: varnish/templates/text-frontend.inc.vcl.erb: Fix doc ref to renamed variable [puppet] - 10https://gerrit.wikimedia.org/r/514394 (owner: 10Jforrester) [12:17:08] (03CR) 10Ema: [C: 03+2] varnish/templates/text-frontend.inc.vcl.erb: Fix doc ref to renamed variable [puppet] - 10https://gerrit.wikimedia.org/r/514394 (owner: 10Jforrester) [12:17:19] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [12:24:22] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@58cf5ae]: Fix /metrics/mediarequests/top/ indentation (duration: 14m 52s) [12:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:43] (03CR) 10Ema: [C: 04-1] "The change itself looks good but I suspect it would break most VTC tests. 
Please update all `txreq -url [...]` calls adding -hdr "User-Age" [puppet] - 10https://gerrit.wikimedia.org/r/547792 (https://phabricator.wikimedia.org/T237134) (owner: 10CDanis) [12:31:40] (03PS1) 10BBlack: VCL: Move analytics hooks above beacon synth point [puppet] - 10https://gerrit.wikimedia.org/r/550826 [12:40:41] 10Operations, 10serviceops, 10PHP 7.2 support: Mysterious, coordinated slowdowns every ~ 25 minutes on API servers - https://phabricator.wikimedia.org/T231011 (10mobrovac) @Ladsgroup might be on the correct track here. The spikes neatly coincide with the [WD bot updates](https://eu.wikipedia.org/wiki/Berezi:... [12:42:49] 10Operations, 10serviceops, 10PHP 7.2 support: Mysterious, coordinated slowdowns every ~ 25 minutes on API servers - https://phabricator.wikimedia.org/T231011 (10Ladsgroup) >>! In T231011#5663421, @mobrovac wrote: > @Ladsgroup might be on the correct track here. The spikes neatly coincide with the [WD bot up... [12:44:39] _joe_ jynus marostegui effie: I'm planning to deploy a patch that reduces db reads and increases APC cache, is it okay for you? Where can I monitor things? T231011 T229407 T236681 [12:44:40] T231011: Mysterious, coordinated slowdowns every ~ 25 minutes on API servers - https://phabricator.wikimedia.org/T231011 [12:44:40] T236681: SqlEntityInfoBuilder reads and writes from the old term store regardless of the config - https://phabricator.wikimedia.org/T236681 [12:44:40] T229407: Spikes in DB traffic and rows/s reads when reading from new terms store - https://phabricator.wikimedia.org/T229407 [12:45:44] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Dzahn) Hi Fuzzy, for the 4th item i am adding @RStallman-legalteam . Rachel, could you work with Fuzzy to go through the volunteer NDA process please? @Fuzzy... 
[12:46:23] Amir1: there is an apc cache graph [12:46:24] https://grafana.wikimedia.org/d/GuHySj3mz/php7-transition?refresh=30s&orgId=1 [12:46:28] 10Operations, 10User-jbond: Upgrade CAS to 6.1.0 - https://phabricator.wikimedia.org/T236815 (10jbond) I managed to get 6.1 built however there seems to be an [[ https://github.com/apereo/cas-overlay-template/pull/39/files | issue with u2f support ]] [12:46:29] toward the bottom [12:46:44] a group* of graphs [12:46:59] effie: Also, where is the graph for T231011 coming from? [12:47:11] oh it's the same graph [12:47:12] nice [12:47:13] thanks [12:47:45] oh no, that graph does not have response times [12:47:52] hang on [12:48:38] ok you can check that [12:48:40] https://grafana.wikimedia.org/d/5E7tdiGWz/xxxx-effie?orgId=1&refresh=30s&panelId=20&fullscreen&from=now-12h&to=now [12:48:45] although it is a bit experimental [12:49:10] and lastly [12:49:11] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200 [12:55:25] what's the fingerprint of mwdebug1002?
I can't find it in https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints [12:55:41] (03PS3) 10Gehel: Maps: remove varnish URI sanitization for maps (now done in Kartotherian) [puppet] - 10https://gerrit.wikimedia.org/r/545723 (https://phabricator.wikimedia.org/T232817) [12:56:43] effie: Sorry to ping you again ^ [12:58:39] (03PS1) 10BBlack: Unified cert: undeploy digicert-2019 [puppet] - 10https://gerrit.wikimedia.org/r/550829 [13:00:13] (03PS2) 10BBlack: Unified cert: undeploy digicert-2019 [puppet] - 10https://gerrit.wikimedia.org/r/550829 [13:00:39] (03CR) 10Dzahn: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) (owner: 10MaxSem) [13:01:14] Amir1: oh dear, thanks for that [13:01:16] my bad [13:01:41] oh, no other debugs are there either [13:01:48] Amir1: you can get it from a bastion [13:02:01] I don't think I have access to a bastion [13:02:02] it is updated there [13:02:14] you can't access mwdebug without a bastion :p [13:02:27] (03CR) 10Dzahn: [C: 03+1] Gerrit: Increase defaultThreadPoolSize to 2 [puppet] - 10https://gerrit.wikimedia.org/r/550776 (owner: 10Paladox) [13:04:41] (03CR) 10BBlack: [C: 03+2] Unified cert: undeploy digicert-2019 [puppet] - 10https://gerrit.wikimedia.org/r/550829 (owner: 10BBlack) [13:06:51] !log removing digicert-2019 files from cache nodes - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/550829/ [13:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:31] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [13:09:12] I thought we are allowed to just pass through it in ssh [13:09:17] * Amir1 knows nothing about ssh [13:11:21] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:11:57] oh don't worry amir, ssh knows YOU! [13:28:56] I'm deploying it now-ish [13:29:00] testing in mwdebug1002 [13:30:55] mwdebug1002 actually doesn't work with the firefox add on [13:31:00] fyi [13:31:17] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [13:35:16] works fine, let's deploy and monitor [13:35:22] !log depool wdqs1004 to allow catching up on lag - T238229 [13:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:27] T238229: WDQS is having high update lag for the last week - https://phabricator.wikimedia.org/T238229 [13:36:53] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:37:19] !log ladsgroup@deploy1001 scap failed: average error rate on 7/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [13:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:48] the error seems related but why? 
[13:40:06] okay, it needs a follow up [13:42:50] (03PS1) 10Effie Mouzeli: debug: new module to add debug tools and -dbg packages [puppet] - 10https://gerrit.wikimedia.org/r/550833 (https://phabricator.wikimedia.org/T236048) [13:44:28] (03CR) 10jerkins-bot: [V: 04-1] debug: new module to add debug tools and -dbg packages [puppet] - 10https://gerrit.wikimedia.org/r/550833 (https://phabricator.wikimedia.org/T236048) (owner: 10Effie Mouzeli) [13:45:34] (03CR) 10Ema: [C: 03+2] Maps: remove varnish URI sanitization for maps (now done in Kartotherian) [puppet] - 10https://gerrit.wikimedia.org/r/545723 (https://phabricator.wikimedia.org/T232817) (owner: 10Gehel) [13:46:25] (03CR) 10Ema: [C: 03+1] VCL: Move analytics hooks above beacon synth point [puppet] - 10https://gerrit.wikimedia.org/r/550826 (owner: 10BBlack) [13:47:22] (03PS2) 10Effie Mouzeli: debug: new module to add debug tools and -dbg packages [puppet] - 10https://gerrit.wikimedia.org/r/550833 (https://phabricator.wikimedia.org/T236048) [13:49:01] (03CR) 10jerkins-bot: [V: 04-1] debug: new module to add debug tools and -dbg packages [puppet] - 10https://gerrit.wikimedia.org/r/550833 (https://phabricator.wikimedia.org/T236048) (owner: 10Effie Mouzeli) [13:50:32] I will upload the follow ups soon [13:53:49] (03PS3) 10Effie Mouzeli: debug: new module to add debug tools and -dbg packages [puppet] - 10https://gerrit.wikimedia.org/r/550833 (https://phabricator.wikimedia.org/T236048) [13:55:50] (03PS1) 10Ema: Add dummy globalsign-2019a-ecdsa-unified for pcc [labs/private] - 10https://gerrit.wikimedia.org/r/550835 (https://phabricator.wikimedia.org/T237650) [13:56:37] (03CR) 10Ema: [V: 03+2 C: 03+2] Add dummy globalsign-2019a-ecdsa-unified for pcc [labs/private] - 10https://gerrit.wikimedia.org/r/550835 (https://phabricator.wikimedia.org/T237650) (owner: 10Ema) [14:01:53] !log depool cp3060 and reimage as text_ats T227432 [14:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:58] 
T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [14:02:36] (03CR) 10Ema: [C: 03+2] cache: reimage cp3060 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/550822 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [14:04:07] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3060.esams.wmnet'] ` The log can be found in `/var/log/wm... [14:06:27] (03PS2) 10Filippo Giunchedi: rsyslog: setup temporary secure rsync for logs transfer [puppet] - 10https://gerrit.wikimedia.org/r/547245 (https://phabricator.wikimedia.org/T224564) [14:06:29] (03PS2) 10Filippo Giunchedi: hieradata: remove wezen from service [puppet] - 10https://gerrit.wikimedia.org/r/547246 (https://phabricator.wikimedia.org/T224564) [14:06:31] (03PS3) 10Filippo Giunchedi: Rename wezen to centrallog2001 [puppet] - 10https://gerrit.wikimedia.org/r/547247 (https://phabricator.wikimedia.org/T224564) [14:06:33] (03PS3) 10Filippo Giunchedi: hieradata: pool centrallog2001 in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/547248 (https://phabricator.wikimedia.org/T224564) [14:11:53] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1081:9536,cp1083:9536,cp1085:9536,cp1089:9536} site=eqiad tunnel={cp3060_v4,cp3060_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:12:37] PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={cp2001:9536,cp2004:9536,cp2007:9536,cp2010:9536,cp2012:9536,cp2023:9536} site=codfw tunnel={cp3060_v4,cp3060_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:12:43] ACKNOWLEDGEMENT - Aggregate 
IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={cp2001:9536,cp2004:9536,cp2007:9536,cp2010:9536,cp2012:9536,cp2023:9536} site=codfw tunnel={cp3060_v4,cp3060_v6} Ema reimaging 3060 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:12:43] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1081:9536,cp1083:9536,cp1085:9536,cp1089:9536} site=eqiad tunnel={cp3060_v4,cp3060_v6} Ema reimaging 3060 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:24:59] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [14:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:02] (03PS1) 10Alexandros Kosiaris: Namespaces for eventgate-logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/550840 (https://phabricator.wikimedia.org/T236386) [14:27:08] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:15] RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:30:57] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:34:19] RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:35:22] (03CR) 10Ottomata: [C: 03+1] "Thank you!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/550840 (https://phabricator.wikimedia.org/T236386) (owner: 10Alexandros Kosiaris) [14:35:30] So the error I'm getting is only possible if the files are being sync'd in wrong order, is it possible? [14:35:54] when you do extensions/Wikibase, it should sync all of them together I guess [14:37:19] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/549847 (https://phabricator.wikimedia.org/T237197) (owner: 10Alexandros Kosiaris) [14:37:35] (03CR) 10Alexandros Kosiaris: "started a fleet wide PCC just cause I am paranoid" [puppet] - 10https://gerrit.wikimedia.org/r/549847 (https://phabricator.wikimedia.org/T237197) (owner: 10Alexandros Kosiaris) [14:39:33] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3060.esams.wmnet'] ` and were **ALL** successful. [14:41:33] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:42:38] (03CR) 10BBlack: [C: 03+2] varnish: Update sec-warning message [puppet] - 10https://gerrit.wikimedia.org/r/550391 (https://phabricator.wikimedia.org/T238038) (owner: 10Vgutierrez) [14:44:35] (03PS1) 10Ema: Rename globalsign-2019a-ecdsa-unified to digicert-2019a [labs/private] - 10https://gerrit.wikimedia.org/r/550842 (https://phabricator.wikimedia.org/T237650) [14:44:38] (03PS1) 10Elukey: kerberos: allow kerberos-run-command to check any keytab [puppet] - 10https://gerrit.wikimedia.org/r/550843 [14:45:36] (03PS1) 10Anomie: Set MCR migration stage to NEW on group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550844 (https://phabricator.wikimedia.org/T198312) [14:46:25] (03CR) 10Anomie: [C: 03+2] "Deploying planned config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550844 (https://phabricator.wikimedia.org/T198312) (owner: 10Anomie) [14:46:27] (03CR) 10jerkins-bot: [V: 04-1] Set MCR migration stage to NEW on group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550844 (https://phabricator.wikimedia.org/T198312) (owner: 10Anomie) [14:46:45] (03CR) 10Elukey: [C: 03+2] kerberos: allow kerberos-run-command to check any keytab [puppet] - 10https://gerrit.wikimedia.org/r/550843 (owner: 10Elukey) [14:47:15] (03Merged) 10jenkins-bot: Set MCR migration stage to NEW on group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550844 (https://phabricator.wikimedia.org/T198312) (owner: 10Anomie) [14:47:40] bblack: o/ - ok to puppet-merge? [14:47:42] (03PS1) 10Alexandros Kosiaris: k8s: Add eventgate-logging-external stanzas [puppet] - 10https://gerrit.wikimedia.org/r/550845 (https://phabricator.wikimedia.org/T236386) [14:47:51] elukey: yes, sorry [14:48:16] np! 
Didn't want to merge varnish stuff without triple checking :) [14:48:18] merging [14:48:43] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Set MCR migration stage to NEW on group1 for T198312 (duration: 00m 53s) [14:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:49] T198312: Set the WMF cluster to use the new MCR-only schema - https://phabricator.wikimedia.org/T198312 [14:48:57] huh akosiaris what does that do? [14:49:01] (03PS2) 10Ema: Add dummy digicert-2019a keys [labs/private] - 10https://gerrit.wikimedia.org/r/550842 (https://phabricator.wikimedia.org/T237650) [14:49:15] (03CR) 10CDanis: [C: 03+1] rsyslog: setup temporary secure rsync for logs transfer [puppet] - 10https://gerrit.wikimedia.org/r/547245 (https://phabricator.wikimedia.org/T224564) (owner: 10Filippo Giunchedi) [14:49:51] ottomata: creates the kubeconfig file. Which has a secret token in it [14:50:06] aka, I 'll push a puppet private patch right now [14:50:47] PROBLEM - Check systemd state on cp3060 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:50:49] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3060 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - fifo-log-demux not reading from pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:51:41] (03CR) 10BBlack: [C: 03+2] Add dummy digicert-2019a keys [labs/private] - 10https://gerrit.wikimedia.org/r/550842 (https://phabricator.wikimedia.org/T237650) (owner: 10Ema) [14:51:44] (03CR) 10BBlack: [V: 03+2 C: 03+2] Add dummy digicert-2019a keys [labs/private] - 10https://gerrit.wikimedia.org/r/550842 (https://phabricator.wikimedia.org/T237650) (owner: 10Ema) [14:51:45] any scap master around? 
[14:51:52] the cp3060 alerts should clear soon [14:51:53] RECOVERY - Check systemd state on cp3060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:51:53] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp3060 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:51:58] there you go! [14:52:11] esams done? [14:52:16] ahhh [14:52:17] k [14:52:35] seem so! \o/ [14:53:10] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/MachineVision: Fix bug when when looking up entity for an unknown ID (duration: 00m 53s) [14:53:11] bblack: ats-be conversion you mean? 2 hosts to go [14:53:13] ottomata: btw end of next quarter we need to have gotten rid of jessie hosts, aka scb, aka we need eventstreams out of it and into kubernetes. Aka, I just gave you an ORK ;-) [14:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:05] ema: oh right, 62 and 64 I guess [14:54:19] 60 sounds like a nice round number that should be at the end! :) [14:54:23] :) [14:54:28] !log pool cp3060 with ATS backend T227432 [14:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:32] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [14:54:53] (03CR) 10Hashar: [C: 03+1] "I guess it is not going to hurt ;)" [puppet] - 10https://gerrit.wikimedia.org/r/550776 (owner: 10Paladox) [14:55:44] (03CR) 10RLazarus: [C: 03+2] Install httpbb on cluster-management hosts. [puppet] - 10https://gerrit.wikimedia.org/r/550750 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [14:55:47] \o/ -o- /o\ \/o\/ ... 
[14:55:58] my cheering arms are now confused and waiting for the moment to re-raise :P [14:59:26] haha i have heard, thank you for the reminder [14:59:42] i'll put that on my todo list so i remember...i might be able to find some time this q to start [14:59:53] jouncebot: next [14:59:54] In 1 hour(s) and 0 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191114T1600) [15:00:21] ok, puppet swat empty [15:03:29] ema, akosiaris: just went to puppet-merge and saw your changes as well, okay to proceed or should I wait? [15:05:04] rlazarus: +1 [15:05:49] (03PS3) 10Gilles: Document Apache gzip sidestepping [puppet] - 10https://gerrit.wikimedia.org/r/539842 (https://phabricator.wikimedia.org/T232615) [15:05:57] rlazarus: go on, I got distracted by a conf session, feel free to merge yours [15:06:06] thanks, merging [15:06:48] (03PS1) 10Ema: cache: reimage cp3062 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/550849 (https://phabricator.wikimedia.org/T227432) [15:06:50] (03PS1) 10Ema: cache: reimage cp3064 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/550850 (https://phabricator.wikimedia.org/T227432) [15:09:29] akosiaris: thcipriani: I'm deploying a directory with sync-file and it fails because of errors. Looking at the error log, the only explanation is that for about five seconds not all of the files are deployed and only some got deployed (until the next files get sync'ed). Can this happen with scap sync-file? Can I move forward with --force? [15:11:53] (03PS1) 10Gilles: New SSH key for new laptop [puppet] - 10https://gerrit.wikimedia.org/r/550852 [15:14:43] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:18:56] (03PS1) 10Vgutierrez: vcl: Use synthetic warning for 1% of TLSv1/TLSv1.1 pageviews [puppet] - 10https://gerrit.wikimedia.org/r/550856 (https://phabricator.wikimedia.org/T238038) [15:21:01] (03PS13) 10Herron: logstash: introduce logstash 7 and openjdk-11 support [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) [15:25:30] (03CR) 10Herron: logstash: introduce logstash 7 and openjdk-11 support (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [15:36:33] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.138 ge 1 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:36:35] (03PS1) 10RLazarus: Code against python3-attr 16.3.0-1, since that's what we have in stretch. [software/httpbb] - 10https://gerrit.wikimedia.org/r/550864 (https://phabricator.wikimedia.org/T236699) [15:38:07] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/547245 (https://phabricator.wikimedia.org/T224564) (owner: 10Filippo Giunchedi) [15:39:03] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/547246 (https://phabricator.wikimedia.org/T224564) (owner: 10Filippo Giunchedi) [15:40:19] (03CR) 10CDanis: [C: 03+1] Code against python3-attr 16.3.0-1, since that's what we have in stretch. 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/550864 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [15:40:57] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)1 ge (W)0.2 ge 0.01667 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:41:07] I'm looking into the indexing errors btw [15:42:15] (03PS1) 10Ema: ATS: network settings for ats-be [puppet] - 10https://gerrit.wikimedia.org/r/550866 (https://phabricator.wikimedia.org/T227432) [15:43:29] (03CR) 10RLazarus: [C: 03+2] Code against python3-attr 16.3.0-1, since that's what we have in stretch. [software/httpbb] - 10https://gerrit.wikimedia.org/r/550864 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [15:44:51] (03CR) 10RLazarus: [C: 03+1] "> It has not been merged since July and is a clean-up task. A week or more should not matter." [puppet] - 10https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) (owner: 10MaxSem) [15:44:57] (03Merged) 10jenkins-bot: Code against python3-attr 16.3.0-1, since that's what we have in stretch. [software/httpbb] - 10https://gerrit.wikimedia.org/r/550864 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [15:45:10] 10Operations, 10Wikimedia-Apache-configuration, 10serviceops, 10Patch-For-Review: Build a black-box httpd testing framework - https://phabricator.wikimedia.org/T236699 (10crusnov) We use `scap` to deploy Netbox, a possible use-case would be to run httpbb as the last step to verify that apache is configured... [15:47:55] 10Operations, 10Wikimedia-Apache-configuration, 10serviceops, 10Patch-For-Review: Build a black-box httpd testing framework - https://phabricator.wikimedia.org/T236699 (10crusnov) >>! In T236699#5664011, @crusnov wrote: > We use `scap` to deploy Netbox, a possible use-case would be to run httpbb as the las... 
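The Logstash alert above prints its numbers as `1.138 ge 1` when CRITICAL and `(C)1 ge (W)0.2 ge 0.01667` when OK. A minimal sketch of how a two-threshold check like this could classify a value, assuming `ge` means greater-than-or-equal and reusing the critical (1) and warning (0.2) thresholds shown in the alert text. `classify` is a hypothetical helper for illustration, not the actual Icinga plugin:

```python
def classify(value: float, warning: float = 0.2, critical: float = 1.0) -> str:
    """Map a metric value to an alert state using two thresholds.

    Assumption: a value at or above a threshold trips that state,
    matching the "ge" wording in the alert messages.
    """
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


# Values taken from the PROBLEM and RECOVERY messages in the log.
print(classify(1.138))    # at or above the critical threshold
print(classify(0.01667))  # below both thresholds
```

Under these assumptions the 15:36:33 reading lands in CRITICAL and the 15:40:57 reading in OK, which agrees with what the bot reported.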
[15:51:08] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts: ` ['cloudcephosd1002.wikimedia.org'] ` The log can be fo... [15:52:03] 10Operations, 10User-jbond: Wikimedia theme for SSO login page - https://phabricator.wikimedia.org/T233939 (10jbond) When we do this it may be worth considering overriding `/tmp/logs` in `WEB-INF/classes/log4j2.xml`. We do override this in `/etc/cas/config/log4j2.xml` howev... [15:52:12] Amir1: I have no idea ... [16:00:04] godog and _joe_: Time to snap out of that daydream and deploy Puppet SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191114T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:52] (03PS2) 10Vgutierrez: vcl: Use synthetic warning for 1% of TLSv1/TLSv1.1 pageviews [puppet] - 10https://gerrit.wikimedia.org/r/550856 (https://phabricator.wikimedia.org/T238038) [16:00:54] (03PS1) 10Vgutierrez: vcl: Bump TLSv1/TLSv1.1 pageview replacement to 4% [puppet] - 10https://gerrit.wikimedia.org/r/550868 (https://phabricator.wikimedia.org/T238038) [16:00:56] (03PS1) 10Vgutierrez: vcl: Bump TLSv1/TLSv1.1 pageview replacement to 10% [puppet] - 10https://gerrit.wikimedia.org/r/550869 (https://phabricator.wikimedia.org/T238038) [16:00:58] (03PS1) 10Vgutierrez: vcl: Bump TLSv1/TLSv1.1 pageview replacement to 100% [puppet] - 10https://gerrit.wikimedia.org/r/550870 (https://phabricator.wikimedia.org/T238038) [16:04:20] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/Wikibase: [[gerrit:550828|Put a layer of APC cache on top of reading wb_terms in SqlEntityInfoBuilder]] (T231011 T229407 T236681), Try II (duration: 00m 56s) [16:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:28] T231011: Mysterious, 
coordinated slowdowns every ~ 25 minutes on API servers - https://phabricator.wikimedia.org/T231011 [16:04:29] T236681: SqlEntityInfoBuilder reads and writes from the old term store regardless of the config - https://phabricator.wikimedia.org/T236681 [16:04:29] T229407: Spikes in DB traffic and rows/s reads when reading from new terms store - https://phabricator.wikimedia.org/T229407 [16:05:09] the indexing failures were due to T238344 [16:05:10] T238344: MediaWiki Math invalid JSON in logs on Restbase server error - https://phabricator.wikimedia.org/T238344 [16:06:01] !log scandium - upgrading PHP version to 7.2.24 (fyi, @subbu T228069) (T237239) [16:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:07] T228069: Deploy Parsoid-PHP with Mediawiki to scandium for RT and performance testing - https://phabricator.wikimedia.org/T228069 [16:06:08] T237239: Upgrade to PHP 7.2.24 - https://phabricator.wikimedia.org/T237239 [16:06:11] okay, this is live, this will change things, read on database, size of apc, app servers, etc. [16:06:11] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:06:39] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1002.wikimedia.org'] ` Of which those **FAILED**: ` ['cloudcephosd1002.wikimedia.org'] ` [16:07:18] ^ this is known [16:07:33] thanks Amir1, was about to ask [16:07:34] the page is fine, it's just sync going crazy [16:07:35] thanks Amir1 [16:07:39] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:07:47] if it continues, I revert [16:07:57] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts: ` ['cloudcephosd1002.wikimedia.org'] ` The log can be fo... [16:07:59] https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&from=now-5m&to=now [16:08:01] yup [16:08:15] sync being stupid [16:08:48] Amir1: totally unrelated but pages (i.e. SMS to phones) will show up with #page in the alert text. Just making sure we're on ... wait for it .. the same page [16:09:07] !log phab2001 - upgrading PHP version to 7.2.24 (T237239) [16:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:16] godog: well played [16:09:31] elukey: caruso_sunglasses.flv [16:09:38] ahahhaahhahaha [16:10:14] 😎 [16:12:35] (03CR) 10Filippo Giunchedi: "LGTM, see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [16:12:50] I'm wondering how many people use # page as highlight word [16:13:05] and if everybody in SRE should, as a "soft page" [16:13:33] I do :-) [16:13:47] I do as well [16:13:58] hah, apologies for the misfire (I don't) [16:14:00] I've thought about limiting it to usages by icinga-wm but decided against because people have used it for manual-pages before [16:14:09] (which we should be able to do much better, but) [16:14:29] I do [16:15:44] if there's consensus then yeah definitely we should add that to SRE onboarding [16:15:51] topic for next monday's meeting? 
[16:16:45] +1, sounds good [16:19:10] (03PS1) 10Jbond: apereo_cas: update systemd to run as a system user [puppet] - 10https://gerrit.wikimedia.org/r/550872 [16:21:22] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1002.wikimedia.org'] ` Of which those **FAILED**: ` ['cloudcephosd1002.wikimedia.org'] ` [16:22:03] (03CR) 10Ottomata: admin: add analytics-privatedata system user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/550814 (https://phabricator.wikimedia.org/T238306) (owner: 10Elukey) [16:22:25] (03CR) 10Dzahn: [C: 03+2] wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - 10https://gerrit.wikimedia.org/r/522567 (owner: 10Dzahn) [16:22:32] (03PS12) 10Dzahn: wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - 10https://gerrit.wikimedia.org/r/522567 [16:23:41] (03CR) 10Elukey: admin: add analytics-privatedata system user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/550814 (https://phabricator.wikimedia.org/T238306) (owner: 10Elukey) [16:27:50] (03PS1) 10Jbond: profile::homer: throwa warning if no homer::peers are returned [puppet] - 10https://gerrit.wikimedia.org/r/550874 [16:29:51] (03PS2) 10Elukey: admin: add analytics-privatedata system user [puppet] - 10https://gerrit.wikimedia.org/r/550814 (https://phabricator.wikimedia.org/T238306) [16:31:01] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10RobH) I was a bit surprised when I got tagged into this task, and indeed, the history shows I documented it when on clinic duty at some point, but I do not real... 
[16:32:31] (03PS2) 10Jbond: profile::homer: throwa warning if no homer::peers are returned [puppet] - 10https://gerrit.wikimedia.org/r/550874 [16:33:23] (03PS4) 10Cwhite: puppetmaster,icinga: naggen2 cleanup and update to python3 [puppet] - 10https://gerrit.wikimedia.org/r/549222 [16:35:00] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster,icinga: naggen2 cleanup and update to python3 [puppet] - 10https://gerrit.wikimedia.org/r/549222 (owner: 10Cwhite) [16:36:04] (03CR) 10Jbond: [C: 03+2] profile::homer: throwa warning if no homer::peers are returned [puppet] - 10https://gerrit.wikimedia.org/r/550874 (owner: 10Jbond) [16:36:22] (03CR) 10Cwhite: puppetmaster,icinga: naggen2 cleanup and update to python3 (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/549222 (owner: 10Cwhite) [16:40:18] 10Operations, 10SRE-Access-Requests: Add new SSH key for production access - https://phabricator.wikimedia.org/T238347 (10Gilles) [16:41:05] (03PS2) 10Gilles: New SSH key for new laptop [puppet] - 10https://gerrit.wikimedia.org/r/550852 (https://phabricator.wikimedia.org/T238347) [16:41:43] 10Operations, 10SRE-Access-Requests: Add new SSH key for production access - https://phabricator.wikimedia.org/T238347 (10Gilles) Corresponding patch: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/550852/ [16:44:03] (03CR) 10jerkins-bot: [V: 04-1] New SSH key for new laptop [puppet] - 10https://gerrit.wikimedia.org/r/550852 (https://phabricator.wikimedia.org/T238347) (owner: 10Gilles) [16:45:13] RECOVERY - Check the Netbox report puppetdb for fail status. on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [16:45:20] yay :{) [16:45:54] (03PS1) 10RLazarus: httpbb: Avoid using flow mappings with URL keys. [puppet] - 10https://gerrit.wikimedia.org/r/550877 [16:47:26] (03CR) 10CDanis: [C: 03+1] httpbb: Avoid using flow mappings with URL keys. 
[puppet] - 10https://gerrit.wikimedia.org/r/550877 (owner: 10RLazarus) [16:47:42] 10Operations, 10Traffic: Renew and deploy GlobalSign unified cert (2019) - https://phabricator.wikimedia.org/T237650 (10RobH) [16:50:08] (03PS2) 10RLazarus: httpbb: Avoid using flow mappings with URL keys. [puppet] - 10https://gerrit.wikimedia.org/r/550877 (https://phabricator.wikimedia.org/T236699) [16:50:45] (03PS3) 10Gilles: New SSH key for new laptop [puppet] - 10https://gerrit.wikimedia.org/r/550852 (https://phabricator.wikimedia.org/T238347) [16:54:27] (03CR) 10RLazarus: [C: 03+2] httpbb: Avoid using flow mappings with URL keys. [puppet] - 10https://gerrit.wikimedia.org/r/550877 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [16:54:47] gilles: if you can post the same pubkey to the phab task and use 'Sign with MFA' from the actions dropdown above the comment box, I'll +2 and merge [16:55:04] (also, yay more people keeping ssh creds on yubikeys) [16:58:19] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Add new SSH key for production access - https://phabricator.wikimedia.org/T238347 (10Gilles) New public key: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC6rWxdWJ+Yh8nbepD73tlikm1vdCGkPei2Sy2imlxG2G0DY/FaGreg5PyRqa56YK7KVs6cfoSQ30wO5/Gx2emZpKRz9mlWq4u37Z0/DC... [16:59:08] cdanis: done. I learned about that being possible from a comment of yours on IRC. Thanks to 2 order mistakes from OIT I have 2 spare yubikeys I will use to convert 2 more people from EngProd to that at our offsite next week :) [16:59:21] also, TIL about "signing" comments on phab [17:00:04] cscott, arlolra, subbu, halfak, and accraze: Your horoscope predicts another unfortunate Services – Graphoid / Parsoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191114T1700). 
[17:00:44] \o/ [17:00:57] yeah, it's neat functionality, we should use it more for things like this [17:01:21] (03CR) 10CDanis: [C: 03+2] New SSH key for new laptop [puppet] - 10https://gerrit.wikimedia.org/r/550852 (https://phabricator.wikimedia.org/T238347) (owner: 10Gilles) [17:04:11] (03CR) 10Alexandros Kosiaris: [C: 03+2] Namespaces for eventgate-logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/550840 (https://phabricator.wikimedia.org/T236386) (owner: 10Alexandros Kosiaris) [17:04:24] (03Merged) 10jenkins-bot: Namespaces for eventgate-logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/550840 (https://phabricator.wikimedia.org/T236386) (owner: 10Alexandros Kosiaris) [17:04:31] gilles: merged and ran puppet on all bastions [17:05:09] (03PS1) 10Andrew Bogott: Depool labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/550878 (https://phabricator.wikimedia.org/T237509) [17:05:11] (03PS1) 10Andrew Bogott: Revert "Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/550879 (https://phabricator.wikimedia.org/T237509) [17:05:13] (03PS1) 10Andrew Bogott: Depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/550880 (https://phabricator.wikimedia.org/T237509) [17:05:18] (03PS1) 10Andrew Bogott: Revert "Depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/550881 (https://phabricator.wikimedia.org/T237509) [17:05:20] (03PS1) 10Andrew Bogott: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/550882 (https://phabricator.wikimedia.org/T237509) [17:05:22] (03PS1) 10Andrew Bogott: Revert "Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/550883 (https://phabricator.wikimedia.org/T237509) [17:07:43] (03CR) 10Jbond: "lgtm but a few more nits i missed on first pass" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [17:08:09] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Add new SSH key for 
production access - https://phabricator.wikimedia.org/T238347 (10CDanis) 05Open→03Resolved a:03CDanis [17:09:20] 10Operations, 10ops-codfw, 10decommission: Decommission db2048.codfw.wmnet - https://phabricator.wikimedia.org/T237913 (10Papaul) ` papaul@asw-a-codfw# show | compare [edit interfaces interface-range vlan-private1-a-codfw] - member ge-1/0/0; [edit interfaces interface-range disabled] member ge-6/0/1... [17:09:47] 10Operations, 10ops-codfw, 10decommission: Decommission db2048.codfw.wmnet - https://phabricator.wikimedia.org/T237913 (10Papaul) [17:11:50] (03CR) 10Alexandros Kosiaris: [C: 03+2] k8s: Add eventgate-logging-external stanzas [puppet] - 10https://gerrit.wikimedia.org/r/550845 (https://phabricator.wikimedia.org/T236386) (owner: 10Alexandros Kosiaris) [17:17:00] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [17:17:04] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . [17:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:07] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [17:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:14] 10Operations, 10DC-Ops, 10decommission, 10fundraising-tech-ops: decommission alnilam.frack.codfw.wmnet - https://phabricator.wikimedia.org/T238233 (10Papaul) ` papaul@fasw-c-codfw# show | compare [edit interfaces interface-range disabled] member "ge-[0-1]/0/8" { ... } + member "ge-[0-1]/0/9"; [edi... [17:21:48] (03CR) 10Jbond: "hi moritz i have tested this and everything works fine accept binding to port 443. 
I have added `CapabilityBoundingSet=CAP_NET_BIND_SERVI" [puppet] - 10https://gerrit.wikimedia.org/r/550872 (owner: 10Jbond) [17:22:42] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [17:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:46] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . [17:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:49] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [17:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:22] (03PS2) 10Jbond: apereo_cas: update systemd to run as a system user [puppet] - 10https://gerrit.wikimedia.org/r/550872 (https://phabricator.wikimedia.org/T233951) [17:26:11] (03PS3) 10Jbond: apereo_cas: update systemd to run as a system user [puppet] - 10https://gerrit.wikimedia.org/r/550872 [17:27:11] (03PS4) 10Jbond: apereo_cas: update systemd to run as a system user [puppet] - 10https://gerrit.wikimedia.org/r/550872 (https://phabricator.wikimedia.org/T233951) [17:33:20] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.25 ge 1 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [17:33:34] (03CR) 10Muehlenhoff: apereo_cas: update systemd to run as a system user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/550872 (https://phabricator.wikimedia.org/T233951) (owner: 10Jbond) [17:33:52] (03PS2) 10Andrew Bogott: Depool labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/550878 (https://phabricator.wikimedia.org/T237509) [17:33:54] (03PS2) 10Andrew Bogott: Revert "Depool labsdb1009" [puppet] - 
10https://gerrit.wikimedia.org/r/550879 (https://phabricator.wikimedia.org/T237509) [17:34:01] (03PS2) 10Andrew Bogott: Depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/550880 (https://phabricator.wikimedia.org/T237509) [17:34:01] (03PS2) 10Andrew Bogott: Revert "Depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/550881 (https://phabricator.wikimedia.org/T237509) [17:34:01] (03PS2) 10Andrew Bogott: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/550882 (https://phabricator.wikimedia.org/T237509) [17:34:05] (03PS2) 10Andrew Bogott: Revert "Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/550883 (https://phabricator.wikimedia.org/T237509) [17:34:11] (03PS1) 10Andrew Bogott: Remove 'globalblocks' table from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/550888 (https://phabricator.wikimedia.org/T237509) [17:35:35] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [17:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:40] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . [17:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:44] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [17:35:47] (03CR) 10Cwhite: "this looks like a good move forward. see inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [17:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:45] mutante, thanks. 
[17:38:26] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)1 ge (W)0.2 ge 0.03333 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [17:39:39] subbu: welcome :) [17:41:09] ottomata: I think you are good to go on deploying eventgate-logging-external [17:42:00] (03CR) 10Cwhite: [C: 03+1] "splitting indexes on schema or volume is better than splitting on application. I'm ok with this as a temporary workaround though." [puppet] - 10https://gerrit.wikimedia.org/r/550806 (https://phabricator.wikimedia.org/T238196) (owner: 10Filippo Giunchedi) [17:43:20] (03CR) 10Jhedden: [C: 03+1] "LGTM, based on the information in the ticket that table no longer exists" [puppet] - 10https://gerrit.wikimedia.org/r/550888 (https://phabricator.wikimedia.org/T237509) (owner: 10Andrew Bogott) [17:43:45] (03PS5) 10Jbond: apereo_cas: update systemd to run as a system user [puppet] - 10https://gerrit.wikimedia.org/r/550872 (https://phabricator.wikimedia.org/T233951) [17:44:21] (03PS5) 10CDanis: Split out DB-related concerns for real and test wikitechs into s10/s11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547596 (https://phabricator.wikimedia.org/T233236) (owner: 10Jforrester) [17:44:42] (03PS1) 10CDanis: dbctl schemata changes for labswiki migration [puppet] - 10https://gerrit.wikimedia.org/r/550889 (https://phabricator.wikimedia.org/T233236) [17:45:14] (03CR) 10Cwhite: [C: 03+1] "LGTM -- looking forward to the next patchset that uses tox with py3" [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [17:45:27] (03CR) 10Jbond: apereo_cas: update systemd to run as a system user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/550872 (https://phabricator.wikimedia.org/T233951) (owner: 10Jbond) [17:46:20] !log running dell epsa tool on cp3056 per T236497 [17:46:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:25] T236497: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 [17:50:04] (03PS6) 10Andrew Bogott: Split out DB-related concerns for real and test wikitechs into s10/s11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547596 (https://phabricator.wikimedia.org/T233236) (owner: 10Jforrester) [17:50:06] (03PS3) 10Andrew Bogott: Follow-up 0f90f506: Leave labtestwiki in the wikitech dblist for config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547597 (https://phabricator.wikimedia.org/T233236) (owner: 10Jforrester) [17:52:40] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10RobH) Please note part of the ePSA tool is checking the SEL. So the SEL has to be cleared before running the test. @bblack let me know this server had issues with the storage ssd needing r... [17:53:29] (03PS3) 10Dzahn: Gerrit: Increase defaultThreadPoolSize to 2 [puppet] - 10https://gerrit.wikimedia.org/r/550776 (owner: 10Paladox) [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Morning SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191114T1800). [18:00:04] andrewbogott: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:01:33] * andrewbogott is here but cdanis is helping with this and not ready for a few minutes yet [18:02:57] going to think out loud about how this has to be done [18:03:28] first dbctl has to recognize the new section names as being allowed, that should be handled by https://gerrit.wikimedia.org/r/c/operations/puppet/+/550889 [18:03:59] MaxSem, RoanKattouw, Niharika, Urbanecm, is one of you here for swat? 
[18:04:14] which will need to be merged and applied on the cumin hosts, then we can do a `dbctl config diff` to make sure it doesn't have any unintended effects [18:04:40] I can be here in about 15 minutes [18:04:59] Normally I'm available but the SWAT time moved [18:05:12] (or rather, our clocks moved and the SWAT time didn't) [18:05:17] then I'll create the new sections s10 and s11 in dbctl, and modify their current instances to be pooled in both places [18:05:28] RoanKattouw: I was wondering about that :) [18:05:30] and then I think we can merge and scap https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/547596 [18:07:24] (03CR) 10Andrew Bogott: [C: 03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/550889 (https://phabricator.wikimedia.org/T233236) (owner: 10CDanis) [18:07:25] I am going to begin with the prep work I just outlined [18:07:36] ok [18:07:56] (03PS2) 10CDanis: dbctl schemata changes for labswiki migration [puppet] - 10https://gerrit.wikimedia.org/r/550889 (https://phabricator.wikimedia.org/T233236) [18:08:28] (03CR) 10CDanis: [C: 03+2] dbctl schemata changes for labswiki migration [puppet] - 10https://gerrit.wikimedia.org/r/550889 (https://phabricator.wikimedia.org/T233236) (owner: 10CDanis) [18:09:22] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10RobH) No errors in quick test, full testing is in progress. 
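The plan cdanis thinks through above is an ordered sequence where each step depends on the previous one. A minimal sketch of that ordering as data, with a check that no step runs before its prerequisite. The step names are hypothetical labels paraphrasing the chat, and nothing here invokes dbctl, puppet, or scap:

```python
# Hypothetical step labels paraphrasing the plan in the chat.
# Each entry is (step, list of steps that must already be done).
PLAN = [
    ("merge_puppet_550889", []),                        # dbctl learns the new section names
    ("apply_on_cumin_hosts", ["merge_puppet_550889"]),
    ("dbctl_config_diff", ["apply_on_cumin_hosts"]),    # confirm no unintended effects
    ("create_new_sections", ["dbctl_config_diff"]),     # pool current instances in both places
    ("merge_and_scap_547596", ["create_new_sections"]), # mediawiki-config change goes last
]


def check_order(plan):
    """Return the step names in order, or raise if a prerequisite is missing."""
    done = set()
    for name, needs in plan:
        missing = [n for n in needs if n not in done]
        if missing:
            raise ValueError(f"{name} scheduled before {missing}")
        done.add(name)
    return [name for name, _ in plan]


print(check_order(PLAN))
```

Writing the order down this way mostly serves as a checklist: the MediaWiki config change is only safe once dbctl already knows the section names it refers to.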
[18:09:43] heh, waiting on CI because I edited the commit message 🙃 [18:12:06] (03PS4) 10Dzahn: Gerrit: Increase defaultThreadPoolSize to 2 [puppet] - 10https://gerrit.wikimedia.org/r/550776 (owner: 10Paladox) [18:15:40] (03CR) 10Dzahn: [C: 03+2] Gerrit: Increase defaultThreadPoolSize to 2 [puppet] - 10https://gerrit.wikimedia.org/r/550776 (owner: 10Paladox) [18:16:32] (03PS10) 10Jcrespo: bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - 10https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048) [18:16:32] (03PS1) 10Jcrespo: bacula: Fix calculation of success rate function on bacula check [puppet] - 10https://gerrit.wikimedia.org/r/550898 [18:17:33] !log cdanis@cumin2001 dbctl commit (dc=all): 'alias wikitech section to new s10 section T233236', diff saved to https://phabricator.wikimedia.org/P9638 and previous config saved to /var/cache/conftool/dbconfig/20191114-181732-cdanis.json [18:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:40] T233236: Move labtestwikitech database to clouddb2001-dev - https://phabricator.wikimedia.org/T233236 [18:19:10] (03PS11) 10Jcrespo: bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - 10https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048) [18:19:19] okay, the dbctl prerequisites are done -- 's10' now exists as an alias of the other. 's11' does not exist in dbctl, but I'm not convinced that it should? 
[18:20:19] I think it shouldn't since we're keeping the hack as far as I know [18:20:37] cdanis: ok, let's wait for RoanKattouw to arrive and then we can make the config changes [18:22:08] (03CR) 10jerkins-bot: [V: 04-1] bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - 10https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [18:23:15] OK I'm here [18:23:30] jouncebot: now [18:23:30] For the next 0 hour(s) and 36 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191114T1800) [18:23:48] So https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/547596 first? [18:24:00] (03CR) 10CDanis: [C: 03+1] "The etcd/dbctl prerequisite work is completed: https://phabricator.wikimedia.org/P9638" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547596 (https://phabricator.wikimedia.org/T233236) (owner: 10Jforrester) [18:24:13] does NOT restart gerrit because you are in SWAT [18:24:22] Thanks mutante [18:24:34] RoanKattouw: yes, I think cdanis is ready [18:24:37] +1 [18:24:52] OK and it's 547596 first, and 547597 later? [18:25:03] RoanKattouw: They can go together... 
[18:25:10] labtestwiki is low stakes [18:25:33] (It has ~one user, me) [18:26:56] (03CR) 10Catrope: [C: 03+2] Split out DB-related concerns for real and test wikitechs into s10/s11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547596 (https://phabricator.wikimedia.org/T233236) (owner: 10Jforrester) [18:27:52] (03CR) 10Catrope: [C: 03+2] Follow-up 0f90f506: Leave labtestwiki in the wikitech dblist for config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547597 (https://phabricator.wikimedia.org/T233236) (owner: 10Jforrester) [18:28:51] (03Merged) 10jenkins-bot: Split out DB-related concerns for real and test wikitechs into s10/s11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547596 (https://phabricator.wikimedia.org/T233236) (owner: 10Jforrester) [18:28:53] (03Merged) 10jenkins-bot: Follow-up 0f90f506: Leave labtestwiki in the wikitech dblist for config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547597 (https://phabricator.wikimedia.org/T233236) (owner: 10Jforrester) [18:29:46] (03CR) 10Ottomata: [C: 03+1] admin: add analytics-privatedata system user [puppet] - 10https://gerrit.wikimedia.org/r/550814 (https://phabricator.wikimedia.org/T238306) (owner: 10Elukey) [18:29:50] (03PS12) 10Jcrespo: bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - 10https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048) [18:30:28] (03PS1) 10Dzahn: add spare::system role to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/550902 (https://phabricator.wikimedia.org/T190568) [18:31:08] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.238 ge 1 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [18:31:13] andrewbogott: Since this is all wikitech-related I don't suppose testing on mwdebug1002 is valuable? 
[18:31:43] RoanKattouw: It's worth a spot check to make sure that it's only wikitech-related :) [18:31:43] !log phabricator (phab1003, prod server) - upgrade PHP version to 7.2.24 (T237239) [18:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:48] T237239: Upgrade to PHP 7.2.24 - https://phabricator.wikimedia.org/T237239 [18:31:50] But that's certainly the intent [18:32:11] (03CR) 10jerkins-bot: [V: 04-1] add spare::system role to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/550902 (https://phabricator.wikimedia.org/T190568) (owner: 10Dzahn) [18:32:26] Fair enough. On mwdebug1001 (not 1 not 2) for testing now [18:32:31] (03CR) 10Jcrespo: [C: 04-1] "Refactor everything properly first, and split into 2 patches (this got more complex than I initially thought it was going to be): refactor" [puppet] - 10https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [18:32:37] Let me know when you're ready for me to deploy [18:33:21] pages load and log out/in works... 
[18:33:26] that's about all I know to test [18:33:32] so, ready to deploy [18:33:56] +1 [18:34:07] !log scandium - restart php7.2-fpm [18:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:32] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)1 ge (W)0.2 ge 0.0125 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [18:35:17] OK, doing the dblists directory first because that seems to be the safest order [18:35:54] !log catrope@deploy1001 Synchronized dblists/: Add s10/s11 dblists for wikitechs (T233236) (duration: 00m 52s) [18:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:59] T233236: Move labtestwikitech database to clouddb2001-dev - https://phabricator.wikimedia.org/T233236 [18:36:12] Then there's a circular dependency between CommonSettings and db-*.php, but that only affects labtestwiki so I'll just sync all of wmf-config and if it causes hiccups for labstestwiki then oh well [18:36:43] sounds good to me Roan [18:36:45] wfm [18:37:42] !log catrope@deploy1001 Synchronized dblists/: Use s10/s11 dblists for wikitechs (T233236) (duration: 00m 51s) [18:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:28] Alright, all done [18:38:59] cdanis: wikitech seems to still load :) [18:39:06] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10RobH) a:05RobH→03BBlack ` All tests passed. Validation Code : 84413 ` So all testing has passed. I've gone ahead and powered down the host. Not sure on next steps, will need to sync... [18:39:30] indeed [18:39:39] https://wikitech.wikimedia.org/w/api.php?action=query&meta=siteinfo&formatversion=2&siprop=dbrepllag&sishowalldb=true output as expected [18:39:53] is that all the parts? 
[18:39:59] 10Operations, 10serviceops: Upgrade to PHP 7.2.24 - https://phabricator.wikimedia.org/T237239 (10Dzahn) 05Open→03Resolved @jijiki Done! @MoritzMuehlenhoff I see no more yellow or red in debmonitor for PHP 7.2 packages, only for PHP 7.0 packages. [18:40:12] there's some cleanup work in dbctl, just to remove the now-defunct 'wikitech' section [18:40:19] ok [18:40:31] pretty sure that scap doesn't push out to labtestwikitech so I'm going to re-sync that now [18:42:27] (03PS2) 10Dzahn: add spare::system role to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/550902 (https://phabricator.wikimedia.org/T190568) [18:42:42] cool, if that all works we can call the mwconfig parts done [18:43:53] cdanis: at least one bug that I had with labtestwikitech before is now resolved. [18:43:55] and pages load [18:44:02] cool [18:44:03] So I'm happy [18:44:15] Thank you cdanis, RoanKattouw, James_F [18:44:41] I would have expected the dblist change to cause some change in the output at https://noc.wikimedia.org/db.php [18:45:14] Cluster s10 should show 'labswiki' in its list of databases, and Cluster wikitech should not [18:45:47] I don't know how/when that page is refreshed [18:47:34] wait, RoanKattouw -- stashbot says you synchronized dblists/ twice? and not wmf-config? [18:47:39] or am I reading it wrong [18:47:46] Uhh what [18:47:58] Ugh you're right [18:48:02] Doing wmf-config now [18:49:02] !log catrope@deploy1001 Synchronized wmf-config/: Use s10/s11 dblists for wikitechs (for real this time) (T233236) (duration: 00m 52s) [18:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:06] T233236: Move labtestwikitech database to clouddb2001-dev - https://phabricator.wikimedia.org/T233236 [18:49:26] db.php looks correct now! 
[18:49:41] and wikitech still works [18:50:05] btw andrewbogott that page comes from docroot/noc/ in mediawiki-config [18:50:31] 'k [18:50:35] my tests are still working [18:50:45] great, now I think we're actually done, then 🙃 [18:51:36] thanks again [18:59:20] (03PS14) 10Herron: logstash: introduce logstash 7 and openjdk-11 support [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) [19:02:20] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T236331 (10Cmjohnson) Looks like this was approved this time around, @Jclark-ctr please keep an eye out for the disk in receiving [19:11:26] (03CR) 10Herron: logstash: introduce logstash 7 and openjdk-11 support (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [19:17:53] (03CR) 10Herron: [C: 03+1] "LGTM! straightforward way to address the errors without blocking on schema overhaul. I'd be pleasantly surprised to see this indeed used" [puppet] - 10https://gerrit.wikimedia.org/r/550806 (https://phabricator.wikimedia.org/T238196) (owner: 10Filippo Giunchedi) [19:18:18] akosiaris: ok thank you, will do some deploys now then! [19:20:54] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . [19:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:22] (03CR) 10Bstorm: Depool labsdb1011 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/550882 (https://phabricator.wikimedia.org/T237509) (owner: 10Andrew Bogott) [19:30:52] lol @jouncebot [19:31:33] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . 
[19:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:38] (03PS1) 10Gilles: Revoke old SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/550913 (https://phabricator.wikimedia.org/T238347) [19:33:26] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . [19:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:11] (03CR) 10Dzahn: [C: 03+2] Revoke old SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/550913 (https://phabricator.wikimedia.org/T238347) (owner: 10Gilles) [19:36:59] RECOVERY - WDQS high update lag on wdqs1004 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.031e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:38:43] (03PS3) 10Dzahn: add spare::system role to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/550902 (https://phabricator.wikimedia.org/T190568) [19:40:04] (03Abandoned) 10Dzahn: add spare::system role to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/550902 (https://phabricator.wikimedia.org/T190568) (owner: 10Dzahn) [19:46:29] (03PS1) 10Ottomata: Add LVS entries for eventgate-logging-external [dns] - 10https://gerrit.wikimedia.org/r/550914 (https://phabricator.wikimedia.org/T236386) [19:46:31] (03PS1) 10Ottomata: Add discovery entries for eventgate-logging-external [dns] - 10https://gerrit.wikimedia.org/r/550915 (https://phabricator.wikimedia.org/T236386) [19:47:23] (03PS1) 10CRusnov: coherence: Check device names for correct formatting [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550917 (https://phabricator.wikimedia.org/T237469) [19:48:14] (03CR) 10jerkins-bot: [V: 04-1] coherence: Check device names for correct formatting [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550917 (https://phabricator.wikimedia.org/T237469) 
(owner: 10CRusnov) [19:48:18] (03PS2) 10CRusnov: coherence: Alert on ACTIVE devices with names future- or spare. [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550051 (https://phabricator.wikimedia.org/T237464) [19:51:44] (03PS2) 10CRusnov: coherence: Check device names for correct formatting [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550917 (https://phabricator.wikimedia.org/T237469) [19:56:24] 10Operations, 10Core Platform Team, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (10matmarex) I ran `curl -I "https://ban.wikipedia.org/wiki/Mal:;"` in a loop for a while... [19:56:58] is there anything special about mw1273.eqiad.wmnet? https://phabricator.wikimedia.org/T238285#5664996 [19:57:15] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer [19:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:04] (03PS1) 10Ottomata: Add LVS for eventgate-logging-external [puppet] - 10https://gerrit.wikimedia.org/r/550922 (https://phabricator.wikimedia.org/T236386) [19:59:38] (03PS1) 10Ottomata: Add discovery for eventgate-logging-external [puppet] - 10https://gerrit.wikimedia.org/r/550923 (https://phabricator.wikimedia.org/T236386) [20:04:25] !log reloading data on wdqs1004 from wdqs1007 to catch up on lag faster - T238229 [20:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:29] T238229: WDQS is having high update lag for the last week - https://phabricator.wikimedia.org/T238229 [20:05:27] 10Operations, 10Core Platform Team, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (10Ladsgroup) I think this has to do something from differences between ATS and varnish no... 
[20:06:50] !log cdanis@cumin2001 dbctl commit (dc=all): 'remove now-defunct wikitech section T233236', diff saved to https://phabricator.wikimedia.org/P9639 and previous config saved to /var/cache/conftool/dbconfig/20191114-200649-cdanis.json [20:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:55] T233236: Move labtestwikitech database to clouddb2001-dev - https://phabricator.wikimedia.org/T233236 [20:27:27] 10Operations, 10Dumps-Generation: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (10ArielGlenn) Rync of both xmldatadumps/public and otherdumps from dumpsdata1002 to dumpsdata1003 is caught up as of earlier this evening. I'll be running these throughout the day tomorrow,... [20:34:49] 10Operations, 10Core Platform Team, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (10BBlack) >>! In T238285#5665013, @Ladsgroup wrote: > I think this has to do something wi... [20:36:52] (03PS1) 10RLazarus: httpbb: Correct baseurls.yaml for in-prod behavior. [puppet] - 10https://gerrit.wikimedia.org/r/550927 (https://phabricator.wikimedia.org/T236699) [20:45:40] (03Abandoned) 10RLazarus: [Testing only, don't merge] Install httpbb on appservers. [puppet] - 10https://gerrit.wikimedia.org/r/550752 (owner: 10RLazarus) [20:46:01] (03CR) 10CDanis: [C: 03+1] httpbb: Correct baseurls.yaml for in-prod behavior. 
[puppet] - 10https://gerrit.wikimedia.org/r/550927 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [20:46:15] (03CR) 10Ottomata: eventgate-analytics - stream config for new sparql-query streams (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/548764 (https://phabricator.wikimedia.org/T101013) (owner: 10Ottomata) [20:46:39] (03CR) 10Ottomata: "Lemme know when you are ready and we can deploy this" [deployment-charts] - 10https://gerrit.wikimedia.org/r/548764 (https://phabricator.wikimedia.org/T101013) (owner: 10Ottomata) [20:46:59] (03CR) 10CDanis: [C: 03+1] "LGTM with a question to think about, not answer immediately: do we need a mechanism, or at least a convention, to handle divergences like " [puppet] - 10https://gerrit.wikimedia.org/r/550927 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [21:08:40] (03PS1) 10Jhedden: install_server: switch cloudcephosd to flat partition layout [puppet] - 10https://gerrit.wikimedia.org/r/550930 (https://phabricator.wikimedia.org/T228102) [21:11:11] (03CR) 10Jhedden: [C: 03+2] install_server: switch cloudcephosd to flat partition layout [puppet] - 10https://gerrit.wikimedia.org/r/550930 (https://phabricator.wikimedia.org/T228102) (owner: 10Jhedden) [21:14:29] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [21:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:35] (03PS2) 10RLazarus: httpbb: Correct baseurls.yaml for in-prod behavior. 
[puppet] - 10https://gerrit.wikimedia.org/r/550927 (https://phabricator.wikimedia.org/T236699) [21:14:43] (03CR) 10RLazarus: [C: 03+2] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/550927 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [21:14:57] PROBLEM - WDQS high update lag on wdqs1007 is CRITICAL: 4479 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [21:15:05] jouncebot: next [21:15:05] In 1 hour(s) and 44 minute(s): Evening SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191114T2300) [21:15:09] i am going to do a parsoid deploy now. should i wait o anything else? [21:15:25] (03CR) 10Ottomata: "Ah ha! I am just now learning about transfer.py!!!! This is so useful thank you so much!!!" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/500043 (https://phabricator.wikimedia.org/T219631) (owner: 10Jcrespo) [21:16:37] jouncebot: now [21:16:37] No deployments scheduled for the next 1 hour(s) and 43 minute(s) [21:16:45] ACKNOWLEDGEMENT - WDQS high update lag on wdqs1007 is CRITICAL: 4581 ge 3600 Gehel data transfer just completed https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [21:16:49] subbu I don't think anyone is doing anything [21:17:07] ok. ty. [21:19:11] wikibugs: are you ok? 
[21:19:26] i still want to restart gerrit but i'll find a better time [21:19:47] I just uploaded 12 patches at once, it may have killed it [21:19:48] (03PS2) 10BBlack: VCL: Move analytics hooks above beacon synth point [puppet] - 10https://gerrit.wikimedia.org/r/550826 [21:19:50] (03PS1) 10BBlack: VCL: Remove host regex from TLS redirect and STS [puppet] - 10https://gerrit.wikimedia.org/r/550931 (https://phabricator.wikimedia.org/T133548) [21:19:52] oh wait there it goes [21:19:53] lol [21:20:04] it has a killswitch [21:20:16] i was already on the page with the restart docs [21:20:17] probably it won't report them all [21:20:26] (03PS1) 10BBlack: Add protocol to TLS analytics fields [puppet] - 10https://gerrit.wikimedia.org/r/550932 (https://phabricator.wikimedia.org/T233661) [21:20:35] mutante: unplug then plug again? Not that hard ;) [21:20:39] (03PS1) 10BBlack: Add session to TLS analytics fields [puppet] - 10https://gerrit.wikimedia.org/r/550933 (https://phabricator.wikimedia.org/T233661) [21:20:41] (03PS1) 10BBlack: varnishxcps decom: undefine the mtail program [puppet] - 10https://gerrit.wikimedia.org/r/550934 [21:20:43] (03PS1) 10BBlack: varnishxcps decom: remove xcps ref [puppet] - 10https://gerrit.wikimedia.org/r/550935 [21:20:45] (03PS1) 10BBlack: varnishxcps decom: remove manifest [puppet] - 10https://gerrit.wikimedia.org/r/550936 [21:20:47] (03PS1) 10BBlack: varnishxcps decom: remove mtail prog/tests [puppet] - 10https://gerrit.wikimedia.org/r/550937 [21:20:49] (03PS1) 10BBlack: varnishxcps decom: remove global prom rules [puppet] - 10https://gerrit.wikimedia.org/r/550938 [21:20:57] (03PS1) 10BBlack: varnishxcps decom: remove mtail log outputs from VCL [puppet] - 10https://gerrit.wikimedia.org/r/550939 [21:21:01] (03PS1) 10BBlack: TLS analytics: simplify variable scheme [puppet] - 10https://gerrit.wikimedia.org/r/550940 [21:21:03] (03PS1) 10BBlack: TLS analytics: simplify logic for the present [puppet] - 10https://gerrit.wikimedia.org/r/550941 
[21:21:05] (03CR) 10RLazarus: [C: 03+2] httpbb: Correct baseurls.yaml for in-prod behavior. [puppet] - 10https://gerrit.wikimedia.org/r/550927 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [21:21:13] hauskater: "ssh to toolforge, load the right ssh key, lookup the bastion name, become the right user..." and then what you said :p [21:21:36] ssh -a username@login.tools.wmflabs.org [21:21:44] no wait.. "stop all the jobs" "poll qstat" [21:21:53] then start_jobs again [21:22:13] I use qstat and qdel on each job I want to stop [21:22:13] "use qdel " after you saw the IDs [21:22:31] hauskater: there you go.. not really "systemctl restart" [21:22:50] but, wait, gerrit restart on Toolforge... ? [21:23:04] it's gerrit1001.eqiad.wmnet iirc [21:23:13] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [21:23:55] hauskater: i was talking about wikibugs [21:24:05] !log ssastry@deploy1001 Started deploy [parsoid/deploy@150f9af]: Updating Parsoid to 74203415 [21:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:48] mutante: you said gerrit ;) [21:25:03] "i still want to restart gerrit but i'll find a better time " [21:25:40] hauskater: that too.. but also a response to "wikibugs are you ok" [21:25:46] unrelated things [21:25:51] I can restart wikibugs if needed but it looks it's working fine [21:26:30] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Anycast recdns - https://phabricator.wikimedia.org/T186550 (10Jgreen) [21:26:50] hauskater: thanks, yes, it working [21:26:52] 10Operations, 10SRE-tools, 10Traffic, 10Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (10BBlack) Seems sane! The only thing I'm a little iffy about iis from the "SHA1 written to etcd" onwards. 
I'm not sure it's a bad approach, bu... [21:27:24] So next week there are no Train but will we have Deployments or those are also halted? [21:27:40] SWAT, etc. I mean [21:30:28] 10Operations, 10SRE-tools, 10Traffic, 10Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (10crusnov) >>! In T233183#5665281, @BBlack wrote: > Seems sane! The only thing I'm a little iffy about iis from the "SHA1 written to etcd" onwa... [21:32:26] !log ssastry@deploy1001 Finished deploy [parsoid/deploy@150f9af]: Updating Parsoid to 74203415 (duration: 08m 21s) [21:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:33] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:36:23] ACKNOWLEDGEMENT - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] daniel_zahn not actionable https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:39:04] (03CR) 10BBlack: [C: 04-1] vcl: Use synthetic warning for 1% of TLSv1/TLSv1.1 pageviews (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/550856 (https://phabricator.wikimedia.org/T238038) (owner: 10Vgutierrez) [21:39:46] (03CR) 10BBlack: [C: 03+1] vcl: Bump TLSv1/TLSv1.1 pageview replacement to 4% [puppet] - 10https://gerrit.wikimedia.org/r/550868 (https://phabricator.wikimedia.org/T238038) (owner: 10Vgutierrez) [21:39:51] (03CR) 10BBlack: [C: 03+1] vcl: Bump TLSv1/TLSv1.1 pageview replacement to 10% [puppet] - 10https://gerrit.wikimedia.org/r/550869 (https://phabricator.wikimedia.org/T238038) (owner: 10Vgutierrez) [21:39:56] (03CR) 
10BBlack: [C: 03+1] vcl: Bump TLSv1/TLSv1.1 pageview replacement to 100% [puppet] - 10https://gerrit.wikimedia.org/r/550870 (https://phabricator.wikimedia.org/T238038) (owner: 10Vgutierrez) [21:43:08] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10BBlack) a:05BBlack→03RobH I don't think there's anything else we can do either. We can't keep it alive booted into an OS for very long before we get a Linux kernel crash in the network... [21:43:12] (03PS1) 10Alexandros Kosiaris: zuul: Remove zuul Gearman queue alert [puppet] - 10https://gerrit.wikimedia.org/r/550943 [21:44:06] (03PS1) 10CDanis: dbctl: remove now-obsolete 'wikitech' section [puppet] - 10https://gerrit.wikimedia.org/r/550946 (https://phabricator.wikimedia.org/T233236) [21:47:49] (03CR) 10CDanis: [C: 03+2] dbctl: remove now-obsolete 'wikitech' section [puppet] - 10https://gerrit.wikimedia.org/r/550946 (https://phabricator.wikimedia.org/T233236) (owner: 10CDanis) [21:57:51] (03CR) 10Cwhite: [C: 04-1] "getting there! please see inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [21:58:02] 10Operations, 10serviceops, 10PHP 7.2 support: Mysterious, coordinated slowdowns every ~ 25 minutes on API servers - https://phabricator.wikimedia.org/T231011 (10Joe) @Theklan might be interested in watching this task too. 
[21:58:14] (03PS2) 10BBlack: TLS analytics: simplify variable scheme [puppet] - 10https://gerrit.wikimedia.org/r/550940 [21:58:16] (03PS2) 10BBlack: TLS analytics: simplify logic for the present [puppet] - 10https://gerrit.wikimedia.org/r/550941 [22:01:20] (03CR) 10Dzahn: [C: 03+1] "+1, not actionable" [puppet] - 10https://gerrit.wikimedia.org/r/550943 (owner: 10Alexandros Kosiaris) [22:03:29] (03PS4) 10Giuseppe Lavagetto: Also check charts generated by helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/549059 [22:04:39] (03CR) 10Giuseppe Lavagetto: Also check charts generated by helmfile (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/549059 (owner: 10Giuseppe Lavagetto) [22:05:15] (03PS2) 10Alexandros Kosiaris: zuul: Remove zuul Gearman queue alert [puppet] - 10https://gerrit.wikimedia.org/r/550943 [22:05:44] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/550943 (owner: 10Alexandros Kosiaris) [22:16:21] 10Operations, 10Traffic: Renew and deploy GlobalSign unified cert (2019) - https://phabricator.wikimedia.org/T237650 (10Seb35) Thanks for the detailled explanation, it’s interesting. I have no further details from the user I reported the issue. 
[22:17:37] (03PS1) 10CDanis: dbctl: rename 'wikitech' to 's10' to match prod [software/conftool] - 10https://gerrit.wikimedia.org/r/550958 (https://phabricator.wikimedia.org/T233236) [22:20:03] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:24:26] (03PS4) 10Jforrester: Variant configuration: Generate dblists from YAML [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545411 (https://phabricator.wikimedia.org/T223602) [22:25:18] (03CR) 10jerkins-bot: [V: 04-1] Variant configuration: Generate dblists from YAML [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545411 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [22:27:06] (03CR) 10CDanis: [C: 03+2] dbctl: rename 'wikitech' to 's10' to match prod [software/conftool] - 10https://gerrit.wikimedia.org/r/550958 (https://phabricator.wikimedia.org/T233236) (owner: 10CDanis) [22:28:56] (03PS8) 10Dzahn: gerrit: refactor, move java setup to separate class [puppet] - 10https://gerrit.wikimedia.org/r/548554 [22:29:51] (03Merged) 10jenkins-bot: dbctl: rename 'wikitech' to 's10' to match prod [software/conftool] - 10https://gerrit.wikimedia.org/r/550958 (https://phabricator.wikimedia.org/T233236) (owner: 10CDanis) [22:36:29] (03PS9) 10Dzahn: gerrit: refactor, move java setup to separate class [puppet] - 10https://gerrit.wikimedia.org/r/548554 [22:38:20] (03CR) 10jerkins-bot: [V: 04-1] gerrit: refactor, move java setup to separate class [puppet] - 10https://gerrit.wikimedia.org/r/548554 (owner: 10Dzahn) [22:46:49] RECOVERY - WDQS high update lag on wdqs1007 is OK: (C)3600 ge (W)1200 ge 1182 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [22:47:33] (03CR) 
10Nuria: "Looks fine on our end. Since indexing in druid/turnilo is not automagical (as fields are called out explicitly), when you are fully done w" [puppet] - 10https://gerrit.wikimedia.org/r/550933 (https://phabricator.wikimedia.org/T233661) (owner: 10BBlack) [23:00:05] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191114T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:02:01] well then.. time for gerrit restart now [23:02:32] (03PS10) 10Dzahn: gerrit: refactor, move java setup to separate class [puppet] - 10https://gerrit.wikimedia.org/r/548554 [23:03:43] !log restarting gerrit to increase defaultThreadPoolSize to 2 [23:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:19] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:04:36] ^ tries to git pull.. will recover [23:05:22] gerrit back and now using 2 threads .. so when GC kicks in there should be still one thread left as opposed to before when it would slow everything down [23:05:27] paladox: ^ [23:05:33] thanks mutante! [23:05:45] (03CR) 10jerkins-bot: [V: 04-1] gerrit: refactor, move java setup to separate class [puppet] - 10https://gerrit.wikimedia.org/r/548554 (owner: 10Dzahn) [23:05:51] thanks for the patch. 
that was a great idea [23:06:23] :) [23:06:32] it was upstream that gave me the idea [23:07:32] *nod*, i looked at the link [23:07:36] time for dinner here now [23:08:12] :) [23:10:36] the an-coord alert is actually unrelated to gerrit restart..just coincidence [23:10:52] failed service is hdfs-cleaner [23:12:17] (03CR) 10Dzahn: "service restarted now" [puppet] - 10https://gerrit.wikimedia.org/r/550776 (owner: 10Paladox)