[00:00:04] twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200709T0000). [00:16:26] (03PS1) 10CDanis: varnish: Set-Cookie is extra-super-mega-uncacheable [puppet] - 10https://gerrit.wikimedia.org/r/610434 (https://phabricator.wikimedia.org/T256395) [00:21:55] (03PS2) 10CDanis: varnish: Set-Cookie is extra-super-mega-uncacheable [puppet] - 10https://gerrit.wikimedia.org/r/610434 (https://phabricator.wikimedia.org/T256395) [00:36:16] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 23541672 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:37:30] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 25319496 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:39:24] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 110392 and 39 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:40:00] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1304 and 75 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:47:19] (03PS3) 10CDanis: varnish: Set-Cookie is extra-super-mega-uncacheable [puppet] - 10https://gerrit.wikimedia.org/r/610434 (https://phabricator.wikimedia.org/T256395) [00:48:37] (03CR) 10CDanis: [C: 03+2] varnish: Set-Cookie is extra-super-mega-uncacheable [puppet] - 10https://gerrit.wikimedia.org/r/610434 (https://phabricator.wikimedia.org/T256395) (owner: 10CDanis) [00:49:06] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕘🍺 sudo cumin A:cp 'disable-puppet "cdanis deploying I6c1b646e T256395"' [00:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[00:51:50] (03PS1) 10DannyS712: DeprecatablePropertyArray: Use MW_VERSION instead of array_key_exists [extensions/Translate] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/610103 (https://phabricator.wikimedia.org/T257531) [00:54:46] (03CR) 10Ppchelko: "I don't think it really worths back porting. It's not actually creating any real trouble, just a fairly rare logspam. We can leave it here" [extensions/Translate] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/610103 (https://phabricator.wikimedia.org/T257531) (owner: 10DannyS712) [00:57:14] (03PS1) 10Dave Pifke: xhgui: Pin php-twig at version 1.* [puppet] - 10https://gerrit.wikimedia.org/r/610446 (https://phabricator.wikimedia.org/T254310) [00:58:15] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕘🍺 sudo cumin A:cp 'enable-puppet "cdanis deploying I6c1b646e T256395"' [00:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:43:55] !log reset email for GseSro [01:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:48:32] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:48:34] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:14:08] 10Operations, 10Pywikibot: http://pywikibot.org/ is displaying Wikimedia error page - https://phabricator.wikimedia.org/T257536 (10Legoktm) [03:16:41] (03PS1) 10BryanDavis: ncredir: Update mappings for pywikibot redirects [puppet] - 10https://gerrit.wikimedia.org/r/610548 (https://phabricator.wikimedia.org/T234617) [03:21:01] 10Operations, 10Pywikibot, 10Patch-For-Review, 10cloud-services-team (Kanban): http://pywikibot.org/ is displaying Wikimedia 
error page - https://phabricator.wikimedia.org/T257536 (10bd808) [03:21:29] (03CR) 10Legoktm: [C: 03+1] ncredir: Update mappings for pywikibot redirects [puppet] - 10https://gerrit.wikimedia.org/r/610548 (https://phabricator.wikimedia.org/T234617) (owner: 10BryanDavis) [03:37:05] 10Operations, 10Pywikibot, 10Traffic, 10HTTPS: Configure HTTPS for pywikibot.org - https://phabricator.wikimedia.org/T257537 (10Legoktm) [04:34:56] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:40:32] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:53:00] PROBLEM - Thanos compact is halted on icinga1001 is CRITICAL: cluster=thanos instance=thanos-fe2001 job=thanos-compact prometheus=ops site=codfw https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [05:10:42] !log Deploy schema change on s5 codfw, lag will be generated - T238966 [05:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:48] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. 
- https://phabricator.wikimedia.org/T238966 [05:11:46] !log Remove revision triggers from db2093:3315 T238966 [05:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:13:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1101:3317', diff saved to https://phabricator.wikimedia.org/P11821 and previous config saved to /var/cache/conftool/dbconfig/20200709-051355-marostegui.json [05:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1084', diff saved to https://phabricator.wikimedia.org/P11822 and previous config saved to /var/cache/conftool/dbconfig/20200709-051826-marostegui.json [05:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:50] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:22:23] (03PS1) 10Marostegui: db1084: Remove it from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/610597 (https://phabricator.wikimedia.org/T257540) [05:23:21] (03CR) 10Marostegui: [C: 03+2] db1084: Remove it from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/610597 (https://phabricator.wikimedia.org/T257540) (owner: 10Marostegui) [05:25:26] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:31:58] (03PS1) 10Marostegui: mariadb: Move db1084 from s4 to m2 [puppet] - 10https://gerrit.wikimedia.org/r/610598 (https://phabricator.wikimedia.org/T257540) [05:32:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1084 from dbctl', diff saved to https://phabricator.wikimedia.org/P11823 and previous config saved to 
/var/cache/conftool/dbconfig/20200709-053206-marostegui.json [05:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:29] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Move db1084 from s4 to m2 [puppet] - 10https://gerrit.wikimedia.org/r/610598 (https://phabricator.wikimedia.org/T257540) (owner: 10Marostegui) [05:33:49] (03CR) 10Marostegui: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/606549 (https://phabricator.wikimedia.org/T254158) (owner: 10Dzahn) [05:34:10] (03CR) 10Marostegui: [C: 03+1] mariadb: remove ferm firewall hole for gerrit servers [puppet] - 10https://gerrit.wikimedia.org/r/609884 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [05:39:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1101:3317', diff saved to https://phabricator.wikimedia.org/P11824 and previous config saved to /var/cache/conftool/dbconfig/20200709-053905-marostegui.json [05:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:24] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:01:00] 10Operations, 10ops-eqiad: Interface errors on asw2-d-eqiad:xe-7/0/0 (ms-be1037) - https://phabricator.wikimedia.org/T257541 (10ayounsi) p:05Triage→03Medium [06:06:04] 10Operations, 10ops-eqiad: Interface errors on asw2-b-eqiad:ge-5/0/35 (kubernetes1010) - https://phabricator.wikimedia.org/T257542 (10ayounsi) p:05Triage→03Medium [06:08:28] 10Operations, 10SRE-Access-Requests: Requesting access to centralauth database for Jennifer Wang - https://phabricator.wikimedia.org/T255836 (10jcrespo) > It is likely that @jwang is already part of one or both of these groups. Is there a way to confirm that? Indeed, I can see the user is already as part of `... 
[06:27:24] (03PS1) 10Elukey: superset: set sqllab timeout to 60s [puppet] - 10https://gerrit.wikimedia.org/r/610656 [06:28:53] (03CR) 10Elukey: [C: 03+2] superset: set sqllab timeout to 60s [puppet] - 10https://gerrit.wikimedia.org/r/610656 (owner: 10Elukey) [06:32:57] (03PS2) 10Marostegui: mariadb: Move db1084 from s4 to m2 [puppet] - 10https://gerrit.wikimedia.org/r/610598 (https://phabricator.wikimedia.org/T257540) [06:33:32] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Move db1084 from s4 to m2 [puppet] - 10https://gerrit.wikimedia.org/r/610598 (https://phabricator.wikimedia.org/T257540) (owner: 10Marostegui) [06:33:58] (03CR) 10Marostegui: [V: 03+2 C: 03+2] "The -1 is a known issue with misc puppet code, that needs refactoring." [puppet] - 10https://gerrit.wikimedia.org/r/610598 (https://phabricator.wikimedia.org/T257540) (owner: 10Marostegui) [06:45:42] (03CR) 10Ayounsi: [C: 03+1] "Hard to review it, but the PCC diff looks good based on the LibreNMS doc. I don't know the exact variables needed by SSO though." 
[puppet] - 10https://gerrit.wikimedia.org/r/610291 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [06:49:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [06:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [06:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:30] (03PS1) 10Marostegui: install_server: Format db1084 [puppet] - 10https://gerrit.wikimedia.org/r/610675 (https://phabricator.wikimedia.org/T257540) [06:59:33] (03CR) 10Marostegui: [C: 03+2] install_server: Format db1084 [puppet] - 10https://gerrit.wikimedia.org/r/610675 (https://phabricator.wikimedia.org/T257540) (owner: 10Marostegui) [07:00:13] (03CR) 10ZPapierski: query_service: Remove more hardcoding of wdqs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/610401 (owner: 10Ebernhardson) [07:04:08] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:08:11] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:15:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [07:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:45] (03CR) 10Jcrespo: "Looking good, a few comments, all small corrections/suggestions." (036 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [07:25:04] (03CR) 10Jcrespo: "One last thought." (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [07:26:38] (03PS3) 10Ryan Kemper: wdqs: remove wdqs200[78] from "role(insetup)" [puppet] - 10https://gerrit.wikimedia.org/r/595022 (owner: 10Gehel) [07:27:52] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] "I let this get stale, this change is really simple so let's get it out!" [puppet] - 10https://gerrit.wikimedia.org/r/595022 (owner: 10Gehel) [07:29:38] (03CR) 10Ryan Kemper: "`sudo puppet-merge` completed w/o issue" [puppet] - 10https://gerrit.wikimedia.org/r/595022 (owner: 10Gehel) [07:42:28] (03CR) 10Jcrespo: "A few minor comments. Description should also say that this (I think) fixes the arbitrary order issue (right?) by using find for checksumm" (032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [07:46:46] (03PS1) 10Muehlenhoff: Icinga: Add permissions also for ayounsi [puppet] - 10https://gerrit.wikimedia.org/r/610699 [07:52:25] (03CR) 10Jcrespo: [C: 04-2] "no rm -r on code, big no. See comment below." 
(031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/608274 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [07:54:53] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:57:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1136', diff saved to https://phabricator.wikimedia.org/P11825 and previous config saved to /var/cache/conftool/dbconfig/20200709-075749-marostegui.json [07:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:29] (03PS3) 10ZPapierski: query_service: Remove more hardcoding of wdqs [puppet] - 10https://gerrit.wikimedia.org/r/610401 (owner: 10Ebernhardson) [07:59:03] (03CR) 10Jcrespo: [C: 04-2] "The idea looks good, but needs some fixes." 
(033 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/608274 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [07:59:41] !log Stop db1117:3322 to clone db1084, this will trigger haproxy alerts - T257540 [07:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:46] T257540: Upgrade m2 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T257540 [08:02:01] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:02:32] (03CR) 10Muehlenhoff: [C: 03+2] Stop installing git-lfs from stretch-backports [puppet] - 10https://gerrit.wikimedia.org/r/610015 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [08:03:15] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [08:03:21] ^ expected [08:03:21] PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [08:03:27] same ^ [08:03:45] ACKNOWLEDGEMENT - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy [08:03:45] ACKNOWLEDGEMENT - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy [08:03:50] (03PS1) 10Elukey: Revert "Set BigTop for the Hadoop test cluster" [puppet] - 10https://gerrit.wikimedia.org/r/610105 [08:04:08] (03CR) 10DCausse: query_service: Remove more hardcoding of wdqs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/610401 (owner: 10Ebernhardson) [08:04:39] (03CR) 10Elukey: [C: 
03+2] Revert "Set BigTop for the Hadoop test cluster" [puppet] - 10https://gerrit.wikimedia.org/r/610105 (owner: 10Elukey) [08:06:33] (03PS1) 10Ayounsi: msw1-eqiad: disable LLDP to mr1 [homer/public] - 10https://gerrit.wikimedia.org/r/610703 [08:07:06] (03CR) 10Ayounsi: [C: 03+2] msw1-eqiad: disable LLDP to mr1 [homer/public] - 10https://gerrit.wikimedia.org/r/610703 (owner: 10Ayounsi) [08:07:33] (03Merged) 10jenkins-bot: msw1-eqiad: disable LLDP to mr1 [homer/public] - 10https://gerrit.wikimedia.org/r/610703 (owner: 10Ayounsi) [08:07:44] 10Operations, 10Analytics-Clusters, 10netops, 10Patch-For-Review: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10elukey) For this quarter I'd propose to stop the work on moving netflow to eventgate/mep, keeping the current 'ad-hoc' configuration, and then re-evaluat... [08:08:09] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:09:40] (03PS4) 10ZPapierski: Correct url and path for nginx OAuth 1.0a [puppet] - 10https://gerrit.wikimedia.org/r/609909 (https://phabricator.wikimedia.org/T251498) [08:10:45] (03PS1) 10Ayounsi: Disable IGMP snooping on msw [homer/public] - 10https://gerrit.wikimedia.org/r/610704 [08:11:07] !log disable igmp snooping on msw1-codfw [08:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:44] (03CR) 10Ayounsi: [C: 03+2] Disable IGMP snooping on msw [homer/public] - 10https://gerrit.wikimedia.org/r/610704 (owner: 10Ayounsi) [08:12:20] (03Merged) 10jenkins-bot: Disable IGMP snooping on msw [homer/public] - 10https://gerrit.wikimedia.org/r/610704 (owner: 10Ayounsi) [08:13:43] !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster [08:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:05] (03CR) 10Privacybatm: "Thank you 
for the review. Please see my comments." (037 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [08:16:02] (03PS15) 10Privacybatm: transferpy: Generate checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) [08:16:56] 10Operations, 10Pywikibot, 10Traffic, 10HTTPS: Configure HTTPS for pywikibot.org - https://phabricator.wikimedia.org/T257537 (10Vgutierrez) our acme-chief production environment uses dns-01 challenges to validate domain ownership against Let's Encrypt. In order to be able to issue a certificate for pywikib... [08:19:13] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:20:32] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) [08:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:32] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro [08:22:34] (03CR) 10Jbond: [C: 03+2] profile::librenms: update to use lookup instead of hiera call [puppet] - 10https://gerrit.wikimedia.org/r/610018 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [08:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:36] (03PS1) 10Ryan Kemper: maps: remove no-longer-used slave files [labs/private] - 10https://gerrit.wikimedia.org/r/610705 (https://phabricator.wikimedia.org/T254646) [08:22:39] (03CR) 10Jcrespo: "Answer." 
(031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [08:22:41] (03CR) 10Jbond: [C: 03+2] librenms: add support for apereo cas [puppet] - 10https://gerrit.wikimedia.org/r/610030 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [08:22:44] (03CR) 10Jbond: [C: 03+2] librenms: update librenms to use apereo_cas SSO [puppet] - 10https://gerrit.wikimedia.org/r/610291 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [08:23:47] !log imported osm2pgsql 0.96.0+ds-1~bpo9+1 to "main" component T256877 [08:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:51] T256877: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 [08:23:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1136', diff saved to https://phabricator.wikimedia.org/P11827 and previous config saved to /var/cache/conftool/dbconfig/20200709-082355-marostegui.json [08:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:04] (03CR) 10DCausse: query_service: Remove more hardcoding of wdqs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/610401 (owner: 10Ebernhardson) [08:26:09] (03CR) 10Privacybatm: "I did not do any fix for the arbitrary checksum order issue. This find command was there before. I have resolved the other comments. 
Thank" (032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [08:26:23] (03PS11) 10DCausse: [wcqs] gui custom config [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles) [08:26:25] (03PS4) 10DCausse: query_service: Remove more hardcoding of wdqs [puppet] - 10https://gerrit.wikimedia.org/r/610401 (owner: 10Ebernhardson) [08:26:27] (03PS6) 10Privacybatm: Transferer.py: Calculate source checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979) [08:27:43] (03PS1) 10Muehlenhoff: No longer install osm2pgsql from stretch-backports [puppet] - 10https://gerrit.wikimedia.org/r/610706 (https://phabricator.wikimedia.org/T256877) [08:29:27] (03PS12) 10DCausse: [wcqs] gui custom config [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles) [08:29:29] (03PS5) 10DCausse: query_service: Remove more hardcoding of wdqs [puppet] - 10https://gerrit.wikimedia.org/r/610401 (owner: 10Ebernhardson) [08:31:23] (03PS1) 10Muehlenhoff: lxc: Remove jessie compat code [puppet] - 10https://gerrit.wikimedia.org/r/610707 [08:33:22] (03PS1) 10Jbond: idp: add librenms service [puppet] - 10https://gerrit.wikimedia.org/r/610708 [08:33:56] (03PS2) 10Jdrewniak: Enable Quicksurveys for Desktop Improvements Project. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/609850 (https://phabricator.wikimedia.org/T246977) [08:34:08] (03CR) 10Jbond: [C: 03+2] idp: add librenms service [puppet] - 10https://gerrit.wikimedia.org/r/610708 (owner: 10Jbond) [08:34:58] (03PS5) 10ZPapierski: Correct url and path for nginx OAuth 1.0a [puppet] - 10https://gerrit.wikimedia.org/r/609909 (https://phabricator.wikimedia.org/T251498) [08:40:33] 10Operations: Upload git 2.20 package from stretch-backports to component/git - https://phabricator.wikimedia.org/T257308 (10hashar) I gave it a try, it seems it is working as intended: ` $ apt-cache policy git git: Installed: 1:2.20.1-2+deb10u3~wmf1 Candidate: 1:2.20.1-2+deb10u3~wmf1 Version table: *** 1... [08:42:33] (03PS1) 10Marostegui: install_server: Reimage dbproxy107 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/610712 (https://phabricator.wikimedia.org/T255408) [08:44:02] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10fgiunchedi) I had a very quick look at the host from the mgmt console, looks like at least the disks don't show up with t... 
[08:44:18] !log Stop haproxy on dbproxy1017 before upgrading to buster - T255408 [08:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:23] T255408: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 [08:52:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1079', diff saved to https://phabricator.wikimedia.org/P11828 and previous config saved to /var/cache/conftool/dbconfig/20200709-085228-marostegui.json [08:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:03] 10Operations, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10MoritzMuehlenhoff) [09:00:23] (03PS13) 10DCausse: [wcqs] gui custom config [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles) [09:00:25] (03PS6) 10DCausse: query_service: Remove more hardcoding of wdqs [puppet] - 10https://gerrit.wikimedia.org/r/610401 (owner: 10Ebernhardson) [09:01:43] (03CR) 10Privacybatm: "Thank you for your comments, Please have a look at my new patch." 
(033 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/608274 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [09:01:56] (03PS3) 10Privacybatm: Firewall.py: Solve auto port detection concurrency issue [software/transferpy] - 10https://gerrit.wikimedia.org/r/608274 (https://phabricator.wikimedia.org/T256450) [09:06:59] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/610420 (owner: 10Andrew Bogott) [09:07:28] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro (exit_code=0) [09:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops, 10cloud-services-team (Hardware): (Need By: 2020-06-12) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (10ayounsi) Note that you can start connecting the servers to the switches, if needed: * cloudcephosd: eth0:cloud-hosts1-... [09:10:48] (03PS1) 10Giuseppe Lavagetto: restbase: switch to use https, envoy for proton [puppet] - 10https://gerrit.wikimedia.org/r/610720 [09:12:22] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:12:59] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage dbproxy107 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/610712 (https://phabricator.wikimedia.org/T255408) (owner: 10Marostegui) [09:13:42] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:14:02] RECOVERY - MediaWiki exceptions and 
fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:14:23] (03PS1) 10Elukey: sre.hadoop.change-distro: modify restart procedure and remove previous state [cookbooks] - 10https://gerrit.wikimedia.org/r/610721 (https://phabricator.wikimedia.org/T244499) [09:15:35] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/23788/restbase1025.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/610720 (owner: 10Giuseppe Lavagetto) [09:16:28] (03PS1) 10Marostegui: dbproxy1017: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/610723 [09:16:30] (03CR) 10Elukey: [C: 03+2] sre.hadoop.change-distro: modify restart procedure and remove previous state [cookbooks] - 10https://gerrit.wikimedia.org/r/610721 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [09:16:46] (03CR) 10Alexandros Kosiaris: [C: 03+1] restbase: switch to use https, envoy for proton [puppet] - 10https://gerrit.wikimedia.org/r/610720 (owner: 10Giuseppe Lavagetto) [09:17:10] (03CR) 10Marostegui: [C: 03+2] dbproxy1017: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/610723 (owner: 10Marostegui) [09:17:10] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:18:04] (03Abandoned) 10Alexandros Kosiaris: proton: Switch restbase production to TLS [puppet] - 10https://gerrit.wikimedia.org/r/607536 (https://phabricator.wikimedia.org/T225680) (owner: 10Alexandros Kosiaris) [09:19:16] (03CR) 10Giuseppe Lavagetto: [C: 03+2] restbase: switch to use https, envoy for proton [puppet] - 10https://gerrit.wikimedia.org/r/610720 (owner: 10Giuseppe Lavagetto) [09:21:27] <_joe_> akosiaris: testing on rb1025 [09:21:33] <_joe_> I'll also restart restbase [09:21:46] <_joe_> we can do this progressively btw [09:21:53] <_joe_> one rb node at a time [09:22:00] <_joe_> as the config change doesn't cause a restart [09:22:05] <_joe_> thankfully :P [09:22:44] 👍 [09:23:26] <_joe_> uhm I get "no healthy upstream" from envoy [09:24:18] <_joe_> oh I see, sorry, I need another patch [09:24:42] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [09:24:49] (03CR) 10Privacybatm: "> Patch Set 15:" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [09:24:52] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [09:25:40] (03PS1) 10Giuseppe Lavagetto: services_proxy: use https service for proton [puppet] - 10https://gerrit.wikimedia.org/r/610724 [09:26:04] (03CR) 10Elukey: Add analytics-product system user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595540 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [09:26:15] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] services_proxy: use https service for proton [puppet] - 10https://gerrit.wikimedia.org/r/610724 (owner: 10Giuseppe Lavagetto) [09:27:42] !log bounce thanos-compact on thanos-fe2001 
[09:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:55] <_joe_> !log restarting restbase on restbase1025 to pick up the switch to k8s of proton [09:28:56] RECOVERY - Thanos compact is halted on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:30] <_joe_> akosiaris: you should see requests flowing in ~ 1 minute [09:29:38] perfect [09:29:47] <_joe_> give restbase time to startup :P [09:31:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [09:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:23] 10Operations, 10DBA, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10Marostegui) @jbond were you able to check if the connection works fine and the application can reach the DB via the proxy? [09:33:36] (03PS1) 10Privacybatm: transfer.py: It is a test code for multiprocess transferpy [software/transferpy] - 10https://gerrit.wikimedia.org/r/610750 (https://phabricator.wikimedia.org/T248256) [09:34:01] (03CR) 10jerkins-bot: [V: 04-1] transfer.py: It is a test code for multiprocess transferpy [software/transferpy] - 10https://gerrit.wikimedia.org/r/610750 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [09:34:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:55] <_joe_> akosiaris: let's do a couple more servers? 
[09:35:08] +1 [09:35:32] (03CR) 10Privacybatm: [C: 04-1] "this is just a test" [software/transferpy] - 10https://gerrit.wikimedia.org/r/610750 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [09:36:02] PROBLEM - Thanos compact has not run on icinga1001 is CRITICAL: 4.429e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:36:44] <_joe_> !log restarting restbase on rb1026,1027 to switch to proton on k8s [09:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:50] PROBLEM - Check no envoy runtime configuration is left persistent on restbase1025 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 388 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [09:38:12] _joe_: ^ ? [09:38:21] <_joe_> oh yes [09:38:26] <_joe_> it's me messing with logging levels [09:38:34] <_joe_> it will go away as soon as I reload envoy [09:38:34] HTTP critical and then 200 is counter intuitive btw [09:38:48] <_joe_> damn icinga check_http :P [09:38:53] :) [09:39:05] (03PS4) 10Elukey: Add analytics-product system user [puppet] - 10https://gerrit.wikimedia.org/r/595540 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [09:39:08] <_joe_> that's what's printing that damn stuff :P [09:40:04] RECOVERY - Thanos compact has not run on icinga1001 is OK: (C)24 ge (W)12 ge 0.01255 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:40:17] (03CR) 10jerkins-bot: [V: 04-1] Add analytics-product system user [puppet] - 10https://gerrit.wikimedia.org/r/595540 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [09:42:14] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 78, down: 0, dormant: 0, excluded: 
0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:42:49] (03PS5) 10Elukey: Add analytics-product system user [puppet] - 10https://gerrit.wikimedia.org/r/595540 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [09:43:30] (03PS14) 10DCausse: [wcqs] gui custom config [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles) [09:43:32] (03PS7) 10DCausse: query_service: Remove more hardcoding of wdqs [puppet] - 10https://gerrit.wikimedia.org/r/610401 (owner: 10Ebernhardson) [09:46:37] (03CR) 10Elukey: "@Moritz: When you have a moment, would you please re-review and tell me if anything is missing?" [puppet] - 10https://gerrit.wikimedia.org/r/595540 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [09:47:34] <_joe_> akosiaris: ok with moving two more? [09:47:42] <_joe_> still in eqiad, where most requests happen [09:48:07] <_joe_> there is some throttling going on I see [09:48:45] * akosiaris checking [09:49:52] so, the TLS throttling is probably irrelevant yet. But indeed the app seems to be throttled as well [09:50:09] _joe_: wanna do another 2 and let's see if it increases. If yes, I 'll increase a bit CPU usage [09:51:25] note though that limit is at 8CPUs [09:51:36] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one comment inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595540 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [09:51:41] (03PS15) 10DCausse: [wcqs] gui custom config [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles) [09:51:43] (03PS8) 10DCausse: query_service: Remove more hardcoding of wdqs [puppet] - 10https://gerrit.wikimedia.org/r/610401 (owner: 10Ebernhardson) [09:53:04] <_joe_> yes [09:53:24] <_joe_> !log restarting restbase on restbase1024,1023 [09:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:33] (03CR) 10Privacybatm: [C: 04-1] "I would say, Let's this patch be here, later I will make a new ticket and work on it. But in a different direction:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/610750 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [10:00:05] Mvolz: #bothumor My software never has bugs. It just develops random features. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200709T1000). [10:01:01] (03PS2) 10Mvolz: Update citoid to dcc45a42 [deployment-charts] - 10https://gerrit.wikimedia.org/r/607745 [10:01:43] _joe_: about halfway there I guess? 
[10:01:52] <_joe_> tes [10:01:54] <_joe_> *yes [10:02:17] <_joe_> we've restarted 6 servers, we have 7 more IIRC [10:02:40] <_joe_> and ofc all of codfw [10:04:09] (03CR) 10Mvolz: [C: 03+2] Update citoid to dcc45a42 [deployment-charts] - 10https://gerrit.wikimedia.org/r/607745 (owner: 10Mvolz) [10:05:19] (03Merged) 10jenkins-bot: Update citoid to dcc45a42 [deployment-charts] - 10https://gerrit.wikimedia.org/r/607745 (owner: 10Mvolz) [10:05:21] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:07:11] <_joe_> akosiaris: ok to continue or you want to spice up the deployment? [10:07:50] _joe_: continue I 'd say. It's a couple of ms of throttling [10:08:02] I am close to believing this is 512c999 [10:08:10] the usual issues with CFS artifacts [10:08:12] <_joe_> yeah [10:08:16] <_joe_> look at tlsproxy [10:08:31] RECOVERY - Check no envoy runtime configuration is left persistent on restbase1025 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:09:24] <_joe_> !log restarting restbase on rb1020-22 [10:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:38] (03PS16) 10Vgutierrez: ATS: Provide http to https redirection logic in lua [puppet] - 10https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) [10:12:28] seems like I need to update the latency buckets for chromium-render [10:12:39] they are capped to 1s, which is meh for this [10:13:13] so percentiles aren't useful right now. 
Overall some more love is needed for the statsd->prometheus stats thing, but that's easy [10:13:14] (03PS17) 10Vgutierrez: ATS: Provide http to https redirection logic in lua [puppet] - 10https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) [10:13:14] 10Operations, 10Analytics, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10elukey) [10:13:22] 10Operations, 10Analytics-Clusters, 10netops, 10Patch-For-Review: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10elukey) [10:13:33] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10Cmjohnson) {F31921902}. @Jclark-ctr TSR report is attached and will email [10:14:40] 10Operations, 10Analytics-Clusters, 10netops, 10Patch-For-Review: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10elukey) [10:15:29] (03PS18) 10Vgutierrez: ATS: Provide http to https redirection logic in lua [puppet] - 10https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) [10:17:13] (03PS19) 10Vgutierrez: ATS: Provide http to https redirection logic in lua [puppet] - 10https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) [10:18:08] _joe_, akosiaris: ok to deploy citoid now or should I wait until you're done restarting restbase? [10:18:20] <_joe_> mvolz: go on [10:18:25] mvolz: go for it [10:18:32] ok thanks :) [10:19:03] <_joe_> we're just switching traffic, it should be zero-impact [10:19:05] _joe_: do all the rest I 'd say [10:19:12] <_joe_> akosiaris: ack [10:20:19] (03CR) 10Elukey: Add analytics-product system user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595540 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [10:22:10] !log mvolz@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'citoid' for release 'staging' . 
[10:22:11] _joe_: latencies btw are on the same range. 3s for the old infra, 3s avg for k8s [10:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:17] so we are looking pretty good [10:22:57] <_joe_> great [10:23:19] <_joe_> !log rolling restart the remaining restbases in eqiad, and all of codfw [10:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:19] 10Operations, 10Analytics-Clusters, 10Analytics-Radar, 10observability: Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10fgiunchedi) I copied `prometheus-burrow-exporter` to `buster-wikimedia` and tested in a WMCS instance. puppet runs successfully and burrow from Buster... [10:28:59] _joe_: lemme know when done [10:29:10] <_joe_> it will take some time [10:29:29] <_joe_> my cumin incantation also runs puppet, and runs on 2 hosts concurrently only [10:29:43] <_joe_> also restart-restbase won't last until the server is back in the pooil [10:31:02] ok [10:31:18] <_joe_> we're halfway through [10:31:26] <_joe_> and I do see requests incoming in codfw [10:31:38] <_joe_> now we should test what happens if we failover to a single dc [10:31:51] * _joe_ wants an horizontal autoscaler [10:32:25] we aren't that far from that by now [10:34:54] !log mvolz@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'citoid' for release 'production' . 
[10:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:26] <_joe_> akosiaris: envoy's view https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?panelId=6&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=restbase&var-origin_instance=All&var-destination=proton [10:37:03] _joe_: ahhhh that's pretty useful for the buckets I 'll need [10:37:13] capped at 30s is nice [10:38:02] we should be using an arithmetic progression but it might not make sense in this case [10:39:12] _joe_: on the move, bbiab [10:40:15] <_joe_> akosiaris: we're done btw [10:41:15] PROBLEM - Thanos compact has not run on icinga1001 is CRITICAL: 4.429e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [10:44:09] akosiaris: so I just deployed on codfw, not eqiad yet, I think there might be trouble with the logs? [10:45:22] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:47:53] <_joe_> we're having a peak of api errors right now [10:49:07] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:51:05] (03PS1) 10Vgutierrez: ATS: Disable res_track_memory in cp1085 [puppet] - 10https://gerrit.wikimedia.org/r/610764 (https://phabricator.wikimedia.org/T249335) [10:52:45] eh probably unrelated then [10:54:14] !log mvolz@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'citoid' for release 'production' . 
[10:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:20] <_joe_> !log depool mw1282 [10:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:30] <_joe_> !log restarting php7.2-fpm on mw1282, workers failing with sigill [10:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:25] (03PS2) 10Vgutierrez: ATS: Disable res_track_memory in cp1085 [puppet] - 10https://gerrit.wikimedia.org/r/610764 (https://phabricator.wikimedia.org/T249335) [10:56:51] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/23793/" [puppet] - 10https://gerrit.wikimedia.org/r/610764 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [10:58:35] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:58:51] (03PS3) 10Gergő Tisza: Varnish: Include request ID in Set-Cookie warning [puppet] - 10https://gerrit.wikimedia.org/r/608709 (https://phabricator.wikimedia.org/T256395) [10:59:02] (03CR) 10Ema: [C: 03+1] ATS: Disable res_track_memory in cp1085 [puppet] - 10https://gerrit.wikimedia.org/r/610764 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [10:59:19] (03CR) 10Vgutierrez: [C: 03+2] ATS: Disable res_track_memory in cp1085 [puppet] - 10https://gerrit.wikimedia.org/r/610764 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a European mid-day backport window(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200709T1100). 
[11:00:04] JDrewniak: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:18] (03PS1) 10Filippo Giunchedi: thanos: set concurrency to 1 [puppet] - 10https://gerrit.wikimedia.org/r/610768 (https://phabricator.wikimedia.org/T252186) [11:01:07] <_joe_> the exceptions in mw are jobqueue again [11:01:15] <_joe_> I'll look later [11:01:24] !log restart ats-tls on cp1085 [11:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:41] RECOVERY - Thanos compact has not run on icinga1001 is OK: (C)24 ge (W)12 ge 0.006597 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [11:01:44] o/ anyone around to do a config swat? [11:02:17] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:02:41] jan_drewniak: I can do it if you want :) [11:03:15] Urbanecm: much appreciated [11:04:11] (03CR) 10Urbanecm: [C: 03+2] Enable Quicksurveys for Desktop Improvements Project. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609850 (https://phabricator.wikimedia.org/T246977) (owner: 10Jdrewniak) [11:04:37] (03Merged) 10jenkins-bot: Enable Quicksurveys for Desktop Improvements Project. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609850 (https://phabricator.wikimedia.org/T246977) (owner: 10Jdrewniak) [11:05:06] jan_drewniak: available at mwdebug1001 for testing [11:06:29] (03CR) 10Jcrespo: "Thank you, that is very useful to me, even if it is not intended to be deployed, because I can test in our use case. Just what I needed! 
I" [software/transferpy] - 10https://gerrit.wikimedia.org/r/610750 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [11:07:29] Urbanecm: Ok looks good to deploy! [11:07:39] thanks, syncing [11:07:51] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:08:20] _joe_: around again [11:08:53] (03CR) 10Urbanecm: [C: 03+2] "beta-only, noop for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610424 (https://phabricator.wikimedia.org/T246420) (owner: 10Nray) [11:09:10] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: e6f442c6900524482806aeb1b5162e65bf7c97ac: Enable Quicksurveys for Desktop Improvements Project (T246977) (duration: 01m 06s) [11:09:11] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_eventgate_main_cluster_eqiad,swagger_check_eventgate_main_http_cluster_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:09:14] jan_drewniak: the beta patch will arrive at beta within 30 minutes. If not, create a task :). [11:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:16] T246977: Run baseline quicksurvey on test wikis - https://phabricator.wikimedia.org/T246977 [11:09:33] (03Merged) 10jenkins-bot: Enable limited-width layout for "Latest Vector" on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610424 (https://phabricator.wikimedia.org/T246420) (owner: 10Nray) [11:10:00] jan_drewniak: anything else I can do for you? 
🙂 [11:10:14] 10Operations, 10Analytics-Clusters, 10Analytics-Radar, 10observability: Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10fgiunchedi) Burrow + exporter seems to be working as expected when connected to real kafka/zk, we are good to go with production VMs [11:10:17] Urbanecm: that's all for today, thank you! [11:10:25] happy to help! [11:10:27] Urbanecm can I add https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/610103 to the BACON? [11:10:40] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [11:10:41] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:03] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:11:32] depends on akosiaris and _joe; just saw the MW exception alert [11:11:37] _joe_: akosiaris : Can I continue? [11:11:38] _joe_: there is something peculiar. I don't see any other requests than a4: https://grafana.wikimedia.org/d/llIEd7MMz/proton?panelId=58&edit&fullscreen&orgId=1&from=now-3h&to=now&var-dc=eqiad%20prometheus%2Fk8s&var-service=chromium-render&var-namespace=proton [11:12:09] Urbanecm: looking [11:12:34] I think you are good to go. https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&from=now-30m&to=now [11:12:59] thanks akosiaris [11:13:03] yw [11:13:06] DannyS712|away: in that case, sure :). Add it to the calendar please [11:13:25] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:13:44] (03CR) 10Urbanecm: [C: 03+2] "B&C" [extensions/Translate] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/610103 (https://phabricator.wikimedia.org/T257531) (owner: 10DannyS712) [11:14:02] added [11:14:19] thanks, I'll ping you once it's ready for testing [11:15:03] well, I'm not sure how the deprecation alerts were triggered (i.e. what action I should take to reproduce) and its just deprecation alerts showing up in logspam, so I can't really test edit [11:15:36] ok [11:15:59] (03PS6) 10Hnowlan: api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [11:16:48] (03CR) 10Hnowlan: api-gateway: Basic envoy chart WIP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [11:16:57] (03CR) 10jerkins-bot: [V: 04-1] api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [11:17:59] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [11:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:05] I did create about 10 artificial "legal" instead of "a4" type requests to proton and it worked fine and are reported now, so for some reason we aren't receiving those yet? 
[11:19:48] (03PS1) 10Urbanecm: Add *.oireachtas.ie to the wgCopyUploadsDomains whitelist for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610772 (https://phabricator.wikimedia.org/T256543) [11:19:59] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:32] same goes for "letter" [11:21:28] (03PS1) 10Urbanecm: Rename namespace on kn.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610773 (https://phabricator.wikimedia.org/T255337) [11:21:41] (03CR) 10Urbanecm: [C: 03+2] Add *.oireachtas.ie to the wgCopyUploadsDomains whitelist for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610772 (https://phabricator.wikimedia.org/T256543) (owner: 10Urbanecm) [11:22:32] (03Merged) 10jenkins-bot: Add *.oireachtas.ie to the wgCopyUploadsDomains whitelist for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610772 (https://phabricator.wikimedia.org/T256543) (owner: 10Urbanecm) [11:23:55] I am wondering whether all those are just health checks though [11:24:11] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 0a3c1f94a702b527842ed4f34d8bf41b26235e64: Add *.oireachtas.ie to the wgCopyUploadsDomains whitelist for commonswiki (T256543) (duration: 01m 04s) [11:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:17] T256543: Add data.oireachtas.ie to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T256543 [11:24:43] (03PS2) 10Urbanecm: Rename namespace on kn.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610773 (https://phabricator.wikimedia.org/T255337) [11:24:47] (03CR) 10Urbanecm: [C: 03+2] Rename namespace on kn.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610773 (https://phabricator.wikimedia.org/T255337) (owner: 10Urbanecm) [11:25:36] 
(03Merged) 10jenkins-bot: Rename namespace on kn.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610773 (https://phabricator.wikimedia.org/T255337) (owner: 10Urbanecm) [11:26:24] 10Operations, 10vm-requests: eqiad: 1 VM for kafkamon - https://phabricator.wikimedia.org/T257560 (10fgiunchedi) [11:27:18] 10Operations, 10vm-requests: codfw: 1 VM for kafkamon - https://phabricator.wikimedia.org/T257561 (10fgiunchedi) [11:28:04] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 3a7c1c33e58637437f819edf039008a00dc5be27: Rename namespace on kn.wikipedia.org (T255337) (duration: 01m 04s) [11:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:09] T255337: Rename namespace on kn.wikipedia.org - https://phabricator.wikimedia.org/T255337 [11:29:07] yeah, it looks like the other stuff for proton is health checks, I 'll see what I can do to move those to the new endpoint as well [11:31:21] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mooeypoo - https://phabricator.wikimedia.org/T257502 (10jcrespo) [11:31:25] (03PS1) 10Arturo Borrero Gonzalez: cloud: network nodes: explicitly load nf_conntrack module on boot [puppet] - 10https://gerrit.wikimedia.org/r/610775 (https://phabricator.wikimedia.org/T257552) [11:34:38] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mooeypoo - https://phabricator.wikimedia.org/T257502 (10jcrespo) As an existing user with existing production access, the only needed thing is @Nuria's aproval for access to analytics infrastructure, as service owner. @M... 
[11:34:47] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mooeypoo - https://phabricator.wikimedia.org/T257502 (10jcrespo) p:05Triage→03Medium [11:35:06] (03CR) 10Arturo Borrero Gonzalez: "PCC https://puppet-compiler.wmflabs.org/compiler1002/23794/" [puppet] - 10https://gerrit.wikimedia.org/r/610775 (https://phabricator.wikimedia.org/T257552) (owner: 10Arturo Borrero Gonzalez) [11:35:08] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mooeypoo - https://phabricator.wikimedia.org/T257502 (10jcrespo) a:03Nuria [11:36:39] (03Merged) 10jenkins-bot: DeprecatablePropertyArray: Use MW_VERSION instead of array_key_exists [extensions/Translate] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/610103 (https://phabricator.wikimedia.org/T257531) (owner: 10DannyS712) [11:40:11] DannyS712|away: I'm going to sync that [11:40:33] ty [11:40:39] let me know if there are any issues [11:41:22] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, there might be rare remaining races (as discussed on IRC), but it certainly improves the status quo a lot." 
[puppet] - 10https://gerrit.wikimedia.org/r/610775 (https://phabricator.wikimedia.org/T257552) (owner: 10Arturo Borrero Gonzalez) [11:41:25] sure [11:42:24] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.40/extensions/Translate/tag/SpecialPageTranslation.php: 6541d3ff51f52fe8a1bdbfa86022f8d97d6c7680: DeprecatablePropertyArray: Use MW_VERSION instead of array_key_exists (T257531) (duration: 01m 05s) [11:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:30] T257531: Cleanup forgotten usages of Revision - https://phabricator.wikimedia.org/T257531 [11:42:44] DannyS712|away: seems it went fine [11:44:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: network nodes: explicitly load nf_conntrack module on boot [puppet] - 10https://gerrit.wikimedia.org/r/610775 (https://phabricator.wikimedia.org/T257552) (owner: 10Arturo Borrero Gonzalez) [11:50:00] !log rebooting debmonitor1001 for kernel update [11:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:13] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [11:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:15] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:09] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [11:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:10] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:31] (03PS6) 10ZPapierski: Correct url and path for nginx OAuth 1.0a [puppet] - 10https://gerrit.wikimedia.org/r/609909 (https://phabricator.wikimedia.org/T251498) [12:02:33] (03PS3) 10Jcrespo: admin: Add Jgiannelos production access [puppet] - 10https://gerrit.wikimedia.org/r/609752 
(https://phabricator.wikimedia.org/T257187) [12:03:05] (03CR) 10jerkins-bot: [V: 04-1] admin: Add Jgiannelos production access [puppet] - 10https://gerrit.wikimedia.org/r/609752 (https://phabricator.wikimedia.org/T257187) (owner: 10Jcrespo) [12:03:50] (03CR) 10Jcrespo: [C: 04-1] "typo" [puppet] - 10https://gerrit.wikimedia.org/r/609752 (https://phabricator.wikimedia.org/T257187) (owner: 10Jcrespo) [12:05:01] (03PS4) 10Jcrespo: admin: Add Jgiannelos production access [puppet] - 10https://gerrit.wikimedia.org/r/609752 (https://phabricator.wikimedia.org/T257187) [12:11:37] !log enable asw2-b-eqiad:ae3 (to cloudsw1-c8) - T251632 [12:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:43] T251632: (Need By: 2020-06-12) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 [12:22:06] !log rebooting dbmonitor1001 / tendril.wikimedia.org for kernel update [12:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:17] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [12:22:18] (03PS7) 10Hnowlan: api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [12:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:32] (03CR) 10jerkins-bot: [V: 04-1] api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [12:23:02] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) [12:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:24] (03PS1) 10Alexandros Kosiaris: proton: Set LVS level OpenAPI checks on TLS [puppet] - 10https://gerrit.wikimedia.org/r/610789 (https://phabricator.wikimedia.org/T225680) [12:27:35] (03CR) 10Alexandros Kosiaris: [C: 03+2] proton: Set LVS level OpenAPI checks on TLS 
[puppet] - 10https://gerrit.wikimedia.org/r/610789 (https://phabricator.wikimedia.org/T225680) (owner: 10Alexandros Kosiaris) [12:27:39] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:29:30] ^ should recover shortly. Looks like a spike [12:29:34] 10Operations, 10SRE-tools: More structured cookbooks to reboot hosts - https://phabricator.wikimedia.org/T252807 (10MoritzMuehlenhoff) For posterity, one traceback triggered by some terminal sequence magic in Puppet's "Ruby 2.1 is deprecated" warning: ` Exception raised while executing cookbook sre.hosts.reb... [12:31:24] (03PS8) 10Hnowlan: api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [12:31:43] (03CR) 10jerkins-bot: [V: 04-1] api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [12:31:46] (03CR) 10ZPapierski: [C: 04-1] query_service: Remove more hardcoding of wdqs [puppet] - 10https://gerrit.wikimedia.org/r/610401 (owner: 10Ebernhardson) [12:33:59] (03PS16) 10DCausse: [wcqs] gui custom config [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles) [12:34:01] (03PS9) 10DCausse: query_service: Remove more hardcoding of wdqs [puppet] - 10https://gerrit.wikimedia.org/r/610401 (owner: 10Ebernhardson) [12:36:15] (03PS9) 10Hnowlan: api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [12:37:17] (03CR) 10jerkins-bot: [V: 04-1] api-gateway: Basic envoy chart WIP [deployment-charts] - 
10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [12:38:45] !log rebooting urldownloader1001/2001 for kernel update (failed over, these are now the inactive ones) [12:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:56] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [12:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:39] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:55] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [12:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:02] (03PS1) 10Jdrewniak: Updating config for Readers Web affinity quicksurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610797 (https://phabricator.wikimedia.org/T246977) [12:43:39] (03PS1) 10Alexandros Kosiaris: proton: Bump statsd-exporter to 0.0.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/610799 [12:43:41] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:50] (03CR) 10Muehlenhoff: [C: 03+1] admin: Add Jgiannelos production access [puppet] - 10https://gerrit.wikimedia.org/r/609752 (https://phabricator.wikimedia.org/T257187) (owner: 10Jcrespo) [12:45:06] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:47:50] (03PS1) 10Alexandros Kosiaris: admin: Bump staging quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/610800 [12:50:59] (03PS10) 10Hnowlan: api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [12:51:00] _joe_: proton seems pretty ok. I 'll call it a success [12:51:40] adding some finishing touches, but otherwise 🙌 [12:51:40] <_joe_> great [12:51:56] (03CR) 10jerkins-bot: [V: 04-1] api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [12:52:07] (03PS1) 10Ema: VTC: make 34-pass-set-cookie.vtc more explicit [puppet] - 10https://gerrit.wikimedia.org/r/610802 (https://phabricator.wikimedia.org/T256395) [12:52:13] next week: mobileapps [12:52:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] proton: Bump statsd-exporter to 0.0.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/610799 (owner: 10Alexandros Kosiaris) [12:52:46] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Bump staging quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/610800 (owner: 10Alexandros Kosiaris) [12:53:52] (03Merged) 10jenkins-bot: proton: Bump statsd-exporter to 0.0.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/610799 (owner: 10Alexandros Kosiaris) [12:53:56] (03Merged) 10jenkins-bot: admin: Bump staging quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/610800 (owner: 10Alexandros Kosiaris) [12:54:08] !log rebooting install* servers for kernel security update [12:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:17] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [12:54:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] ncredir: Update mappings for pywikibot redirects [puppet] - 10https://gerrit.wikimedia.org/r/610548 (https://phabricator.wikimedia.org/T234617) (owner: 10BryanDavis) [12:56:09] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'proton' for release 'production' . [12:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:42] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'proton' for release 'production' . [12:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:00] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:13] (03PS7) 10ZPapierski: Correct url and path for nginx OAuth 1.0a [puppet] - 10https://gerrit.wikimedia.org/r/609909 (https://phabricator.wikimedia.org/T251498) [12:57:45] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'proton' for release 'production' . [12:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:05] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [12:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] twentyafterfour and James_F: How many deployers does it take to do Mediawiki train - American+European Version (secondary timeslot) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200709T1300). [13:00:24] No train deployment planned; group2 will go out in the US slot as normal. 
[13:00:52] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:20] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:45] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:35] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:52] (03PS11) 10Hnowlan: api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [13:10:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1089', diff saved to https://phabricator.wikimedia.org/P11830 and previous config saved to /var/cache/conftool/dbconfig/20200709-131039-marostegui.json [13:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:57] (03CR) 10jerkins-bot: [V: 04-1] api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [13:11:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:46] (03CR) 10ZPapierski: [wcqs] gui custom config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles) [13:14:02] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10Pcoombe) @MBeat33 You can just point them directly to the following link, it will 
appear blank but does set the... [13:15:48] !log installing ffmpeg security updates [13:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:14] 10Operations, 10Traffic, 10netops: Remove multicast - https://phabricator.wikimedia.org/T257573 (10ayounsi) p:05Triage→03Low [13:16:23] (03PS17) 10DCausse: [wcqs] gui custom config [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles) [13:16:25] (03PS10) 10DCausse: query_service: Remove more hardcoding of wdqs [puppet] - 10https://gerrit.wikimedia.org/r/610401 (owner: 10Ebernhardson) [13:16:27] (03CR) 10DCausse: [wcqs] gui custom config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles) [13:16:56] (03PS8) 10ZPapierski: Correct url and path for nginx OAuth 1.0a [puppet] - 10https://gerrit.wikimedia.org/r/609909 (https://phabricator.wikimedia.org/T251498) [13:19:20] (03PS12) 10Hnowlan: api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [13:22:06] (03CR) 10Ema: [C: 03+2] VTC: make 34-pass-set-cookie.vtc more explicit [puppet] - 10https://gerrit.wikimedia.org/r/610802 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [13:23:21] (03PS1) 10Hnowlan: scaffold: add network boilerplate to scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/610813 [13:29:06] !log rebooting puppetboard1001 (puppetboard.wikimedia.org) for kernel update [13:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:09] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:11] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:31:34] 10Operations, 10Traffic, 10netops: Remove multicast - https://phabricator.wikimedia.org/T257573 (10ayounsi) [13:31:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1089', diff saved to https://phabricator.wikimedia.org/P11831 and previous config saved to /var/cache/conftool/dbconfig/20200709-133134-marostegui.json [13:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:11] 10Operations, 10Epic, 10Goal: automatically collect network error reports from users' browsers - https://phabricator.wikimedia.org/T257527 (10Ottomata) > e.g. it might be useful to also have them in their own table in Hive. If we do all the EventGate stuff right, importing them into Hive will be natural (and... [13:35:13] (03PS1) 10ArielGlenn: Add commons structured data dumps to the webpage! [puppet] - 10https://gerrit.wikimedia.org/r/610816 (https://phabricator.wikimedia.org/T221917) [13:35:16] 10Operations, 10Analytics-Clusters, 10netops, 10Patch-For-Review: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10Ottomata) Sure! [13:36:16] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/610816 (https://phabricator.wikimedia.org/T221917) (owner: 10ArielGlenn) [13:36:46] (03CR) 10ArielGlenn: [C: 03+2] Add commons structured data dumps to the webpage! 
[puppet] - 10https://gerrit.wikimedia.org/r/610816 (https://phabricator.wikimedia.org/T221917) (owner: 10ArielGlenn) [13:45:01] !log installing gnutls28 security updates [13:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:12] (03PS1) 10Ayounsi: Reports, add new cloudsw role [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/610820 (https://phabricator.wikimedia.org/T251632) [13:50:24] (03CR) 10Ppchelko: [C: 04-1] "Some nits and questions" (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [13:53:27] (03PS1) 10Vgutierrez: ATS: Support listening on multiple ports [puppet] - 10https://gerrit.wikimedia.org/r/610821 (https://phabricator.wikimedia.org/T254235) [13:54:39] (03CR) 10jerkins-bot: [V: 04-1] ATS: Support listening on multiple ports [puppet] - 10https://gerrit.wikimedia.org/r/610821 (https://phabricator.wikimedia.org/T254235) (owner: 10Vgutierrez) [13:56:00] (03PS2) 10Vgutierrez: ATS: Support listening on multiple ports [puppet] - 10https://gerrit.wikimedia.org/r/610821 (https://phabricator.wikimedia.org/T254235) [13:57:13] (03CR) 10jerkins-bot: [V: 04-1] ATS: Support listening on multiple ports [puppet] - 10https://gerrit.wikimedia.org/r/610821 (https://phabricator.wikimedia.org/T254235) (owner: 10Vgutierrez) [13:57:56] (03PS4) 10Andrew Bogott: Prometheus: gather db stats from wmcs galera db hosts [puppet] - 10https://gerrit.wikimedia.org/r/610420 [13:58:22] (03PS3) 10Vgutierrez: ATS: Support listening on multiple ports [puppet] - 10https://gerrit.wikimedia.org/r/610821 (https://phabricator.wikimedia.org/T254235) [13:59:08] (03CR) 10jerkins-bot: [V: 04-1] Prometheus: gather db stats from wmcs galera db hosts [puppet] - 10https://gerrit.wikimedia.org/r/610420 (owner: 10Andrew Bogott) [14:01:16] (03PS5) 10Andrew Bogott: Prometheus: gather db stats from wmcs galera db hosts [puppet] - 10https://gerrit.wikimedia.org/r/610420 
[14:02:29] (03CR) 10jerkins-bot: [V: 04-1] Prometheus: gather db stats from wmcs galera db hosts [puppet] - 10https://gerrit.wikimedia.org/r/610420 (owner: 10Andrew Bogott) [14:02:36] !loh installing libtirpc security updates [14:03:42] !log installing libtirpc security updates [14:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:38] (03PS1) 10Filippo Giunchedi: role: fold weblog into centrallog [puppet] - 10https://gerrit.wikimedia.org/r/610832 (https://phabricator.wikimedia.org/T247968) [14:12:46] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1003/23800/" [puppet] - 10https://gerrit.wikimedia.org/r/610832 (https://phabricator.wikimedia.org/T247968) (owner: 10Filippo Giunchedi) [14:13:18] (03CR) 10Filippo Giunchedi: "I've tested the webrequest::ops role in Buster/WMCS and works as expected" [puppet] - 10https://gerrit.wikimedia.org/r/610832 (https://phabricator.wikimedia.org/T247968) (owner: 10Filippo Giunchedi) [14:17:29] (03PS4) 10Vgutierrez: ATS: Support listening on multiple ports [puppet] - 10https://gerrit.wikimedia.org/r/610821 (https://phabricator.wikimedia.org/T254235) [14:29:13] !log replacing msw-b1,b2,b3 and b4 [14:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:02] (03CR) 10Jcrespo: transferpy: Generate checksum parallel to the data transfer (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [14:36:50] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/610832 (https://phabricator.wikimedia.org/T247968) (owner: 10Filippo Giunchedi) [14:37:59] (03CR) 10Cwhite: [C: 03+1] thanos: set concurrency to 1 [puppet] - 10https://gerrit.wikimedia.org/r/610768 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [14:42:45] (03CR) 10MSantos: "Is it possible to apply this change only for the eqiad machine that will be 
depooled for OSM replication fix? I guess this is overthinking" [puppet] - 10https://gerrit.wikimedia.org/r/610706 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [14:50:50] 10Operations, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10hashar) [14:58:03] (03CR) 10Herron: [C: 03+1] thanos: set concurrency to 1 [puppet] - 10https://gerrit.wikimedia.org/r/610768 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [14:58:04] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10MBeat33) Thank you @Pcoombe this is great. Does the error message that I saw when testing affect the setting of... [15:01:56] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: set concurrency to 1 [puppet] - 10https://gerrit.wikimedia.org/r/610768 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [15:03:18] (03PS5) 10Vgutierrez: ATS: Support listening on multiple ports [puppet] - 10https://gerrit.wikimedia.org/r/610821 (https://phabricator.wikimedia.org/T254235) [15:06:18] (03PS6) 10Andrew Bogott: Prometheus: gather db stats from wmcs galera db hosts [puppet] - 10https://gerrit.wikimedia.org/r/610420 [15:06:47] (03PS1) 10JMeybohm: Add chartmuseum module to interact with the API [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/610849 (https://phabricator.wikimedia.org/T257333) [15:08:30] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Oops, good catch. Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/610813 (owner: 10Hnowlan) [15:08:32] (03CR) 10jerkins-bot: [V: 04-1] Add chartmuseum module to interact with the API [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/610849 (https://phabricator.wikimedia.org/T257333) (owner: 10JMeybohm) [15:09:12] (03Merged) 10jenkins-bot: scaffold: add network boilerplate to scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/610813 (owner: 10Hnowlan) [15:09:16] (03PS1) 10RLazarus: Revert "Disable IPv6 tests, which fail in the build container." [debs/envoyproxy] - 10https://gerrit.wikimedia.org/r/610850 [15:10:37] (03CR) 10Herron: [C: 03+1] "Great idea!" [puppet] - 10https://gerrit.wikimedia.org/r/610832 (https://phabricator.wikimedia.org/T247968) (owner: 10Filippo Giunchedi) [15:13:34] (03CR) 10Muehlenhoff: "Not easily, no. But when I deploy this, I'll disable Puppet on maps1001 and validate it on maps2001 first (and the validation will show im" [puppet] - 10https://gerrit.wikimedia.org/r/610706 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [15:14:27] 10Operations, 10Phabricator: Can't access phabricator from my server - https://phabricator.wikimedia.org/T257507 (10greg) [15:14:53] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/610832 (https://phabricator.wikimedia.org/T247968) (owner: 10Filippo Giunchedi) [15:19:24] 10Operations, 10Fundraising-Backlog, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10Jgreen) Checking back on this it looks like https://links.e.uso.org has slipped back to a B rating because they haven't ceased TLS 1.0/1.1 support. I... 
[15:19:51] 10Operations, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10Jgreen) [15:24:25] (03PS1) 10Alexandros Kosiaris: proton: Amend prometheus-statsd config [deployment-charts] - 10https://gerrit.wikimedia.org/r/610855 (https://phabricator.wikimedia.org/T225680) [15:26:31] (03CR) 10RLazarus: [C: 03+2] Revert "Disable IPv6 tests, which fail in the build container." [debs/envoyproxy] - 10https://gerrit.wikimedia.org/r/610850 (owner: 10RLazarus) [15:27:56] PROBLEM - Host mw2320.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:28:50] RECOVERY - Host mw2320.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.60 ms [15:29:38] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:50] PROBLEM - Host cloudvirt2002-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:30:59] 10Operations, 10CAS-SSO, 10User-jbond: mod_auth_cas segfaulting on netmon - https://phabricator.wikimedia.org/T257587 (10jbond) p:05Triage→03Medium [15:31:24] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/609796 (https://phabricator.wikimedia.org/T257219) (owner: 10Jbond) [15:32:02] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:34:59] (03PS2) 10JMeybohm: Add basic chartmuseum library and helm-chartctl CLI [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/610849 (https://phabricator.wikimedia.org/T257333) [15:35:46] RECOVERY - 
Host cloudvirt2002-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.76 ms [15:36:12] (03CR) 10jerkins-bot: [V: 04-1] Add basic chartmuseum library and helm-chartctl CLI [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/610849 (https://phabricator.wikimedia.org/T257333) (owner: 10JMeybohm) [15:37:44] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:47:16] 10Operations, 10Wikimedia-Mailing-lists, 10Cloud-VPS (Project-requests), 10cloud-services-team (Kanban): Request creation of mailman VPS project - https://phabricator.wikimedia.org/T257270 (10Andrew) a:03bd808 approved! Bryan will take care of this shortly [15:54:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:58:08] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:00:04] godog and _joe_: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200709T1600). [16:00:04] tgr: A patch you scheduled for Puppet request window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:17] o/ [16:00:39] (03PS1) 10Cwhite: debianization [debs/grafana-loki] (debian/sid) - 10https://gerrit.wikimedia.org/r/610864 [16:00:49] <_joe_> tgr: so I'm looking at the first patch [16:01:33] <_joe_> and I am dubious, you're adding stuff to a cookie without colons? [16:01:37] <_joe_> err semicolons [16:01:57] <_joe_> cdanis: ^^ [16:02:06] this is a log entry [16:02:16] <_joe_> oh ok [16:02:24] <_joe_> the new gerrit unified view didn't help [16:02:31] _joe_: yeah you're talking about the (admittedly very long) std.log / std.syslog lines, right? [16:02:38] <_joe_> yes [16:02:48] <_joe_> it wasn't clear from the diff immediately [16:03:26] <_joe_> tgr: is it ok if we just merge it and let puppet run at its pace, or you need the logging immediately? [16:03:34] it's not urgent [16:03:46] pace is about 30 minutes, right? [16:03:50] 10Operations, 10ops-eqsin, 10netops: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154 (10RobH) Delivery has occurred, but it seems SG3 requires a smart hands ticket to deliver to our rack (they disregard the deliver to cage checkbox on incoming shipments for SG3). I've entered new smart... [16:03:50] <_joe_> yes [16:04:08] <_joe_> between 0 and 30 minutes [16:04:19] yeah, that's fine [16:04:19] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Varnish: Include request ID in Set-Cookie warning [puppet] - 10https://gerrit.wikimedia.org/r/608709 (https://phabricator.wikimedia.org/T256395) (owner: 10Gergő Tisza) [16:04:42] (03PS7) 10Andrew Bogott: Prometheus: gather db stats from wmcs galera db hosts [puppet] - 10https://gerrit.wikimedia.org/r/610420 [16:07:22] <_joe_> ok verified this work on one server [16:07:27] <_joe_> tgr: up to the next patch [16:07:39] thanks! 
[16:08:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] GrowthExperiments: shorten welcome survey retention window [puppet] - 10https://gerrit.wikimedia.org/r/609226 (https://phabricator.wikimedia.org/T252575) (owner: 10Gergő Tisza) [16:08:59] (03PS8) 10Andrew Bogott: Prometheus: gather db stats from wmcs galera db hosts [puppet] - 10https://gerrit.wikimedia.org/r/610420 [16:10:53] <_joe_> tgr: {{done}} [16:11:01] thanks _joe_ ! [16:16:42] (03PS3) 10Ema: ATS: override Cache-Control for Set-Cookie responses [puppet] - 10https://gerrit.wikimedia.org/r/608017 (https://phabricator.wikimedia.org/T256395) [16:17:57] (03CR) 10jerkins-bot: [V: 04-1] ATS: override Cache-Control for Set-Cookie responses [puppet] - 10https://gerrit.wikimedia.org/r/608017 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [16:20:25] (03CR) 10Ottomata: [C: 03+1] role: fold weblog into centrallog [puppet] - 10https://gerrit.wikimedia.org/r/610832 (https://phabricator.wikimedia.org/T247968) (owner: 10Filippo Giunchedi) [16:21:24] (03CR) 10CDanis: [C: 03+1] ATS: override Cache-Control for Set-Cookie responses [puppet] - 10https://gerrit.wikimedia.org/r/608017 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [16:22:40] (03CR) 10Krinkle: ATS: override Cache-Control for Set-Cookie responses (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608017 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [16:24:07] (03CR) 10Krinkle: [C: 03+1] ATS: override Cache-Control for Set-Cookie responses [puppet] - 10https://gerrit.wikimedia.org/r/608017 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [16:26:22] (03CR) 10Herron: [C: 03+2] logstash: set v7 cluster to version 7.8 [puppet] - 10https://gerrit.wikimedia.org/r/610135 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [16:29:39] (03PS4) 10Ema: ATS: override Cache-Control for Set-Cookie responses [puppet] - 10https://gerrit.wikimedia.org/r/608017 (https://phabricator.wikimedia.org/T256395) [16:30:35] 
10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mooeypoo - https://phabricator.wikimedia.org/T257502 (10Nuria) @jcrespo She will need hadoop plus also being added to the wmf LDAP group. Approved on my end [16:30:54] (03CR) 10jerkins-bot: [V: 04-1] ATS: override Cache-Control for Set-Cookie responses [puppet] - 10https://gerrit.wikimedia.org/r/608017 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [16:33:38] (03CR) 10Gergő Tisza: ATS: override Cache-Control for Set-Cookie responses (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608017 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [16:34:35] (03CR) 10Gergő Tisza: ATS: override Cache-Control for Set-Cookie responses (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608017 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [16:35:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:37:16] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:45:29] (03PS3) 10JMeybohm: Add basic chartmuseum library and helm-chartctl CLI [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/610849 (https://phabricator.wikimedia.org/T257333) [16:45:31] (03PS1) 10JMeybohm: Drop support for python3.5 [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/610871 [16:49:41] (03CR) 10Andrew Bogott: "I don't quite know what to expect, but here is the pcc output: https://puppet-compiler.wmflabs.org/compiler1002/23806/prometheus1003.eqia" [puppet] - 10https://gerrit.wikimedia.org/r/610420 (owner: 10Andrew Bogott) [16:51:09] (03PS3) 10Krinkle: Enable wgForceHTTPS and wgCookieSameSite='None' (Beta Cluster) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610127 (https://phabricator.wikimedia.org/T256095) [16:53:39] (03CR) 10Krinkle: [C: 03+2] Enable wgForceHTTPS and wgCookieSameSite='None' (Beta Cluster) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610127 (https://phabricator.wikimedia.org/T256095) (owner: 10Krinkle) [16:54:30] (03Merged) 10jenkins-bot: Enable wgForceHTTPS and wgCookieSameSite='None' (Beta Cluster) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610127 (https://phabricator.wikimedia.org/T256095) (owner: 10Krinkle) [16:54:45] * Krinkle staging on mwdebug1002 [16:55:24] (03PS4) 10JMeybohm: Add basic chartmuseum library and helm-chartctl CLI [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/610849 (https://phabricator.wikimedia.org/T257333) [16:56:51] (03PS2) 10Dzahn: releases: remove duplicate rsync code from blubber and parsoid classes [puppet] - 10https://gerrit.wikimedia.org/r/610402 [16:59:30] (03PS2) 10Jdrewniak: Updating config for Readers Web affinity quicksurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610797 (https://phabricator.wikimedia.org/T246977) 
[17:00:04] halfak and accraze: #bothumor I ๏ฟฝ Unicode. All rise for Services โ€“ Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200709T1700). [17:02:33] (03CR) 10Jdlrobson: Enable Quicksurveys for Desktop Improvements Project. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609850 (https://phabricator.wikimedia.org/T246977) (owner: 10Jdrewniak) [17:04:01] (03PS5) 10Ema: ATS: override Cache-Control for Set-Cookie responses [puppet] - 10https://gerrit.wikimedia.org/r/608017 (https://phabricator.wikimedia.org/T256395) [17:05:14] (03CR) 10jerkins-bot: [V: 04-1] ATS: override Cache-Control for Set-Cookie responses [puppet] - 10https://gerrit.wikimedia.org/r/608017 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [17:06:00] (03CR) 10Ema: ATS: override Cache-Control for Set-Cookie responses (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/608017 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [17:06:07] (03PS2) 10CRusnov: mgmt: netbox-generated data for frack mgmt codfw [dns] - 10https://gerrit.wikimedia.org/r/607741 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [17:06:09] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/23807/releases1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/610402 (owner: 10Dzahn) [17:09:33] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Ia2f5eddbf2aad2 (duration: 01m 05s) [17:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:41] (03PS3) 10Dzahn: rsync::quickdatacopy: add optional parameter to let rsync --delete files [puppet] - 10https://gerrit.wikimedia.org/r/610389 (https://phabricator.wikimedia.org/T247652) [17:10:33] 10Operations, 10SRE-Access-Requests: Requesting access to centralauth database for Jennifer Wang - https://phabricator.wikimedia.org/T255836 (10jwang) 05Openโ†’03Resolved a:03jwang @jcrespo With your suggested 
method, I can access to centralauth now. Thank you! [17:10:47] !log krinkle@deploy1001 Synchronized wmf-config/: Ia2f5eddbf2aad2 (duration: 01m 04s) [17:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:26] (03PS6) 10Ema: ATS: override Cache-Control for Set-Cookie responses [puppet] - 10https://gerrit.wikimedia.org/r/608017 (https://phabricator.wikimedia.org/T256395) [17:16:58] (03PS2) 10Thcipriani: Remove unused `scap swat` command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609187 (https://phabricator.wikimedia.org/T254787) (owner: 10Ahmon Dancy) [17:19:20] (03CR) 10Thcipriani: [C: 03+2] Remove unused `scap swat` command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609187 (https://phabricator.wikimedia.org/T254787) (owner: 10Ahmon Dancy) [17:20:08] (03Merged) 10jenkins-bot: Remove unused `scap swat` command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609187 (https://phabricator.wikimedia.org/T254787) (owner: 10Ahmon Dancy) [17:20:23] thcipriani: OK if I sling out https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/607740 ? [17:20:39] James_F: just one sec, I'll ping you when we're out of the wy [17:20:41] *way [17:20:48] No rush/. [17:20:54] (03CR) 10Ema: [C: 03+2] ATS: override Cache-Control for Set-Cookie responses [puppet] - 10https://gerrit.wikimedia.org/r/608017 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [17:22:49] James_F: all yours [17:23:14] Excellent. [17:25:52] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10Pcoombe) @MBeat33 That's normal for Firefox. Some other browsers will also show a broken image symbol. However t... 
[17:26:00] (03CR) 10CRusnov: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/610820 (https://phabricator.wikimedia.org/T251632) (owner: 10Ayounsi) [17:26:38] (03PS1) 10Ema: ATS: limit Set-Cookie logging to actually cacheable responses [puppet] - 10https://gerrit.wikimedia.org/r/610876 (https://phabricator.wikimedia.org/T256395) [17:27:54] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [17:27:55] !log rebooting planet1002 (planet.wikimedia.org) for kernel update [17:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:56] !log crusnov@cumin2001 START - Cookbook sre.dns.netbox [17:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:00] (03PS1) 10Bstorm: paws-prometheus: add node exporter info to tools-prometheus for paws [puppet] - 10https://gerrit.wikimedia.org/r/610877 (https://phabricator.wikimedia.org/T256361) [17:31:50] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [17:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:57] (03PS2) 10Jforrester: ExtensionDistribution: Drop REL1_33, EOL'ed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607740 (https://phabricator.wikimedia.org/T256087) [17:32:14] (03CR) 10Jforrester: [C: 03+2] ExtensionDistribution: Drop REL1_33, EOL'ed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607740 (https://phabricator.wikimedia.org/T256087) (owner: 10Jforrester) [17:32:14] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Implement redirect for hide banner cookie issue - https://phabricator.wikimedia.org/T251780 (10MBeat33) Great, thank you for confirming @Peter Coombe [17:32:33] !log crusnov@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:32:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:04] (03Merged) 10jenkins-bot: ExtensionDistribution: Drop REL1_33, EOL'ed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607740 (https://phabricator.wikimedia.org/T256087) (owner: 10Jforrester) [17:33:07] (03PS1) 10Ottomata: Rename anaconda to anaconda-wmf repo [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610879 [17:33:09] (03PS1) 10Ottomata: Initial debian commit [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) [17:33:55] !log deploying frack codfw management dns automation [17:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:12] (03CR) 10CRusnov: [C: 03+2] mgmt: netbox-generated data for frack mgmt codfw [dns] - 10https://gerrit.wikimedia.org/r/607741 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [17:35:19] !log rebooting moscovium for kernel update [17:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:25] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [17:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:43] Hmm. [17:36:00] Manually logging as logmsgbot just died: [17:36:23] !log Synchronized wmf-config/CommonSettings.php: ExtensionDistribution: Drop REL1_33, EOL'ed T256087 [17:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:27] T256087: Formally EOL REL1_33 - https://phabricator.wikimedia.org/T256087 [17:36:33] Cool. All done. 
[17:36:34] 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10Dzahn) 05Open→03Stalled [17:36:41] (03CR) 10Ottomata: [C: 03+2] Rename anaconda to anaconda-wmf repo [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610879 (owner: 10Ottomata) [17:36:45] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Rename anaconda to anaconda-wmf repo [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610879 (owner: 10Ottomata) [17:37:43] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (10Dzahn) a:05ssingh→03Jrbranaa [17:37:52] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [17:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:05] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10Dzahn) a:05ssingh→03Nuria [17:40:46] (03PS1) 10Bstorm: tools-prometheus: reduce scrapes and increase scrape_timeout on cadvisor [puppet] - 10https://gerrit.wikimedia.org/r/610883 [17:42:35] !log codfw frack management dns automation deployment complete T233183 [17:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:39] T233183: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 [17:46:16] (03PS3) 10Jforrester: Revert "dblists: Remove "do not modify" note from all.dblist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594214 [17:46:24] (03PS4) 10Jforrester: buildDBLists: Remove circular dependency on all.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594216 (https://phabricator.wikimedia.org/T251715) [17:47:37] (03PS2) 10Ema: ATS: limit Set-Cookie logging to 
actually cacheable responses [puppet] - 10https://gerrit.wikimedia.org/r/610876 (https://phabricator.wikimedia.org/T256395) [17:48:11] (03PS2) 10Ottomata: Initial debian commit [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) [17:48:29] (03PS5) 10JMeybohm: Add basic chartmuseum library and helm-chartctl CLI [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/610849 (https://phabricator.wikimedia.org/T257333) [17:48:31] (03PS1) 10JMeybohm: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/610886 [17:48:50] (03CR) 10jerkins-bot: [V: 04-1] ATS: limit Set-Cookie logging to actually cacheable responses [puppet] - 10https://gerrit.wikimedia.org/r/610876 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [17:49:24] 10Operations, 10Phabricator: Can't access phabricator from my server - https://phabricator.wikimedia.org/T257507 (10Dzahn) > might be blocked due to past vandalism. Yes, confirmed. This IP is in a range that is blocked by us. It from T218589 most likely. CC: @chasemp > why isn't there a... more informati... 
[17:49:31] (03CR) 10jerkins-bot: [V: 04-1] Add basic chartmuseum library and helm-chartctl CLI [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/610849 (https://phabricator.wikimedia.org/T257333) (owner: 10JMeybohm) [17:49:53] (03CR) 10jerkins-bot: [V: 04-1] New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/610886 (owner: 10JMeybohm) [17:50:01] 10Operations, 10Phabricator, 10Security-Team: Can't access phabricator from my server - https://phabricator.wikimedia.org/T257507 (10Dzahn) [17:50:06] (03PS3) 10Ema: ATS: limit Set-Cookie logging to actually cacheable responses [puppet] - 10https://gerrit.wikimedia.org/r/610876 (https://phabricator.wikimedia.org/T256395) [17:51:12] (03PS6) 10JMeybohm: Add basic chartmuseum library and helm-chartctl CLI [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/610849 (https://phabricator.wikimedia.org/T257333) [17:51:14] (03PS2) 10JMeybohm: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/610886 [17:51:52] jouncebot: next [17:51:52] In 0 hour(s) and 8 minute(s): Morning backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200709T1800) [17:57:29] bblack: hi! yt? [17:57:44] Also, whom can I ping for feedback on Varnish caching? [18:00:04] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Morning backport window(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200709T1800). [18:00:04] JDrewniak: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:56] (03CR) 10JMeybohm: [C: 03+1] "No idea about the actual values, but LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/610855 (https://phabricator.wikimedia.org/T225680) (owner: 10Alexandros Kosiaris) [18:01:01] ema vgutierrez ^ ? [18:01:14] o/ heyo [18:02:32] jan_drewniak: hi? [18:03:11] is anybody around to deploy a config change for me? [18:04:06] jan_drewniak: hi again :-) [18:04:32] robh cdanis hey just asking again ^ for help with Varnish stuff :) apologies for the bother and thanks in advance! [18:04:43] I can deploy today! [18:04:50] uhh, i dont do deploys, sorry [18:04:51] (03CR) 10Urbanecm: [C: 03+2] Updating config for Readers Web affinity quicksurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610797 (https://phabricator.wikimedia.org/T246977) (owner: 10Jdrewniak) [18:05:09] Urbanecm: ha ha, we meet again :P I messed up some quicksurvey config morning, needs a redeploy. [18:05:14] robh: who can I ping for Varnish? This is the change in question: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/609810 [18:05:35] Just need to be sure it'll play nicely with Varnish [18:05:41] (03Merged) 10jenkins-bot: Updating config for Readers Web affinity quicksurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610797 (https://phabricator.wikimedia.org/T246977) (owner: 10Jdrewniak) [18:06:04] AndyRussG: I'd assume someone in the traffic team. You pinged them all though about 6 minutes ago =] [18:06:19] unless its an emergency, then anyone with deploy [18:06:26] (03PS3) 10Ottomata: Initial debian commit [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) [18:06:29] but usually you want the subject matter teams to review afaik [18:06:55] jan_drewniak: ready at mwdebug1001 for testing :-) [18:07:04] robh: yeah thanks! K I just didn't know if I might be missing anyone... apologies for the bother [18:07:13] no worries at all! 
[18:07:36] i mean, if its something broken and we need to scramble i can start texting folks on your behalf [18:07:45] but if its a normal feature change then my texting them isnt going to go down well ;D [18:08:27] Urbanecm: ok great, this time I actually clicked through on the surveys and they work. Looks good to deploy [18:08:35] (03CR) 10Krinkle: ATS: limit Set-Cookie logging to actually cacheable responses (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/610876 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [18:08:38] jan_drewniak: cool, syncing! [18:10:10] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 9f2557f848e99facaa62ca6b3a948cc3e32c32a3: Updating config for Readers Web affinity quicksurvey (T246977) (duration: 01m 06s) [18:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:16] T246977: Run baseline quicksurvey on test wikis - https://phabricator.wikimedia.org/T246977 [18:11:23] jan_drewniak: just saw a warning, ` Failed to find reader-demographics-1-description (en)` [18:12:33] (in logstash, I mean) [18:12:38] could you have a look? happens at hewiki [18:13:29] Urbanecm: ok thanks! It looks like we'll need to add that page to hewiki in that case. [18:16:06] jan_drewniak: and frwiktionary, see https://logstash.wikimedia.org/goto/e601a67f8caefb1a2298bea4c79559e8 [18:16:10] 10Operations, 10SRE-Access-Requests: Requesting access to centralauth database for Jennifer Wang - https://phabricator.wikimedia.org/T255836 (10Niharika) Thank you so much, @jcrespo. :) [18:18:37] Urbanecm: thanks for that link! [18:18:45] AndyRussG: hey sorry, I'm off this afternoon and tomorrow, but I can take a look Monday if you don't hear from ema or bblack by then [18:19:19] cdanis: ah ok thanks much! 
[18:22:00] (03PS4) 10Ema: ATS: limit Set-Cookie logging to actually cacheable responses [puppet] - 10https://gerrit.wikimedia.org/r/610876 (https://phabricator.wikimedia.org/T256395) [18:23:21] AndyRussG: I'm still behind the keyboard after 8PM, I can take a look tomorrow though! [18:26:20] (03CR) 10Ema: [C: 03+2] ATS: limit Set-Cookie logging to actually cacheable responses [puppet] - 10https://gerrit.wikimedia.org/r/610876 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [18:34:28] ema thanks! I just found out this is urgent :/ [18:35:04] ema: maybe there's some doc I can look at? It's just to figure out how we could set a cookie via a Special page for non-logged in users on Wikipedia [18:35:22] we've been doing it for a while, actually, but just got a task in recently saying it's not caching properly [18:36:21] ema: here's the task that we received indicating that the current system may not be working properly. We need to change the way Special:HideBanners works anyway, because browser changes: https://phabricator.wikimedia.org/T256447 [18:40:48] 10Operations, 10Analytics-Radar, 10Gerrit: update git-review to >= 1.27 on all stretch hosts across the board - https://phabricator.wikimedia.org/T257609 (10Dzahn) [18:42:48] 10Operations, 10Analytics-Radar, 10Gerrit: update git-review to >= 1.27 on all stretch hosts across the board - https://phabricator.wikimedia.org/T257609 (10Dzahn) per T257496#6294709 I imported the buster 1.27 package to stretch (first tested on mwdebug1001, then imported on apt.wikimedia.org with included... 
[18:44:37] !log stat1004, stat1006, stat1007 - upgrading git-review package from 1.25 to 1.27 so that it keeps working with new Gerrit 3.2 (T257609) [18:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:46] T257609: update git-review to >= 1.27 on all stretch hosts across the board - https://phabricator.wikimedia.org/T257609 [18:46:46] 10Operations, 10Wikimedia-Apache-configuration: Encoding discrepancy in Special:Redirect API - https://phabricator.wikimedia.org/T257608 (10Reedy) ` $ curl -I -L "https://wikipedia.org/wiki/Special:Redirect/file/(61-365)_Can_you_imagine%3F_(5320329773).jpg" | grep location location: https://www.wikipedia.org/w... [18:47:00] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10Jclark-ctr) @Marostegui Tsr report showed a few more errors and dell would like to address. what day works best to schedule downtime? ` Good morning John, Per our phone conversations this morning, w... [18:47:42] 10Operations, 10Analytics-Radar, 10Gerrit: upgrade git-review to >= 1.27 on all stretch hosts across the board - https://phabricator.wikimedia.org/T257609 (10Dzahn) [18:48:37] 10Operations, 10Analytics-Radar, 10Gerrit: upgrade git-review to >= 1.27 on all stretch hosts across the board - https://phabricator.wikimedia.org/T257609 (10Dzahn) 05Open→03Resolved a:03Dzahn Upgraded the package on stat1004, stat1006 and stat1007 (apt-get update, apt-get install git-review). https:/... [18:51:44] !log update spark2 to 2.4.4-bin-hadoop2.6-3 for buster-wikimedia [18:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] twentyafterfour and James_F: I, the Bot under the Fountain, allow thee, The Deployer, to do Mediawiki train - American+European Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200709T1900). 
[19:02:45] (03PS1) 1020after4: all wikis to 1.35.0-wmf.40 refs T256668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610904 [19:02:47] (03CR) 1020after4: [C: 03+2] all wikis to 1.35.0-wmf.40 refs T256668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610904 (owner: 1020after4) [19:03:42] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.40 refs T256668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610904 (owner: 1020after4) [19:05:47] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.40 refs T256668 [19:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:52] T256668: 1.35.0-wmf.40 deployment blockers - https://phabricator.wikimedia.org/T256668 [19:07:26] (03CR) 10Ryan Kemper: [C: 03+2] [wcqs] gui custom config [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles) [19:08:16] (03CR) 10Ryan Kemper: [C: 03+2] query_service: Remove more hardcoding of wdqs [puppet] - 10https://gerrit.wikimedia.org/r/610401 (owner: 10Ebernhardson) [19:09:56] (03CR) 10Andrew Bogott: paws-prometheus: add node exporter info to tools-prometheus for paws (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/610877 (https://phabricator.wikimedia.org/T256361) (owner: 10Bstorm) [19:10:25] (03CR) 10Ryan Kemper: "`sudo puppet-merge` done" [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles) [19:10:28] (03CR) 10Andrew Bogott: [C: 03+1] tools-prometheus: reduce scrapes and increase scrape_timeout on cadvisor [puppet] - 10https://gerrit.wikimedia.org/r/610883 (owner: 10Bstorm) [19:10:30] (03CR) 10Ryan Kemper: "`sudo puppet-merge` done" [puppet] - 10https://gerrit.wikimedia.org/r/610401 (owner: 10Ebernhardson) [19:16:27] !log upgraded eqiad elk7 cluster from 7.4.2 to 7.8.0 T234854 [19:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:16:33] T234854: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 [19:17:54] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:19:44] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:26:07] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10Marostegui) Thanks @jclark-ctr. We need to schedule a maintenance window as this is an active master. I will get that done next week and let you know when you can power off the host and replace the board.... [19:27:10] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_eventgate_main_http_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:27:41] uh is this significant ^ [19:27:42] 10Operations, 10vm-requests: eqiad: 1 VM request for aphlict - https://phabricator.wikimedia.org/T257617 (10Dzahn) [19:28:06] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:28:23] 10Operations, 10Phabricator, 10vm-requests: eqiad: 1 VM request for aphlict - https://phabricator.wikimedia.org/T257617 (10Dzahn) [19:29:02] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:31:48] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:33:44] twentyafterfour: looks like they were legit [19:34:03] we had a large spike of exception around 19:27 [19:34:03] nothing was showing in logstash [19:34:20] not in my filtered dashboard [19:34:56] it shows up on https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors [19:35:01] (the unfiltered errors) [19:35:19] ah it's JobQueue/JobQueueEventBus.php: Could not enqueue jobs: Unable to deliver all events: 503: Service Unavailable [19:35:47] which is the usual EventBus queue throwing an exception cause the backend yields 503 :/ [19:36:38] seems to be unrelated to the train [19:48:29] I guess the backend died yeah [19:49:51] As ever. [19:52:51] (03PS1) 10Andrew Bogott: Remove redundant profile::openstack::eqiad1::openstack_controllers settings [puppet] - 10https://gerrit.wikimedia.org/r/610914 [19:55:54] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:59:38] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:03:52] 10Operations, 10CAS-SSO, 10User-jbond: mod_auth_cas segfaulting on netmon - https://phabricator.wikimedia.org/T257587 (10jbond) Looking at this a bit more to try and build the module linking to only one lib. one thing that confuses me is that apache2-bin depends on libssl1.0.2 however apache2-dev depends on... [20:16:38] 10Operations, 10Phabricator, 10Security-Team: Can't access phabricator from my server - https://phabricator.wikimedia.org/T257507 (10chasemp) >>! In T257507#6294524, @Dzahn wrote: >> might be blocked due to past vandalism. > > Yes, confirmed. This IP is in a range that is blocked by us. It comes from T218... [20:25:15] 10Operations, 10Phabricator, 10Security-Team: Can't access phabricator from my server - https://phabricator.wikimedia.org/T257507 (10Dzahn) Aha, thanks Chase! [20:26:23] 10Operations, 10Phabricator, 10Security-Team: Can't access phabricator from my server - https://phabricator.wikimedia.org/T257507 (10Dzahn) [20:26:28] 10Operations, 10Patch-For-Review, 10User-jbond: update profile::waf::apache2::administrative to use the new abuse_networks hiera key - https://phabricator.wikimedia.org/T253632 (10Dzahn) [20:26:31] 10Operations, 10Phabricator, 10Security-Team: Can't access phabricator from my server - https://phabricator.wikimedia.org/T257507 (10Isarra) Okay, that would explain it, at least. T229620 seems pretty major in particular and I'd argue describes the //bare minimum// of what needs to be resolved here, as when... 
[20:27:23] 10Operations, 10Phabricator, 10Security-Team: Can't access phabricator from my server - https://phabricator.wikimedia.org/T257507 (10Dzahn) [20:29:57] 10Operations, 10Phabricator, 10Traffic: Accessing Phabricator from Tor - https://phabricator.wikimedia.org/T254568 (10Dzahn) [20:30:01] 10Operations, 10Patch-For-Review, 10User-jbond: update profile::waf::apache2::administrative to use the new abuse_networks hiera key - https://phabricator.wikimedia.org/T253632 (10Dzahn) [20:31:46] 10Operations, 10CAS-SSO, 10User-jbond: mod_auth_cas segfaulting on netmon - https://phabricator.wikimedia.org/T257587 (10jbond) I managed to rebuild a test module linking only to libcrypto1.0.2 ` root@netmon2001:~# ldd /usr/lib/apache2/modules/mod_auth_cas_dev.so | grep -E 'crypto|ssl' libcrypto.so... [20:32:12] 10Operations, 10Phabricator, 10Traffic: Accessing Phabricator from Tor - https://phabricator.wikimedia.org/T254568 (10Dzahn) T253632 should be seen as a parent task of this. Also see T257507#6294524 [20:33:02] 10Operations, 10Phabricator, 10Traffic: Accessing Phabricator from Tor (some ranges blocked but not others) - https://phabricator.wikimedia.org/T254568 (10Dzahn) [20:36:02] (03CR) 10Bstorm: paws-prometheus: add node exporter info to tools-prometheus for paws (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/610877 (https://phabricator.wikimedia.org/T256361) (owner: 10Bstorm) [20:42:23] 10Operations, 10CAS-SSO, 10User-jbond: mod_auth_cas segfaulting on netmon - https://phabricator.wikimedia.org/T257587 (10jbond) comparing graphite1004 (working) with netmon2001 (not working) i notice that the latter apache proces has /usr/lib/x86_64-linux-gnu/libssl.so.1.1 loaded but the former doesn't ` n... 
[20:50:54] 10Operations, 10CAS-SSO, 10User-jbond: mod_auth_cas segfaulting on netmon - https://phabricator.wikimedia.org/T257587 (10jbond) > the libssl.so.1.1 is loaded by /usr/lib/apache2/modules/libphp7.2.so removing this module prevents the segfault (obviously this is not a workable solution) [20:54:44] (03PS1) 10Jbond: Revert "librenms: update librenms to use apereo_cas SSO" [puppet] - 10https://gerrit.wikimedia.org/r/610732 [20:54:47] (03PS1) 10Jbond: Revert "librenms: add support for apereo cas" [puppet] - 10https://gerrit.wikimedia.org/r/610733 [20:54:49] (03PS1) 10Jbond: Revert "profile::librenms: update to use lookup instead of hiera..." [puppet] - 10https://gerrit.wikimedia.org/r/610734 [20:55:14] (03Abandoned) 10Jbond: Revert "librenms: update librenms to use apereo_cas SSO" [puppet] - 10https://gerrit.wikimedia.org/r/610732 (owner: 10Jbond) [20:55:22] (03Abandoned) 10Jbond: Revert "librenms: add support for apereo cas" [puppet] - 10https://gerrit.wikimedia.org/r/610733 (owner: 10Jbond) [20:55:45] (03Abandoned) 10Jbond: Revert "profile::librenms: update to use lookup instead of hiera..." [puppet] - 10https://gerrit.wikimedia.org/r/610734 (owner: 10Jbond) [20:57:50] PROBLEM - librenms.wikimedia.org requires authentication on netmon2001 is CRITICAL: connect to address 208.80.153.110 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:58:04] ^^ fixing [20:58:09] (03PS1) 10Jbond: librenms: convert back to ldap config [puppet] - 10https://gerrit.wikimedia.org/r/610927 (https://phabricator.wikimedia.org/T257587) [20:58:14] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:58:20] ^^ and this [20:58:53] (03CR) 10Jbond: [C: 03+2] librenms: convert back to ldap config [puppet] - 10https://gerrit.wikimedia.org/r/610927 (https://phabricator.wikimedia.org/T257587) (owner: 10Jbond) [21:01:56] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:02:54] 10Operations, 10Wikimedia-Mailing-lists, 10Cloud-VPS (Project-requests), 10cloud-services-team (Kanban): Request creation of mailman VPS project - https://phabricator.wikimedia.org/T257270 (10bd808) 05Open→03Resolved ` $ sudo wmcs-openstack --os-region-name eqiad1-r project create --enable --descriptio... [21:03:02] 10Operations, 10Security-Team, 10Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to 3.3 - https://phabricator.wikimedia.org/T52864 (10bd808) [21:52:28] !log all sessions have been invalidated due to T256395 [21:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:13] 10Operations, 10Pywikibot, 10cloud-services-team (Kanban): http://pywikibot.org/ is displaying Wikimedia error page - https://phabricator.wikimedia.org/T257536 (10bd808) Same error after merge of my config change, so either Puppet does not automatically regenerate something (possible) or the setup for redire... [22:18:14] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T256201 (10KFrancis) @jcrespo Confirming the NDA is complete. Thanks! 
[22:18:54] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton [22:19:52] PROBLEM - Check size of conntrack table on kubernetes1001 is CRITICAL: connect to address 10.64.0.121 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [22:20:06] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1001 is CRITICAL: connect to address 10.64.0.121 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:21:27] 10Operations, 10LDAP-Access-Requests: Add Conny Kawohl to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T257038 (10KFrancis) @Dzahn Confirming the NDA is complete! Thanks! [22:21:50] PROBLEM - MD RAID on kubernetes1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.121: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [22:22:00] 10Operations, 10LDAP-Access-Requests: Add Conny Kawohl to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T257038 (10Dzahn) a:03Dzahn [22:22:19] 10Operations, 10LDAP-Access-Requests: Add Conny Kawohl to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T257038 (10Dzahn) [22:22:44] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [22:23:38] PROBLEM - Check size of conntrack table on kubernetes1001 is CRITICAL: connect to address 10.64.0.121 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [22:25:28] RECOVERY - Check size of conntrack table on kubernetes1001 is OK: OK: 
nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [22:25:47] (03CR) 10Bstorm: [C: 03+2] tools-prometheus: reduce scrapes and increase scrape_timeout on cadvisor [puppet] - 10https://gerrit.wikimedia.org/r/610883 (owner: 10Bstorm) [22:28:24] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton [22:30:50] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: connect to address 10.64.0.121 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:32:58] PROBLEM - Check size of conntrack table on kubernetes1001 is CRITICAL: connect to address 10.64.0.121 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [22:33:58] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [22:34:02] PROBLEM - Check systemd state on kubernetes1001 is CRITICAL: connect to address 10.64.0.121 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:35:05] (03CR) 10Bstorm: "Since the only thing I really *need* node exporter for is the control plane on this. 
I think I'll go with a static config like we did for " [puppet] - 10https://gerrit.wikimedia.org/r/610877 (https://phabricator.wikimedia.org/T256361) (owner: 10Bstorm) [22:38:30] RECOVERY - Check size of conntrack table on kubernetes1001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [22:39:34] RECOVERY - Check systemd state on kubernetes1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:39:50] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:42:30] RECOVERY - puppet last run on kubernetes1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:43:34] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:43:34] RECOVERY - MD RAID on kubernetes1001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [22:45:24] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:48:27] (03PS2) 10Bstorm: paws-prometheus: add node exporter info to tools-prometheus for paws [puppet] - 10https://gerrit.wikimedia.org/r/610877 (https://phabricator.wikimedia.org/T256361) [22:50:56] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:50:59] (03PS2) 10Krinkle: Enable wgForceHTTPS and wgCookieSameSite='None' (Phase 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610147 (https://phabricator.wikimedia.org/T256095) [22:51:42] (03CR) 10Krinkle: [C: 03+2] Enable wgForceHTTPS and wgCookieSameSite='None' (Phase 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610147 (https://phabricator.wikimedia.org/T256095) (owner: 10Krinkle) [22:52:32] (03Merged) 10jenkins-bot: Enable wgForceHTTPS and wgCookieSameSite='None' (Phase 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610147 (https://phabricator.wikimedia.org/T256095) (owner: 10Krinkle) [22:55:06] (03PS3) 10Bstorm: paws-prometheus: add node exporter info to tools-prometheus for paws [puppet] - 10https://gerrit.wikimedia.org/r/610877 (https://phabricator.wikimedia.org/T256361) [22:56:12] (03CR) 10Bstorm: "Turns out prometheus-labs-targets accepts a project arg 😊" [puppet] - 10https://gerrit.wikimedia.org/r/610877 (https://phabricator.wikimedia.org/T256361) (owner: 10Bstorm) [22:57:44] (03CR) 10Bstorm: "bstorm@tools-prometheus-03:~$ prometheus-labs-targets --project paws" [puppet] - 10https://gerrit.wikimedia.org/r/610877 (https://phabricator.wikimedia.org/T256361) (owner: 10Bstorm) [22:58:55] (03CR) 10Bstorm: paws-prometheus: add node exporter info to tools-prometheus for paws (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/610877 
(https://phabricator.wikimedia.org/T256361) (owner: 10Bstorm) [23:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200709T2300). [23:00:21] (03CR) 10Andrew Bogott: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/610877 (https://phabricator.wikimedia.org/T256361) (owner: 10Bstorm) [23:00:25] (03CR) 10Andrew Bogott: [C: 03+1] paws-prometheus: add node exporter info to tools-prometheus for paws [puppet] - 10https://gerrit.wikimedia.org/r/610877 (https://phabricator.wikimedia.org/T256361) (owner: 10Bstorm) [23:01:18] (03CR) 10Bstorm: [C: 03+2] paws-prometheus: add node exporter info to tools-prometheus for paws [puppet] - 10https://gerrit.wikimedia.org/r/610877 (https://phabricator.wikimedia.org/T256361) (owner: 10Bstorm) [23:04:47] (03CR) 10BryanDavis: [C: 03+1] lxc: Remove jessie compat code [puppet] - 10https://gerrit.wikimedia.org/r/610707 (owner: 10Muehlenhoff) [23:09:18] RoanKattouw Niharika Urbanecm around? [23:10:06] I'm here, will SWAT [23:10:10] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: I2c2dea832 (duration: 00m 56s) [23:10:10] Sorry for the delay [23:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:38] Krinkle: Could you let me know when you're done so I can start the backport window? [23:11:15] AndyRussG: Hmm I'm confused, there's nothing on the wiki page? Are you looking to do a last-minute addition? [23:11:17] RoanKattouw: hi! how's it going! I think I do have a patch but it's still being discussed whether to push it out now [23:11:26] Oh hmm OK [23:11:36] I'm about to go to the grocery store unfortunately [23:11:54] RoanKattouw: ahhh okok [23:13:30] RoanKattouw: greg-g do you know what the policy is these days for urgent Friday deploys? 
[23:18:02] RoanKattouw: all good [23:18:06] go ahead [23:22:09] Krinkle: RoanKattouw: I think for the CN patch we'll try to get permission to deploy this CN thing a bit later or tomorrow. Also if you know someone who's about now who can advise on Varnish and setting cookies for anons in PHP that'd be hugely appreciated... thanks much and sorry for the bother [23:22:43] I would generally recommend bblack but it's 6:22pm in his tz so he might not be around right this minute [23:25:22] We've had numerous incidents involving cookies on anons [23:25:38] I think it's safe to safe no such thing will be approved in any form in the foreseeable future. [23:25:46] safe to say* [23:26:21] AndyRussG: if this is about the WMDE campaign, I commented about a patch from Addshore. [23:26:29] I don't think it works as intended currently on dewiki [23:27:29] Krinkle: no, it's for hiding banners for donors [23:28:01] Krinkle: here's the task: https://phabricator.wikimedia.org/T251780 [23:28:59] And here's a task from Tim-away mentioning the existing system doesn't really cache as expected [23:29:11] AndyRussG: indeed. [23:29:21] AndyRussG: what's new here? HideBanners is a thing. [23:30:13] Krinkle: hide cookies can no longer be set as third-party cookies in some (many?) browsers [23:30:41] So for donors, after they finish donating, we're implementing a redirect through Wikipedia to set a cookie there to hide banners [23:31:11] Krinkle: here's the patch in question: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/609810 [23:31:21] As things are, a lot of donors are not getting banners hidden [23:31:44] AndyRussG: I know all too well.
This is why we're also working around the clock to fix CentralAuth as otherwise that won't work either :) [23:31:57] Also, I guess I don't understand why Special:HideBanners does work, if it can't be cached, and in view of what you said above ^ [23:32:00] oh interesting [23:32:13] Anything setting cookies cannot be cached, never has been. [23:32:32] any code pretending to otherwise gets overridden at three different layers [23:32:32] Krinkle: ok interesting... but it will still set the cookie? [23:32:37] MW shutdown function, ATS, Varnish. [23:33:18] So basically Special:HideBanners is working, but just not caching, everyone's always just hitting the application layer? [23:34:23] Krinkle: the new patch doesn't do anything new... it would still set the cookie via php's setcookie (which is what was used before, though I see there's a method on WebRequest) [23:34:46] And the new patch no longer disables output, like the old one did [23:34:56] If you have a minute to look at it and sign off on it, that'd be absolutely fantastic [23:35:04] and apologies again for the bother [23:35:57] all I can say is please make no major changes to how CN works/behaves operationally in the coming weeks. We have a lot of incident follow up and other session/security stuff to deal with. [23:36:13] If setting SameSite=None works for you, that seems fine. [23:36:32] keep enabling or disabling OutputPage as currently if possible. [23:36:56] use WebResponse to set cookies [23:37:01] do not use setcookie() directly [23:37:32] Krinkle: well it currently disables output, and uses setcookie directly [23:37:54] ok, let's stick with that then [23:38:19] Krinkle: so then we'll also manually do the redirect and not enable output? [23:38:22] AndyRussG: where does the redirect come in? [23:38:28] does it already use a redirect today? [23:38:38] Krinkle: no, that's what's new [23:38:42] why?
[23:38:59] Because this is for people not currently on Wikipedia [23:39:21] Current system is, they donate via payments wiki, and if the donation is successful, they go to donate wiki thank-you page [23:39:43] and donate wiki in the background calls Special:HideBanner for a ton of sites [23:39:56] So, for donors, the only site where we really need to hide banners is Wikipedia [23:40:04] since that's the only place we show fundraising banners [23:40:19] ok, but setting it for a ton of sites like today seems fine [23:40:23] the TY page on donatewiki has been making img tags with a src of Special:HideBanner [23:40:24] which part is not working [23:40:29] after donating they get all fundraising banners hidden for a year [23:40:41] but that's now not allowing third-party cookie [23:40:42] Krinkle: doesn't work 'cause it's third party [23:40:51] cookies don't live longer than two weeks in practice btw [23:40:56] whaaa [23:41:13] this is a browser thing, not a wmf thing [23:41:17] do people clear cookies that often? [23:41:19] ask me some other time [23:41:21] no [23:41:23] ok [23:41:44] when you say third party, I assume you don't mean non-wmf [23:41:53] Krinkle: correct [23:41:55] OK, so let me take a guess. [23:41:59] it's just donate wiki vs Wikipedia [23:42:22] the browser is rejecting the cookie because it's not setting Set-cookie SameSite=None, which is a new requirement in browser this year slowly being rolled out. [23:42:30] It landed in Safari and is hitting Chrome stable next week [23:42:30] so the solution is, instead of sending them directly from payments to the thank-you page, we'll send them through WP, where the cookie will get set, and then redirect to donate wiki thankyou page [23:42:53] I don't think that's necessary [23:43:17] Krinkle: soooooo the solution is just to set that header on donate wiki?
[23:43:21] Krinkle no, it's because it was img tags in donatewiki, loading enwiki/Special:Hidebanner, wikibooks/Special:HideBanner, etc [23:43:37] this is conceptually the same thing we do for logging in. [23:43:52] Do you have details about why the browser is refusing them? [23:44:01] A repro case? a console warning? [23:46:49] Krinkle it's this: https://webkit.org/blog/10218/full-third-party-cookie-blocking-and-more/ [23:47:57] Krinkle: I have not watched them be refused in browser tools, no [23:49:32] The first browser to do it was Safari, and most all the devs on our team have linux machines [23:49:43] Krinkle: there was some investigation, the details of which I didn't follow... https://phabricator.wikimedia.org/T244699 [23:49:58] but we have plenty of feedback from donor services that the amount of complaints of donors still seeing banners is rising [23:49:58] OK, so this is not about the upcoming SameSite restrictions [23:50:05] https://www.chromium.org/updates/same-site [23:50:07] nope, but we should get out ahead of that soon! [23:50:36] The thing with SameSite is that it applies to which (existing) cookies are applied to any given request. [23:50:47] We do know there are tickets coming in from Donor services from donors who are annoyed because they keep seeing banners [23:52:19] so the SameSite restriction will for example make it so that if I embed an