[00:01:38] 10Operations, 10ops-codfw, 10decommission-hardware, 10serviceops: decommission mc2028.codfw.wmnet - https://phabricator.wikimedia.org/T261168 (10Papaul) [00:02:26] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 58.25 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:03:04] all of above expected [00:03:10] eqiad traffic drop because of the geo-remapping involved [00:04:39] 10Operations, 10Documentation: Improve documentation for mirrors.wikimedia.org - https://phabricator.wikimedia.org/T179856 (10Dzahn) @Quiddity I made some more edits to the page to give it more structure and add the things you listed. Why, who, where, added the links.. etc. Good enough to resolve? [00:08:04] !log T259621 cdanis@re0.cr3-esams> request vmhost software add /var/tmp/junos-vmhost-install-mx-x86-64-17.3R3-S8.1.tgz re1 [00:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:08] (03CR) 10BryanDavis: "PCC output: https://puppet-compiler.wmflabs.org/compiler1003/24635/" [puppet] - 10https://gerrit.wikimedia.org/r/622237 (https://phabricator.wikimedia.org/T251628) (owner: 10BryanDavis) [00:14:05] !log T259621 cdanis@re0.cr3-esams> request vmhost reboot re1 [00:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:09] !log T259621 cdanis@re0.cr3-esams> request chassis routing-engine master switch [00:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:10] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:22:14] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 84, down: 4, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:22:41] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:23:57] also expected [00:24:05] !log T259621 cdanis@re1.cr3-esams> request vmhost software add /var/tmp/junos-vmhost-install-mx-x86-64-17.3R3-S8.1.tgz re0 [00:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:11] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:24:40] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:27:52] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 71.68 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:30:50] !log T259621 cdanis@re1.cr3-esams> request vmhost reboot re0 [00:30:52] 10Operations, 10Documentation: Improve documentation for mirrors.wikimedia.org - https://phabricator.wikimedia.org/T179856 (10Quiddity) 05Open→03Resolved Looks great! 
Thank you :) [00:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:45] !log T259621 cdanis@re1.cr3-esams> request chassis routing-engine master switch [00:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:16] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:43:14] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:56:58] !log T259621 ❌cdanis@cumin1001.eqiad.wmnet ~ 🕘🍺 homer 'cr*' commit 'drain cr2-esams transport link' [00:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:07:15] !log cdanis@re0.cr2-esams> request system software add validate re1 /var/tmp/junos-vmhost-install-mx-x86-64-17.3R3-S8.1.tgz [01:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:06] !log T259621 wrong junos version was staged on cr2-esams, abandoning this attempt and putting back in service [01:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:16:10] (03PS1) 10CDanis: Revert "depool esams for router upgrades" [dns] - 10https://gerrit.wikimedia.org/r/622208 [01:16:22] (03CR) 10CDanis: [C: 03+2] Revert "depool esams for router upgrades" [dns] - 10https://gerrit.wikimedia.org/r/622208 (owner: 10CDanis) [01:17:12] !log repool esams [01:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:16] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 51.43 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [01:35:40] expected [01:35:51] thinks look copacetic, signing off for now [01:45:33] (03CR) 10Dave Pifke: [C: 03+1] 
webperf: add data types to profiles [puppet] - 10https://gerrit.wikimedia.org/r/621756 (owner: 10Dzahn) [01:58:38] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [02:07:12] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.36.0-wmf.6 [core] (wmf/1.36.0-wmf.6) - 10https://gerrit.wikimedia.org/r/622250 [02:13:46] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [02:15:42] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [03:47:55] (03PS1) 10KartikMistry: Enable ContentTranslation as a default tool in Assamese and Burmese WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622257 (https://phabricator.wikimedia.org/T258503) [04:37:38] 10Operations, 10serviceops, 10Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10tstarling) Has anyone got an idea for giving the HMAC key to the server without allowing the command to have access to it? Otherwise an... 
[05:03:00] (03PS1) 10Marostegui: db1092,db1084: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/622259 [05:04:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1084,db1092 after MCR changes', diff saved to https://phabricator.wikimedia.org/P12332 and previous config saved to /var/cache/conftool/dbconfig/20200825-050451-marostegui.json [05:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:06] (03CR) 10Marostegui: [C: 03+2] db1092,db1084: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/622259 (owner: 10Marostegui) [05:10:02] !log Deploy MCR schema change on s1 codfw, this will create lag on s1 codfw - T238966 [05:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:07] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 [05:11:06] !log Remove revisions triggers from db2094:3311 T238966 [05:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:13:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1084,db1092 after MCR changes', diff saved to https://phabricator.wikimedia.org/P12333 and previous config saved to /var/cache/conftool/dbconfig/20200825-051327-marostegui.json [05:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:12] 10Operations, 10observability, 10Patch-For-Review, 10good first task: nagios-nrpe-server.service: systemd unit references path below legacy directory /var/run/ - https://phabricator.wikimedia.org/T252990 (10MoritzMuehlenhoff) Looking at git history, this service unit is shipped via Puppet since nagios-nrpe... 
[05:21:27] !log installing Java security updates on relforge* [05:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1084,db1092 after MCR changes', diff saved to https://phabricator.wikimedia.org/P12334 and previous config saved to /var/cache/conftool/dbconfig/20200825-052602-marostegui.json [05:26:03] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 556 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:23] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 556 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:38:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1084,db1092 after MCR changes', diff saved to https://phabricator.wikimedia.org/P12335 and previous config saved to /var/cache/conftool/dbconfig/20200825-053801-marostegui.json [05:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1111, db1118 for MCR change', diff saved to https://phabricator.wikimedia.org/P12336 and previous config saved to /var/cache/conftool/dbconfig/20200825-053856-marostegui.json [05:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:31] (03PS1) 10Marostegui: db1128: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/622260 (https://phabricator.wikimedia.org/T260324) [05:42:43] (03CR) 10Marostegui: [C: 03+2] db1128: Enable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/622260 (https://phabricator.wikimedia.org/T260324) (owner: 10Marostegui) [05:51:52] (03PS2) 10DannyS712: Branch commit for wmf/1.36.0-wmf.6 [core] (wmf/1.36.0-wmf.6) - 10https://gerrit.wikimedia.org/r/622250 (https://phabricator.wikimedia.org/T257974) (owner: 10TrainBranchBot) [05:53:23] (03CR) 10Muehlenhoff: [C: 03+2] toolforge: Remove jessie conditionals [puppet] - 10https://gerrit.wikimedia.org/r/617995 (owner: 10Muehlenhoff) [05:54:11] (03PS6) 10Muehlenhoff: Disable backports on stretch for production [puppet] - 10https://gerrit.wikimedia.org/r/613611 (https://phabricator.wikimedia.org/T256877) [05:54:34] (03PS1) 10Ayounsi: LibreNMS report, whitelist c2l54ce-ycmfam90 PDUs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/622261 [05:55:31] (03CR) 10Ayounsi: [C: 03+2] LibreNMS report, whitelist c2l54ce-ycmfam90 PDUs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/622261 (owner: 10Ayounsi) [06:08:33] (03CR) 10Muehlenhoff: ldap: remove jessie support (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/621372 (owner: 10Dzahn) [06:10:47] (03CR) 10Muehlenhoff: [C: 04-1] "The scb hosts still use that class on jessie, so this would lead to failing Puppet runs there." [puppet] - 10https://gerrit.wikimedia.org/r/621374 (owner: 10Dzahn) [06:20:18] (03PS3) 10ZPapierski: Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) [06:20:39] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [06:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:43] 10Operations, 10User-MoritzMuehlenhoff, 10User-jbond: Updated java security policy in OpenJDK 8 u252 - https://phabricator.wikimedia.org/T251493 (10MoritzMuehlenhoff) I think this can be closed, given that 593467 is merged? 
[06:26:39] (03PS1) 10Marostegui: wmnet: Decrease m5-master TTL to 1M [dns] - 10https://gerrit.wikimedia.org/r/622266 (https://phabricator.wikimedia.org/T260324) [06:37:35] (03CR) 10Elukey: [C: 03+1] Scap: git_fat -> git_binary_manager [software/cassandra-twcs] - 10https://gerrit.wikimedia.org/r/404228 (https://phabricator.wikimedia.org/T184882) (owner: 10Thcipriani) [06:37:59] (03CR) 10Elukey: [C: 03+1] Scap: git_fat -> git_binary_manager [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/404226 (https://phabricator.wikimedia.org/T184882) (owner: 10Thcipriani) [06:39:56] (03PS1) 10Ayounsi: Change eqord ASN to 65020 [homer/public] - 10https://gerrit.wikimedia.org/r/622268 (https://phabricator.wikimedia.org/T259593) [06:39:58] (03PS5) 10Giuseppe Lavagetto: Test deployments with helmfile lint [deployment-charts] - 10https://gerrit.wikimedia.org/r/620934 (https://phabricator.wikimedia.org/T258572) [06:41:57] (03CR) 10jerkins-bot: [V: 04-1] Test deployments with helmfile lint [deployment-charts] - 10https://gerrit.wikimedia.org/r/620934 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [06:43:18] (03PS1) 10Ayounsi: Puppet: change eqord ASN to 65020 [puppet] - 10https://gerrit.wikimedia.org/r/622269 (https://phabricator.wikimedia.org/T259593) [06:43:55] (03CR) 10Giuseppe Lavagetto: Test deployments with helmfile lint (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/620934 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [06:45:22] (03CR) 10Ayounsi: [C: 03+2] Puppet: change eqord ASN to 65020 [puppet] - 10https://gerrit.wikimedia.org/r/622269 (https://phabricator.wikimedia.org/T259593) (owner: 10Ayounsi) [06:45:44] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:57:57] 10Operations, 10observability, 10Patch-For-Review, 10good first task: 
nagios-nrpe-server.service: systemd unit references path below legacy directory /var/run/ - https://phabricator.wikimedia.org/T252990 (10ema) >>! In T252990#6407587, @Southparkfan wrote: > I have uploaded a new patch using /run on all se... [07:03:28] PROBLEM - Check the last execution of php7.2-fpm_check_restart on mwdebug1001 is CRITICAL: CRITICAL: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:03:33] (03PS4) 10ZPapierski: Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) [07:04:31] !log restartint blazegraph on wdqs1005 (T242453) [07:04:34] (03CR) 10jerkins-bot: [V: 04-1] Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) (owner: 10ZPapierski) [07:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:36] T242453: Deadlock in blazegraph blocking all queries and updates - https://phabricator.wikimedia.org/T242453 [07:05:14] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:05:28] PROBLEM - Query Service HTTP Port on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:07:26] RECOVERY - Query Service HTTP Port on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:07:46] PROBLEM - WDQS high update lag on wdqs1005 is CRITICAL: 6.267e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [07:09:09] !log depooling wdqs1005 (high lag) [07:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:27] !log Upgrade MySQL on dbstore1004 [07:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:21] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=1) [07:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:56] (03PS5) 10ZPapierski: Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) [07:28:27] (03PS6) 10ZPapierski: Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) [07:29:26] (03CR) 10jerkins-bot: [V: 04-1] Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) (owner: 10ZPapierski) [07:31:07] (03PS7) 10ZPapierski: Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) [07:33:58] ACKNOWLEDGEMENT - WDQS high update lag on wdqs1005 is CRITICAL: 6.079e+04 ge 4.32e+04 Gehel Blazegraph restarted, 
catching up on lag https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [07:34:32] (03PS8) 10ZPapierski: Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) [07:40:10] (03PS9) 10ZPapierski: Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) [07:42:40] (03CR) 10Volans: "Code looks ok from my point of view, few nit/question inline." (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 (owner: 10Ryan Kemper) [07:43:16] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/621605 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [07:45:12] 10Operations, 10observability: Grafana link redirecting to port :3000 - https://phabricator.wikimedia.org/T261184 (10elukey) [07:46:06] (03CR) 10Kormat: [C: 03+1] wmnet: Decrease m5-master TTL to 1M [dns] - 10https://gerrit.wikimedia.org/r/622266 (https://phabricator.wikimedia.org/T260324) (owner: 10Marostegui) [07:54:13] (03PS10) 10ZPapierski: Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) [07:56:02] (03CR) 10JMeybohm: [C: 04-1] package_builder: add support for 'sloppy' backports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/622190 (owner: 10CDanis) [08:01:42] 10Operations, 10serviceops, 10Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10daniel) >>! In T260330#6408193, @tstarling wrote: > Has anyone got an idea for giving the HMAC key to the server without allowing the co... 
[08:05:09] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: remove deprecated settings [puppet] - 10https://gerrit.wikimedia.org/r/621472 (https://phabricator.wikimedia.org/T259143) (owner: 10Filippo Giunchedi) [08:06:53] (03CR) 10Hashar: [C: 03+1] zuul: add data types, replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/621758 (owner: 10Dzahn) [08:07:47] (03CR) 10Volans: [C: 04-1] "Code looks good to me, one typo in a parameter." (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/621701 (https://phabricator.wikimedia.org/T260110) (owner: 10Jbond) [08:08:55] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add #o11y tag to logstash alert descriptions [puppet] - 10https://gerrit.wikimedia.org/r/622161 (owner: 10Herron) [08:09:27] (03CR) 10Filippo Giunchedi: [C: 03+1] pontoon: Use a symlink for /etc/puppet/hieradata/pontoon [puppet] - 10https://gerrit.wikimedia.org/r/621688 (owner: 10Kormat) [08:10:14] (03CR) 10Kormat: [C: 03+2] pontoon: Use a symlink for /etc/puppet/hieradata/pontoon [puppet] - 10https://gerrit.wikimedia.org/r/621688 (owner: 10Kormat) [08:11:53] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, for more information on the wikidata metric I believe Addshore might know more" [puppet] - 10https://gerrit.wikimedia.org/r/615269 (https://phabricator.wikimedia.org/T180105) (owner: 10Cwhite) [08:13:39] (03CR) 10Filippo Giunchedi: prometheus: add apache2 es-exporter config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621597 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [08:14:12] 10Operations, 10observability: Grafana link redirecting to port :3000 - https://phabricator.wikimedia.org/T261184 (10jijiki) p:05Triage→03Medium [08:18:11] !log deactivate eqord peering/transit - T259593 [08:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:15] T259593: Make eqord its own AS - https://phabricator.wikimedia.org/T259593 [08:19:21] !log reconfigure eqord to be AS65020 - 
T259593 [08:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:41] (03CR) 10Elukey: [V: 03+2 C: 03+2] Scap: git_fat -> git_binary_manager [software/cassandra-twcs] - 10https://gerrit.wikimedia.org/r/404228 (https://phabricator.wikimedia.org/T184882) (owner: 10Thcipriani) [08:21:55] (03CR) 10Elukey: [V: 03+2 C: 03+2] Scap: git_fat -> git_binary_manager [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/404226 (https://phabricator.wikimedia.org/T184882) (owner: 10Thcipriani) [08:22:18] (03CR) 10Elukey: [V: 03+2 C: 03+2] Scap: git_fat -> git_binary_manager [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/404227 (https://phabricator.wikimedia.org/T184882) (owner: 10Thcipriani) [08:23:24] (03CR) 10Legoktm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/620934 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [08:24:01] (03PS1) 10Kormat: Remove unused sql.py and check_private_data.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/622310 (https://phabricator.wikimedia.org/T259516) [08:25:17] (03PS2) 10Kormat: Remove unused sql.py and check_private_data.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/622310 (https://phabricator.wikimedia.org/T259516) [08:28:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: collect prometheus metrics from alertmanager in metricsinfra [puppet] - 10https://gerrit.wikimedia.org/r/620760 (owner: 10BryanDavis) [08:31:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "no actual idea what this is doing, but I trust you :-)" [puppet] - 10https://gerrit.wikimedia.org/r/622238 (https://phabricator.wikimedia.org/T158216) (owner: 10BryanDavis) [08:31:37] 10Operations, 10GrowthExperiments-NewcomerTasks, 10Product-Infrastructure-Team-Backlog, 10serviceops: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10Joe) I have a few questions for you, before giving a 
refined recommendation: - do you think you'll need to de... [08:36:22] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "can you please split this change into smaller ones? That would make me more confident to merge and babysit, given I could easily identify " [puppet] - 10https://gerrit.wikimedia.org/r/622237 (https://phabricator.wikimedia.org/T251628) (owner: 10BryanDavis) [08:37:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "can't merge this until https://gerrit.wikimedia.org/r/c/operations/puppet/+/622237 is merged" [puppet] - 10https://gerrit.wikimedia.org/r/622238 (https://phabricator.wikimedia.org/T158216) (owner: 10BryanDavis) [08:39:52] (03CR) 10Marostegui: "How do you foresee the future development of check_private_data? So we just make changes to it on the puppet repo and ship it as it is now" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/622310 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [08:41:52] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [08:45:54] (03CR) 10Kormat: "> Patch Set 2:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/622310 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [08:48:02] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:49:41] (03CR) 10Filippo Giunchedi: "Yes a full PCC run in this case would be good to validate the change. 
A valid strategy would be to push the change for one/two exporter fo" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/621759 (owner: 10Dzahn) [08:50:09] !log re-activate eqord peering/transit - T259593 [08:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:13] T259593: Make eqord its own AS - https://phabricator.wikimedia.org/T259593 [08:52:36] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 554 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:53:56] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:55:50] 10Operations, 10Traffic, 10conftool, 10serviceops: confd's watch functionality appears to be partially broken when interacting with etcd 3.x - https://phabricator.wikimedia.org/T260889 (10Joe) For the record, the problem is more general, and also affects servers connecting to etcd 2.x - the watch functiona... [08:57:40] 10Operations, 10netops, 10Patch-For-Review: Make eqord its own AS - https://phabricator.wikimedia.org/T259593 (10ayounsi) [08:59:04] (03PS1) 10Giuseppe Lavagetto: confd: use -interval 3 as a lower bound in all datacenters [puppet] - 10https://gerrit.wikimedia.org/r/622314 (https://phabricator.wikimedia.org/T260889) [09:02:16] 10Operations, 10netops, 10Patch-For-Review: Make eqord its own AS - https://phabricator.wikimedia.org/T259593 (10ayounsi) 05Open→03Resolved All done and checked that: 1/ internal prefixes are properly exchange in all direction (eg. ulsfo sees eqiad via eqord) even if not always the active path 2/ externa... 
[09:03:09] (03CR) 10Ema: [C: 03+2] cache: remove 'backend_services' hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/622131 (https://phabricator.wikimedia.org/T222937) (owner: 10Ema) [09:05:40] (03CR) 10Giuseppe Lavagetto: [C: 03+2] confd: use -interval 3 as a lower bound in all datacenters [puppet] - 10https://gerrit.wikimedia.org/r/622314 (https://phabricator.wikimedia.org/T260889) (owner: 10Giuseppe Lavagetto) [09:06:38] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 52 probes of 556 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:07:33] (03CR) 10Ayounsi: [C: 03+2] Change eqord ASN to 65020 [homer/public] - 10https://gerrit.wikimedia.org/r/622268 (https://phabricator.wikimedia.org/T259593) (owner: 10Ayounsi) [09:07:56] (03Merged) 10jenkins-bot: Change eqord ASN to 65020 [homer/public] - 10https://gerrit.wikimedia.org/r/622268 (https://phabricator.wikimedia.org/T259593) (owner: 10Ayounsi) [09:18:23] (03CR) 10Gehel: "A few more comments." 
(034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 (owner: 10Ryan Kemper) [09:18:34] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 556 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:21:20] (03CR) 10Gehel: [C: 04-1] elasticsearch: verify all write queues are empty (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 (owner: 10Ryan Kemper) [09:22:18] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 554 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:26:12] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Test deployments with helmfile lint [deployment-charts] - 10https://gerrit.wikimedia.org/r/620934 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [09:26:46] (03PS1) 10Filippo Giunchedi: icinga_exporter: export problems only from Icinga active_host [puppet] - 10https://gerrit.wikimedia.org/r/622316 (https://phabricator.wikimedia.org/T258948) [09:28:10] 10Operations, 10Wikimedia-Mailing-lists: Disable google code in mailinglists - https://phabricator.wikimedia.org/T261084 (10Urbanecm) >>! In T261084#6407773, @Dzahn wrote: > So to know which one you want you basically just have to answer the question if you want to re-enable it later and still have the same li... 
[09:28:37] (03Merged) 10jenkins-bot: Test deployments with helmfile lint [deployment-charts] - 10https://gerrit.wikimedia.org/r/620934 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [09:29:44] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Remove the original X-Forwarded-Proto header if injecting https [deployment-charts] - 10https://gerrit.wikimedia.org/r/622118 (owner: 10Giuseppe Lavagetto) [09:30:02] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/24644/" [puppet] - 10https://gerrit.wikimedia.org/r/622316 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [09:30:33] _joe_: I'll merge your change too [09:30:42] (03PS1) 10Kormat: Add mypy to tox, and check in CI. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/622317 [09:30:45] <_joe_> ouch, yes, please [09:30:56] _joe_: 7f7e0554e3 that is [09:31:10] <_joe_> yes [09:31:25] <_joe_> I had a puppet-merge that I aborted by mistyping yes [09:31:32] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [09:31:41] <_joe_> 205? [09:32:00] (03Merged) 10jenkins-bot: Remove the original X-Forwarded-Proto header if injecting https [deployment-charts] - 10https://gerrit.wikimedia.org/r/622118 (owner: 10Giuseppe Lavagetto) [09:33:28] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [09:38:03] 10Operations: Make bpfcc-tools available fleet-wide - https://phabricator.wikimedia.org/T261193 (10MoritzMuehlenhoff) [09:40:14] (03PS1) 10Ayounsi: Apply netflow group to existing fpc X statements [homer/public] - 10https://gerrit.wikimedia.org/r/622318 (https://phabricator.wikimedia.org/T257392) [09:40:56] (03PS2) 10Ayounsi: Apply sampling group to existing fpc X statements [homer/public] - 10https://gerrit.wikimedia.org/r/622318 (https://phabricator.wikimedia.org/T257392) [09:42:22] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 556 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:45:51] !log Create missing table cx_notification_log on x1 wikishared T261190 [09:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:55] T261190: Create notification-log table in Production (wikishared) - https://phabricator.wikimedia.org/T261190 [09:51:08] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [09:53:04] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [09:55:05] 10Operations, 10netops, 10Patch-For-Review: automatically sample from all FPCs on core routers - https://phabricator.wikimedia.org/T257392 (10ayounsi) While working on that I noticed that the `apply group` only applies to existing `fpc X` statements, for example if they are configured with `pic` sub-section... [09:55:08] (03PS11) 10ZPapierski: Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) [09:56:46] (03CR) 10Ayounsi: [C: 04-1] "See comment on the task." [homer/public] - 10https://gerrit.wikimedia.org/r/622318 (https://phabricator.wikimedia.org/T257392) (owner: 10Ayounsi) [09:56:56] (03PS1) 10Giuseppe Lavagetto: termbox: bump chart to pick up changes in the envoy configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/622319 [09:58:23] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10jcrespo) [09:58:26] 10Operations, 10Documentation: Wikitech: update Bacula article - https://phabricator.wikimedia.org/T100954 (10jcrespo) 05Open→03Resolved Done months ago.
[09:58:41] (03PS1) 10Filippo Giunchedi: prometheus: update summary for IcingaServiceProblem alert [puppet] - 10https://gerrit.wikimedia.org/r/622320 (https://phabricator.wikimedia.org/T258948) [10:00:26] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 556 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:00:33] 10Operations: Setup an Offsite backup infrastructure - https://phabricator.wikimedia.org/T85278 (10jcrespo) 05Stalled→03Open This is being reimplemented doing parallel backup jobs into Codfw. This has started by now only with database backups and DatabasesCodfw pool on backup2001, other pools to follow at a... [10:00:44] (03PS4) 10Hnowlan: api-gateway: strip cookie headers from requests and responses. [deployment-charts] - 10https://gerrit.wikimedia.org/r/620311 (https://phabricator.wikimedia.org/T259296) [10:01:44] (03PS1) 10Vgutierrez: vcl: Use synthetic warning for DHE-RSA-AES128-SHA pageviews [puppet] - 10https://gerrit.wikimedia.org/r/622321 (https://phabricator.wikimedia.org/T258405) [10:01:45] (03CR) 10Marostegui: [C: 03+1] Remove unused sql.py and check_private_data.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/622310 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [10:02:20] (03CR) 10Giuseppe Lavagetto: [C: 03+2] termbox: bump chart to pick up changes in the envoy configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/622319 (owner: 10Giuseppe Lavagetto) [10:02:24] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: update summary for IcingaServiceProblem alert [puppet] - 10https://gerrit.wikimedia.org/r/622320 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [10:02:28] (03CR) 10Kormat: [C: 03+2] Remove unused sql.py and check_private_data.py [software/wmfmariadbpy] - 
10https://gerrit.wikimedia.org/r/622310 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [10:03:12] (03Merged) 10jenkins-bot: Remove unused sql.py and check_private_data.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/622310 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [10:04:16] (03Merged) 10jenkins-bot: termbox: bump chart to pick up changes in the envoy configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/622319 (owner: 10Giuseppe Lavagetto) [10:04:37] (03PS1) 10Jcrespo: Revert "mariadb-backups: Ignore backup freshness check for dbprov1* hosts" [puppet] - 10https://gerrit.wikimedia.org/r/622209 (https://phabricator.wikimedia.org/T260764) [10:04:52] (03CR) 10jerkins-bot: [V: 04-1] Revert "mariadb-backups: Ignore backup freshness check for dbprov1* hosts" [puppet] - 10https://gerrit.wikimedia.org/r/622209 (https://phabricator.wikimedia.org/T260764) (owner: 10Jcrespo) [10:06:13] (03PS1) 10Filippo Giunchedi: karma: match Icinga background colors for 'severity' and hide 'info' label [puppet] - 10https://gerrit.wikimedia.org/r/622322 (https://phabricator.wikimedia.org/T258948) [10:06:19] (03CR) 10Hnowlan: [C: 03+2] api-gateway: strip cookie headers from requests and responses. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/620311 (https://phabricator.wikimedia.org/T259296) (owner: 10Hnowlan) [10:06:21] (03PS2) 10Jcrespo: mariadb-backups: Setup dbprov1003 [puppet] - 10https://gerrit.wikimedia.org/r/621987 (https://phabricator.wikimedia.org/T257551) [10:06:23] (03PS2) 10Jcrespo: Revert "mariadb-backups: Ignore backup freshness check for dbprov1* hosts" [puppet] - 10https://gerrit.wikimedia.org/r/622209 (https://phabricator.wikimedia.org/T260764) [10:06:34] (03PS3) 10Jcrespo: Revert "mariadb-backups: Ignore backup freshness check for dbprov1* hosts" [puppet] - 10https://gerrit.wikimedia.org/r/622209 (https://phabricator.wikimedia.org/T260764) [10:08:03] (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb-backups: Ignore backup freshness check for dbprov1* hosts" [puppet] - 10https://gerrit.wikimedia.org/r/622209 (https://phabricator.wikimedia.org/T260764) (owner: 10Jcrespo) [10:08:19] (03CR) 10Marostegui: [C: 03+1] Add mypy to tox, and check in CI. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/622317 (owner: 10Kormat) [10:08:25] (03Merged) 10jenkins-bot: api-gateway: strip cookie headers from requests and responses. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/620311 (https://phabricator.wikimedia.org/T259296) (owner: 10Hnowlan) [10:08:48] (03CR) 10Filippo Giunchedi: [C: 03+2] karma: match Icinga background colors for 'severity' and hide 'info' label [puppet] - 10https://gerrit.wikimedia.org/r/622322 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [10:09:28] 10Operations: Updated java security policy in OpenJDK 8 u265 - https://phabricator.wikimedia.org/T261196 (10MoritzMuehlenhoff) [10:10:08] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 556 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:10:13] 10Operations, 10User-MoritzMuehlenhoff, 10User-jbond: Updated java security policy in OpenJDK 8 u252 - https://phabricator.wikimedia.org/T251493 (10MoritzMuehlenhoff) And right in time there's new changes in u265 :-) Opened T261196 to track these. [10:16:52] (03CR) 10Kormat: [C: 03+2] Add mypy to tox, and check in CI. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/622317 (owner: 10Kormat) [10:17:29] (03CR) 10Elukey: Multiple instances of msearch_daemon (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) (owner: 10ZPapierski) [10:18:14] (03Merged) 10jenkins-bot: Add mypy to tox, and check in CI. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/622317 (owner: 10Kormat) [10:23:18] !log removed fermium.wikimedia.org from debmonitor [10:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:38] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[10:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:39] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' . [10:28:39] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . [10:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:28] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [10:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:01] (03PS1) 10Muehlenhoff: Set U2F token expiry to 3650 on the production IDPs [puppet] - 10https://gerrit.wikimedia.org/r/622324 (https://phabricator.wikimedia.org/T258029) [10:37:11] !log import all binary packages from tesseract-ocr-lang into stretch-wikimedia/component/tesseract-410-bpo (T247422) [10:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:15] T247422: Update Tesseract on Toolforge to v4.1.0 - https://phabricator.wikimedia.org/T247422 [10:45:54] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 556 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:46:20] 10Operations, 10Mail: Create Group Aliases for itservices@ - https://phabricator.wikimedia.org/T259727 (10Aklapper) [10:49:32] (03PS1) 10Muehlenhoff: Add component/ceph [puppet] - 10https://gerrit.wikimedia.org/r/622326 (https://phabricator.wikimedia.org/T256877) [10:51:18] is WP down for anybody else, or is it just me ? [10:52:35] NotASpy: I can see WP [10:52:43] do you get any error message? 
[10:53:00] nope, just getting time out errors, not connecting at all [10:56:43] NotASpy: can you please follow https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue? [10:57:14] oh, you probably can't access that as well... [10:57:28] https://wikitech-static.wikimedia.org/wiki/Reporting_a_connectivity_issue should work, as it's outside the cluster :) [10:58:50] it's my ISP, Urbanecm [10:59:15] okay then NotASpy - just providing resources you might need when reporting :) [10:59:47] !log installing remaining libx11 security updates [10:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200825T1100). [11:00:04] kart_: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:04] I can remote access my office PC and WP is working there (different ISP) [11:00:16] I see NotASpy [11:00:18] \o/ [11:00:29] kart_: Hi :). I can deploy, if needed :) [11:00:51] Urbanecm: Thanks :) Please go ahead. I'll do testing. 
[11:01:06] ack kart_ :) [11:01:17] (And, having bad network too :/) [11:01:35] (03CR) 10Urbanecm: [C: 03+2] Enable ContentTranslation as a default tool in Assamese and Burmese WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622257 (https://phabricator.wikimedia.org/T258503) (owner: 10KartikMistry) [11:01:42] :( kart_ [11:02:28] (03Merged) 10jenkins-bot: Enable ContentTranslation as a default tool in Assamese and Burmese WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622257 (https://phabricator.wikimedia.org/T258503) (owner: 10KartikMistry) [11:02:54] looks like my ISP's IPv6 has fallen over [11:03:29] hehe [11:04:01] kart_: ready for you to test at mwdebug1002 [11:04:14] Testing. [11:05:40] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 556 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:06:20] Urbanecm: looks good. Go ahead. [11:06:27] syncing kart_ [11:07:52] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: d869e308492ee72cb3d1998b15409aa44a4af9c7: Enable ContentTranslation as a default tool in Assamese and Burmese WPs (T258503; T258505) (duration: 01m 00s) [11:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:57] T258503: Enable Content Translation in Burmese Wikipedia as a default tool - https://phabricator.wikimedia.org/T258503 [11:07:57] T258505: Enable Content Translation in Assamese Wikipedia as a default tool - https://phabricator.wikimedia.org/T258505 [11:08:12] kart_: should be live! [11:08:14] anything else? [11:08:44] Urbanecm: thanks a lot! [11:08:51] happy to help! 
[11:08:52] Nothing else from me :) [11:10:21] ack :) [11:11:16] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 554 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:11:36] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 556 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:11:46] (03CR) 10Effie Mouzeli: [C: 03+2] helmfile: add values for staging environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/621605 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [11:12:20] and we're back, Urbanecm [11:12:26] cool! [11:12:34] enjoy IPv6 again :) [11:12:37] can go block people now [11:13:18] lol [11:13:20] you could've used /etc/hosts and forced IPv4 too :) [11:14:08] (03Merged) 10jenkins-bot: helmfile: add values for staging environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/621605 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [11:15:24] (03PS1) 10Hnowlan: api-gateway: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/622328 [11:16:46] !log EU B&C done [11:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:12] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 554 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:21:52] (03CR) 10Muehlenhoff: [C: 03+2] Add component/ceph [puppet] - 10https://gerrit.wikimedia.org/r/622326 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [11:22:43] 
(03CR) 10Hnowlan: [C: 03+2] api-gateway: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/622328 (owner: 10Hnowlan) [11:23:50] (03PS1) 10Effie Mouzeli: push-notifications: enable TLS for all environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/622330 (https://phabricator.wikimedia.org/T256973) [11:25:23] !log Upgrade mysql on db1118 after MCR change [11:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:33] (03Merged) 10jenkins-bot: api-gateway: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/622328 (owner: 10Hnowlan) [11:25:55] (03CR) 10JMeybohm: [C: 03+2] push-notifications: enable TLS for all environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/622330 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [11:28:00] (03Merged) 10jenkins-bot: push-notifications: enable TLS for all environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/622330 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [11:29:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1118 MCR changes', diff saved to https://phabricator.wikimedia.org/P12337 and previous config saved to /var/cache/conftool/dbconfig/20200825-112859-marostegui.json [11:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:00] 10Operations: Make bpfcc-tools available fleet-wide - https://phabricator.wikimedia.org/T261193 (10ema) Thanks for opening this! When it comes to systemtap, user-space tracing requires the linux-headers package for the currently running kernel, plus the debug symbols for whatever software is under scrutiny (eg:... 
[11:31:28] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 556 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:32:03] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [11:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:59] (03CR) 10Volans: [C: 03+2] dns: fix corner case that should not happen [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/619982 (owner: 10Volans) [11:36:37] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [11:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:24] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 556 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:37:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1118 MCR changes', diff saved to https://phabricator.wikimedia.org/P12338 and previous config saved to /var/cache/conftool/dbconfig/20200825-113758-marostegui.json [11:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:52] RECOVERY - WDQS high update lag on wdqs1005 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.127e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:41:53] (03CR) 10ZPapierski: Multiple instances of msearch_daemon (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) (owner: 10ZPapierski) 
[11:42:27] (03PS1) 10Cparle: CAT blocklist update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622332 (https://phabricator.wikimedia.org/T260958) [11:45:11] (03PS1) 10Effie Mouzeli: push-notifications: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/622333 (https://phabricator.wikimedia.org/T256973) [11:45:46] (03CR) 10JMeybohm: [C: 03+2] push-notifications: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/622333 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [11:46:14] 10Operations, 10observability: nagios-nrpe-server in jessie not compatibile with Buster version - https://phabricator.wikimedia.org/T261198 (10MoritzMuehlenhoff) [11:47:50] (03CR) 10jerkins-bot: [V: 04-1] push-notifications: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/622333 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [11:48:48] <_joe_> uhm random failure from chartmuseum it seems [11:48:51] <_joe_> jayme: ^^ [11:49:16] we are on it [11:49:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1118 MCR changes', diff saved to https://phabricator.wikimedia.org/P12339 and previous config saved to /var/cache/conftool/dbconfig/20200825-114938-marostegui.json [11:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:19] (03CR) 10Matthias Mullie: [C: 03+2] CAT blocklist update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622332 (https://phabricator.wikimedia.org/T260958) (owner: 10Cparle) [11:51:04] (03Merged) 10jenkins-bot: CAT blocklist update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622332 (https://phabricator.wikimedia.org/T260958) (owner: 10Cparle) [11:52:49] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 554 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:56:40] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] "Temporary error in CI, rerun looks okay" [deployment-charts] - 10https://gerrit.wikimedia.org/r/622333 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [11:59:15] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [11:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1118 MCR changes', diff saved to https://phabricator.wikimedia.org/P12340 and previous config saved to /var/cache/conftool/dbconfig/20200825-120211-marostegui.json [12:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1135 MCR change', diff saved to https://phabricator.wikimedia.org/P12341 and previous config saved to /var/cache/conftool/dbconfig/20200825-120708-marostegui.json [12:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:54] 10Operations, 10observability: nagios-nrpe-server in jessie not compatibile with Buster version - https://phabricator.wikimedia.org/T261198 (10fgiunchedi) >>! In T261198#6408841, @MoritzMuehlenhoff wrote: > That rings a bell, we've seen similar issues before: https://phabricator.wikimedia.org/T157853 > >> The... 
[12:10:19] !log installing ruby-json security updates [12:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:38] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 554 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:13:01] (03PS2) 10Vgutierrez: vcl: Use synthetic warning for DHE-RSA-AES128-SHA pageviews [puppet] - 10https://gerrit.wikimedia.org/r/622321 (https://phabricator.wikimedia.org/T258405) [12:13:51] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) [12:19:34] 10Operations, 10observability: nagios-nrpe-server in jessie not compatibile with Buster version - https://phabricator.wikimedia.org/T261198 (10fgiunchedi) >>! In T261198#6408898, @fgiunchedi wrote: >>>! In T261198#6408841, @MoritzMuehlenhoff wrote: >> That rings a bell, we've seen similar issues before: https:... [12:19:37] (03PS12) 10ZPapierski: Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) [12:20:38] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) [12:25:07] (03PS13) 10ZPapierski: Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) [12:26:50] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) Push-notifications is up and running in staging. Our next step is to perform the LVS steps and expose the ap... 
[12:29:22] 10Operations, 10Mail: Create Group Aliases for itservices@ - https://phabricator.wikimedia.org/T259727 (10jijiki) p:05Triage→03Medium [12:29:24] 10Operations, 10observability: nagios-nrpe-server in jessie not compatibile with Buster version - https://phabricator.wikimedia.org/T261198 (10jijiki) p:05Triage→03Medium [12:35:31] !log imported ceph packages from stretch-backports to component/ceph T256877 [12:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:35] T256877: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 [12:35:49] (03PS1) 10Muehlenhoff: Switch openstack::serverpackages::rocky::stretch to component/ceph [puppet] - 10https://gerrit.wikimedia.org/r/622340 (https://phabricator.wikimedia.org/T256877) [12:39:16] !log test nagios-nrpe-server with dh 2048 on scb2001 - T261198 [12:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:20] T261198: nagios-nrpe-server in jessie not compatibile with Buster version - https://phabricator.wikimedia.org/T261198 [12:39:39] !log alter table sites on s6, directly on the primary master T260476 [12:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:43] T260476: Extend sites.site_global_key on WMF production - https://phabricator.wikimedia.org/T260476 [12:42:50] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 556 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:45:45] 10Operations, 10observability: nagios-nrpe-server in jessie not compatibile with Buster version - https://phabricator.wikimedia.org/T261198 (10fgiunchedi) >>! In T261198#6409003, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/St-hJXQBv7KcG9M... 
[12:45:50] !log Update MySQL on db1111 after MCR change [12:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:40] Urbanecm: you can revoke the bot's flag in arwiki so it gets throttled [12:48:46] What do you think? [12:48:46] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 556 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:48:58] Amir1: which bot do you mean? The one I locked yesterday? [12:49:13] yup [12:49:38] well, I can, but that would mean its edits would flood the RC [12:49:59] 10Operations, 10Discovery-Search (Current work): wdqs1009 has puppet changes on each run - https://phabricator.wikimedia.org/T260123 (10Gehel) 05Open→03Resolved [12:50:30] Urbanecm: since it's locked it shouldn't be able to edit [12:50:38] Maybe I'm missing something obvious [12:50:57] Amir1: well, since it's locked, it can't login [12:51:08] or are you saying it still tries to login, even it's failing? [12:51:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1111 MCR changes', diff saved to https://phabricator.wikimedia.org/P12343 and previous config saved to /var/cache/conftool/dbconfig/20200825-125108-marostegui.json [12:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:27] Urbanecm: yup [12:51:31] gotcha [12:51:36] maybe we should check if it tries [12:51:41] (03PS14) 10ZPapierski: Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) [12:51:46] let me look [12:52:22] Amir1: there is an issue with the throttling through. 
I'm not seeing throttling in https://github.com/wikimedia/mediawiki/blob/master/includes/api/ApiLogin.php at all, but I might be blind, or looking in the wrong place [12:52:59] maybe it's centralized in auth manager? [12:53:02] maybe [12:53:23] (03PS1) 10DCausse: Use dedicated schedules for the various wikidata ttl dumps [puppet] - 10https://gerrit.wikimedia.org/r/622342 (https://phabricator.wikimedia.org/T261204) [12:54:57] Amir1: I'm not seeing any call to pingLimiter from any code that seems to be relevant with the auth process [12:55:14] :( [12:56:07] !log upgrade nagios-nrpe-server on scb2* and mwlog* - T261198 [12:56:12] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 52 probes of 554 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:15] T261198: nagios-nrpe-server in jessie not compatibile with Buster version - https://phabricator.wikimedia.org/T261198 [12:57:12] Amir1: there's https://github.com/wikimedia/mediawiki/blob/master/includes/auth/ThrottlePreAuthenticationProvider.php, but that seems to work for incorrect attempts mainly [12:57:44] does it count if the account is blocked? [12:58:16] 10Operations, 10observability: nagios-nrpe-server in jessie not compatibile with Buster version - https://phabricator.wikimedia.org/T261198 (10fgiunchedi) >>! In T261198#6409081, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/z9-wJXQBv7KcG9M... 
[12:58:29] 10Operations, 10observability, 10User-fgiunchedi: nagios-nrpe-server in jessie not compatibile with Buster version - https://phabricator.wikimedia.org/T261198 (10fgiunchedi) [12:58:42] checking Amir1 [13:02:08] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 554 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:06:46] 10Operations: Make bpfcc-tools available fleet-wide - https://phabricator.wikimedia.org/T261193 (10CDanis) p:05Triage→03Medium Thanks for opening this! Really happy to see it (and was also talking to @wkandek just yesterday about making bpfcc generally available in the fleet). +1 to the wrapper idea. In m... [13:08:03] (03CR) 10Elukey: Multiple instances of msearch_daemon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) (owner: 10ZPapierski) [13:13:47] (03PS1) 10Giuseppe Lavagetto: Revert "termbox/staging: rollback the configuration, it clearly doesn't work." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/622212 [13:15:34] (03PS3) 10Jcrespo: mariadb-backups: Setup dbprov1003 [puppet] - 10https://gerrit.wikimedia.org/r/621987 (https://phabricator.wikimedia.org/T257551) [13:15:54] (03CR) 10Elukey: Multiple instances of msearch_daemon (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) (owner: 10ZPapierski) [13:16:56] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Setup dbprov1003 [puppet] - 10https://gerrit.wikimedia.org/r/621987 (https://phabricator.wikimedia.org/T257551) (owner: 10Jcrespo) [13:17:13] !log installing firejail security updates on remaining mw* servers in eqiad [13:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1111 MCR changes', diff saved to https://phabricator.wikimedia.org/P12344 and previous config saved to /var/cache/conftool/dbconfig/20200825-132027-marostegui.json [13:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:38] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001 job=burrow partition={2,3} prometheus=ops site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=tha [13:20:38] ogging-eqiad&var-topic=All&var-consumer_group=All [13:21:22] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10Trizek-WMF) I plan to send the announcement to communities tomorrow. At the moment, https://wikitech.wikimedia.org/wiki/Switch_Datacente... 
[13:21:52] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 53 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:22:02] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "termbox/staging: rollback the configuration, it clearly doesn't work." [deployment-charts] - 10https://gerrit.wikimedia.org/r/622212 (owner: 10Giuseppe Lavagetto) [13:24:23] (03CR) 10jerkins-bot: [V: 04-1] Revert "termbox/staging: rollback the configuration, it clearly doesn't work." [deployment-charts] - 10https://gerrit.wikimedia.org/r/622212 (owner: 10Giuseppe Lavagetto) [13:27:56] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/621779 (owner: 10Dzahn) [13:31:36] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is OK: (C)100 gt (W)80 gt 76.27 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [13:32:19] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:34:32] (03CR) 10Jbond: [C: 03+1] "LGTM, thx" [puppet] - 10https://gerrit.wikimedia.org/r/621758 (owner: 10Dzahn) [13:34:34] (03PS1) 10Filippo Giunchedi: alertmanager: remove inhibit rules until we need them [puppet] - 10https://gerrit.wikimedia.org/r/622348 (https://phabricator.wikimedia.org/T258948) [13:35:37] (03PS1) 10Kormat: Tidy up import ordering using isort. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/622349 [13:37:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1111 MCR changes', diff saved to https://phabricator.wikimedia.org/P12345 and previous config saved to /var/cache/conftool/dbconfig/20200825-133734-marostegui.json [13:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:42] (03PS1) 10Muehlenhoff: Retire stub firejail code in service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/622350 [13:38:43] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: remove inhibit rules until we need them [puppet] - 10https://gerrit.wikimedia.org/r/622348 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [13:39:18] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat: DBA python layout - https://phabricator.wikimedia.org/T259516 (10Kormat) [13:39:42] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:39:50] (03PS3) 10Jbond: cookbook sre.puppet.renew-cert: add cookbook to renew a puppet cert [cookbooks] - 10https://gerrit.wikimedia.org/r/621701 (https://phabricator.wikimedia.org/T260110) [13:39:59] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat: DBA python layout - https://phabricator.wikimedia.org/T259516 (10Kormat) 05Open→03Resolved Mission accomplished. 
[13:40:05] (03CR) 10Jbond: "updated thx" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/621701 (https://phabricator.wikimedia.org/T260110) (owner: 10Jbond) [13:42:32] (03PS15) 10ZPapierski: Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) [13:43:58] (03PS1) 10Filippo Giunchedi: prometheus: fix 'summary' annotation for IcingaServiceProblem [puppet] - 10https://gerrit.wikimedia.org/r/622352 (https://phabricator.wikimedia.org/T258948) [13:44:14] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:45:25] (03CR) 10Volans: [C: 03+1] "Thx, LGTM!" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/621701 (https://phabricator.wikimedia.org/T260110) (owner: 10Jbond) [13:46:09] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [13:46:43] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: fix 'summary' annotation for IcingaServiceProblem [puppet] - 10https://gerrit.wikimedia.org/r/622352 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [13:47:04] (03CR) 10Jbond: "LGTM barring comments from filippo. A general comment however, Stdlib::Host may be more preferable to Stdlib::Fqdn. 
the former also allo" [puppet] - 10https://gerrit.wikimedia.org/r/621759 (owner: 10Dzahn) [13:47:16] (03PS1) 10JMeybohm: Use include instead of template to include defines [deployment-charts] - 10https://gerrit.wikimedia.org/r/622354 [13:47:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'fully repool db1111 MCR changes', diff saved to https://phabricator.wikimedia.org/P12346 and previous config saved to /var/cache/conftool/dbconfig/20200825-134736-marostegui.json [13:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:02] (03PS16) 10ZPapierski: Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) [13:49:05] (03PS1) 10ZPapierski: Remove unnecessary daemon definitions [puppet] - 10https://gerrit.wikimedia.org/r/622355 (https://phabricator.wikimedia.org/T260305) [13:49:47] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/622324 (https://phabricator.wikimedia.org/T258029) (owner: 10Muehlenhoff) [13:51:52] 10Operations, 10serviceops: assess and re-evaluate 'weight' settings of appservers in codfw - https://phabricator.wikimedia.org/T261159 (10Joe) I would rather try to elaborate starting from what eqiad does with similar hardware. The api cluster has, excluding servers to decom 65 servers, distributed as follo... 
[13:52:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1114 for MCR change', diff saved to https://phabricator.wikimedia.org/P12347 and previous config saved to /var/cache/conftool/dbconfig/20200825-135248-marostegui.json [13:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:53] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10RLazarus) Yeah, sorry that's later than I expected -- we're meeting today to confirm the timing details and I'll post the update immediate... [13:54:54] (03PS17) 10ZPapierski: Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) [13:55:52] (03CR) 10Jbond: cookbook sre.puppet.renew-cert: add cookbook to renew a puppet cert (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/621701 (https://phabricator.wikimedia.org/T260110) (owner: 10Jbond) [13:55:53] (03CR) 10Jbond: [C: 03+2] cookbook sre.puppet.renew-cert: add cookbook to renew a puppet cert [cookbooks] - 10https://gerrit.wikimedia.org/r/621701 (https://phabricator.wikimedia.org/T260110) (owner: 10Jbond) [13:57:23] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:57:45] (03PS18) 10ZPapierski: Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) [13:58:37] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001 job=burrow partition={2,3} prometheus=ops site=eqiad topic={udp_localhost-info,udp_localhost-warning} 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=tha [13:58:37] ogging-eqiad&var-topic=All&var-consumer_group=All [14:00:47] (03PS2) 10ZPapierski: Remove unnecessary daemon definitions [puppet] - 10https://gerrit.wikimedia.org/r/622355 (https://phabricator.wikimedia.org/T260305) [14:01:51] (03CR) 10ZPapierski: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/24651/" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) (owner: 10ZPapierski) [14:02:30] (03PS1) 10Volans: junos: colorize configuration diff [software/homer] - 10https://gerrit.wikimedia.org/r/622356 (https://phabricator.wikimedia.org/T260769) [14:06:31] !log andrew@deploy1001 Started deploy [horizon/deploy@7a3221d]: add hostname checking --bug T207538 [14:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:21] !log andrew@deploy1001 Finished deploy [horizon/deploy@7a3221d]: add hostname checking --bug T207538 (duration: 03m 50s) [14:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:10] 10Operations, 10observability, 10Patch-For-Review, 10good first task: nagios-nrpe-server.service: systemd unit references path below legacy directory /var/run/ - https://phabricator.wikimedia.org/T252990 (10Southparkfan) >>! In T252990#6408248, @ema wrote: > This still uses the legacy `/var/run` though, he... [14:25:44] ema: ^ sorry for making stuff complicated :) [14:26:00] !log disable IPv6 BGP to Init7 in knams [14:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:16] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/622212 (owner: 10Giuseppe Lavagetto) [14:29:44] (03Merged) 10jenkins-bot: Revert "termbox/staging: rollback the configuration, it clearly doesn't work." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/622212 (owner: 10Giuseppe Lavagetto) [14:32:05] doing some cables work in c5,c6,c7 and c8 in case you see any mgmt interface going down [14:32:12] !log volker-e@deploy1001 Started deploy [design/style-guide@e3fda83]: Deploy design/style-guide: [14:32:14] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Seve Kim - https://phabricator.wikimedia.org/T261208 (10sdkim) [14:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:17] !log volker-e@deploy1001 Finished deploy [design/style-guide@e3fda83]: Deploy design/style-guide: (duration: 00m 05s) [14:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:29] 10Operations, 10ops-eqiad, 10DC-Ops: decomission oresrdb100[12] - https://phabricator.wikimedia.org/T254238 (10Cmjohnson) [14:34:30] 10Operations, 10ops-eqiad, 10DC-Ops: decomission oresrdb100[12] - https://phabricator.wikimedia.org/T254238 (10Cmjohnson) 05Open→03Resolved [14:36:44] 10Operations, 10observability, 10Patch-For-Review, 10good first task: nagios-nrpe-server.service: systemd unit references path below legacy directory /var/run/ - https://phabricator.wikimedia.org/T252990 (10MoritzMuehlenhoff) >>! In T252990#6409276, @Southparkfan wrote: > A change from /var/run to /run was... [14:36:46] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . 
[14:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:41] PROBLEM - Host ganeti2013.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:49:43] PROBLEM - Host db2127.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:49:43] PROBLEM - Host ganeti2014.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:55:01] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 52 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:55:35] PROBLEM - MariaDB Replica SQL: analytics_meta on db1108 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error executing row event: Table superset_staging.ab_user doesnt exist https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:55:41] RECOVERY - Host ganeti2013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.94 ms [14:55:43] RECOVERY - Host db2127.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.93 ms [14:55:43] RECOVERY - Host ganeti2014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.04 ms [14:56:10] elukey ^ [14:56:24] !log installing take security updates on stretch [14:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:34] !log installing rake security updates on stretch [14:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:37] elukey: around? 
[14:56:45] ah, sorry didn't see jaime pinged you already [14:56:57] I am yes, but I didn't do anything to superset's db [14:56:58] I am going to a meeting now, but I can check later if you need help [14:57:03] (03CR) 10ArielGlenn: "I don't have a problem with this as long as everyone bears in mind that multiple (maybe all three) of these jobs could end up running at t" [puppet] - 10https://gerrit.wikimedia.org/r/622342 (https://phabricator.wikimedia.org/T261204) (owner: 10DCausse) [14:57:04] yes yes no problem [14:57:14] I can help at least debugging [14:57:50] ah wait I think I may now what's happening [14:58:00] (03PS4) 10Southparkfan: nagios-nrpe-server systemd unit: use /run for PID files + add new versions for os_version [puppet] - 10https://gerrit.wikimedia.org/r/621967 (https://phabricator.wikimedia.org/T252990) [14:58:35] so we have a database called superset_staging on an-coord1001, that is not one that we need to replicate [14:58:45] elukey: ping if you need help [14:58:53] yep yep thanks :) [14:58:57] elukey: we'd need to set a replication filter then [14:59:22] possibly yes, we can do it later when you people have time [15:00:25] so something like [15:00:27] can you do the logical work, as in, get a list of db that have to be skipped or something [15:00:40] or tables [15:00:42] stop slaves; SET GLOBAL replicate_ignore_db=superset_staging; start slave; ? 
[15:00:56] yeah, but also set it to my.cnf so it doesn't get lost on restart [15:00:58] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:01:20] I strongly recommend against ignore, I recommend ignore_wild_db [15:01:38] I can show you examples on labsdb [15:01:48] elukey: you can check modules/profile/templates/mariadb/mysqld_config/sanitarium_multiinstance.my.cnf.erb as an example [15:02:13] * marostegui goes to the meeting for real [15:02:33] jynus: sure thanks! [15:02:51] so replicate-wild-ignore-table = db.% [15:03:41] if you send a patch I can review it if necessary [15:03:56] ah so replicate-wild-ignore-table = superset_staging.% [15:04:09] the difference is that it is ignored on application [15:04:19] where it is less probablility of going bad [15:04:30] binlog gets untouched [15:04:47] ah right [15:04:52] e.g. ignore ignores based on the current db [15:05:15] can I try to set it dynamically to see if it works or better to directly apply the my.cnf patch and restart mariadb? 
[15:05:17] so if there is cross-db updates, things will break [15:05:25] elukey: yes [15:05:29] (03CR) 10Ebernhardson: Multiple instances of msearch_daemon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) (owner: 10ZPapierski) [15:05:34] although it requires stopping replication [15:05:43] yep yep no problem, trying now [15:05:53] lets move discussion to -database to not spam here [15:05:56] if needed [15:06:16] it has to be applied to the replica, not the master [15:06:27] so it is also safer because of that [15:06:28] (03CR) 10DCausse: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/622342 (https://phabricator.wikimedia.org/T261204) (owner: 10DCausse) [15:06:30] yep yep let's move to database [15:10:51] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:15:03] RECOVERY - MariaDB Replica SQL: analytics_meta on db1108 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:15:08] \o/ [15:19:21] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 49 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:21:05] (03PS9) 10Bstorm: wikireplicas: refactor to eliminate confusing "labsdb" naming [puppet] - 10https://gerrit.wikimedia.org/r/621618 (https://phabricator.wikimedia.org/T260843) [15:22:54] !log testing upcoming Scap release on beta [15:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:15] 10Operations, 10observability, 10Patch-For-Review, 10good first task: 
nagios-nrpe-server.service: systemd unit references path below legacy directory /var/run/ - https://phabricator.wikimedia.org/T252990 (10Southparkfan) @MoritzMuehlenhoff understood. Patch set 4 will use the custom unit (with /run) on sys... [15:31:30] (03PS1) 10Elukey: Exclude superset_staging from the db1108's meta replication [puppet] - 10https://gerrit.wikimedia.org/r/622382 [15:32:55] (03CR) 10Elukey: "pcc looks good https://puppet-compiler.wmflabs.org/compiler1002/24652/db1108.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/622382 (owner: 10Elukey) [15:33:52] (03CR) 10Jcrespo: [C: 03+1] Exclude superset_staging from the db1108's meta replication [puppet] - 10https://gerrit.wikimedia.org/r/622382 (owner: 10Elukey) [15:34:35] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:35:09] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 52 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:36:23] (03PS5) 10Lucas Werkmeister (WMDE): Add new slow-bot group for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618245 (https://phabricator.wikimedia.org/T258354) (owner: 10Tobias Andersson) [15:37:12] (03CR) 10Elukey: [C: 03+2] Exclude superset_staging from the db1108's meta replication [puppet] - 10https://gerrit.wikimedia.org/r/622382 (owner: 10Elukey) [15:37:55] (03CR) 10Lucas Werkmeister (WMDE): Add new slow-bot group for Wikidata (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618245 (https://phabricator.wikimedia.org/T258354) (owner: 10Tobias Andersson) [15:39:49] !log dcausse@deploy1001 
Started deploy [wikimedia/discovery/analytics@cbf2f9d]: Add wikidata ttl import [15:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:27] !log dcausse@deploy1001 Finished deploy [wikimedia/discovery/analytics@cbf2f9d]: Add wikidata ttl import (duration: 01m 38s) [15:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:03] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:47:40] !log restart mariadb@analytics_meta on db1108 to apply a replication filter (exclude superset_staging database from replication) [15:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:57] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 53 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:57:15] (03CR) 10Bstorm: [C: 03+2] wikireplicas: refactor to eliminate confusing "labsdb" naming [puppet] - 10https://gerrit.wikimedia.org/r/621618 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [16:00:01] !log repool wdqs1005 - catched up on lag [16:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] godog and _joe_: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200825T1600). 
[16:00:20] ryankemper: ^^^ [16:00:31] <_joe_> no changes today :) [16:01:53] (03CR) 10Dduvall: [C: 03+2] Branch commit for wmf/1.36.0-wmf.6 [core] (wmf/1.36.0-wmf.6) - 10https://gerrit.wikimedia.org/r/622250 (https://phabricator.wikimedia.org/T257974) (owner: 10TrainBranchBot) [16:02:53] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:06:09] 1.36.0-wmf.6 was branched at 8c26ce9746bd57c8c7801c4c99b60cbb0cbc0703 for T257974 [16:06:10] T257974: 1.36.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T257974 [16:07:28] !log dcausse@deploy1001 Started deploy [wikimedia/discovery/analytics@ae6dd8d]: test: Add wikidata ttl import [16:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:22] !log dcausse@deploy1001 Finished deploy [wikimedia/discovery/analytics@ae6dd8d]: test: Add wikidata ttl import (duration: 00m 54s) [16:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:32] (03PS1) 10Bstorm: wikireplicas: fix path typo for the heartbeat-views file [puppet] - 10https://gerrit.wikimedia.org/r/622387 (https://phabricator.wikimedia.org/T260843) [16:10:12] (03CR) 10Jcrespo: [C: 03+1] "He he, as predicted :-)" [puppet] - 10https://gerrit.wikimedia.org/r/622387 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [16:10:45] (03CR) 10Bstorm: [C: 03+2] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/622387 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [16:12:45] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 52 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:13:00] 10Operations, 10DC-Ops, 10fundraising-tech-ops: RAID controller failing on frdb1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T261221 (10Jgreen) [16:13:33] 10Operations, 10DC-Ops, 10fundraising-tech-ops: RAID controller failing on frdb1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T261221 (10Jgreen) [16:21:20] 10Operations, 10DC-Ops, 10fundraising-tech-ops: RAID controller failing on frdb1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T261221 (10Jgreen) [16:21:55] !log restart logstash on logstash1007 -- gc duration outlier [16:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:20] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:TBD) rack/setup/install rows C and D new PDUs - https://phabricator.wikimedia.org/T253694 (10wiki_willy) [16:24:05] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:25:07] (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.6 [core] (wmf/1.36.0-wmf.6) - 10https://gerrit.wikimedia.org/r/622250 (https://phabricator.wikimedia.org/T257974) (owner: 10TrainBranchBot) [16:26:09] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10Nuria) a:03fdans [16:30:39] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:36:29] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [16:40:35] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:41:38] (03PS2) 10Herron: logstash: add #o11y tag to logstash alert descriptions [puppet] - 10https://gerrit.wikimedia.org/r/622161 [16:41:57] !log dcausse@deploy1001 Started deploy [wikimedia/discovery/analytics@89b4f74]: test: Add wikidata ttl import [16:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:46] !log dcausse@deploy1001 Finished deploy [wikimedia/discovery/analytics@89b4f74]: test: Add wikidata ttl import (duration: 00m 49s) [16:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:00] (03CR) 10Herron: [C: 03+2] logstash: add #o11y tag to logstash alert descriptions [puppet] - 10https://gerrit.wikimedia.org/r/622161 (owner: 10Herron) [16:47:22] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:54:43] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:59:13] (03CR) 10Cwhite: prometheus: add apache2 es-exporter config (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/621597 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [17:00:04] halfak and accraze: May I have your attention please! Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200825T1700) [17:00:50] (03CR) 10Cwhite: profile: install and configure statsd_exporter and retarget statsv (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615269 (https://phabricator.wikimedia.org/T180105) (owner: 10Cwhite) [17:01:31] !log imported logstash, elasticsearch, and kibana 7.9.0 -oss packages into buster-wikimedia thirdparty/elastic79 [17:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:48] !log dduvall@deploy1001 Pruned MediaWiki: 1.36.0-wmf.3 (duration: 19m 12s) [17:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:22] (03PS1) 10Herron: logstash: set elk7 cluster elasticsearch version to 7.9 [puppet] - 10https://gerrit.wikimedia.org/r/622395 [17:05:36] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:07:06] 10Operations, 10ops-codfw, 10netops: (Need by: ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (10Papaul) [17:08:12] !log dduvall@deploy1001 Pruned MediaWiki: 1.36.0-wmf.4 (duration: 01m 40s) [17:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:32] 10Operations, 10ops-codfw, 10netops: (Need by: ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (10Papaul) @Jgreen is it okay for me to replace fmsw on the 27th (start time 9:30am end time 11:30am) CT [17:09:59] 10Operations, 10ops-eqiad, 10Analytics-Clusters: (Need By: TBD) upgrade ram in an-master100[12] - 
https://phabricator.wikimedia.org/T259162 (10elukey) @Jclark-ctr I'd say 5/10 minutes for each host to do proper failover, and the host can stay down even for half an hour but better if less of course :) [17:10:37] (03PS1) 10Dduvall: testwikis wikis to 1.36.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622396 [17:10:39] (03CR) 10Dduvall: [C: 03+2] testwikis wikis to 1.36.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622396 (owner: 10Dduvall) [17:11:29] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622396 (owner: 10Dduvall) [17:12:38] 10Operations, 10DC-Ops, 10fundraising-tech-ops: RAID controller failing on frdb1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T261221 (10Jgreen) [17:13:00] 10Operations, 10DC-Ops, 10fundraising-tech-ops: RAID controller failing on frdb1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T261221 (10Jgreen) [17:13:29] (03CR) 10BryanDavis: [C: 04-1] "Will split this up" [puppet] - 10https://gerrit.wikimedia.org/r/622237 (https://phabricator.wikimedia.org/T251628) (owner: 10BryanDavis) [17:13:31] 10Operations, 10DC-Ops, 10fundraising-tech-ops: RAID controller failing on frdb1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T261221 (10Jgreen) [17:17:00] !log dduvall@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.6 [17:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:58] PROBLEM - PHP7 rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:22:54] RECOVERY - PHP7 rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 8.082 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:28:38] !log dcausse@deploy1001 Started deploy [wikimedia/discovery/analytics@bc2f7f1]: test: Add wikidata ttl import 
[17:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:31] !log dcausse@deploy1001 Finished deploy [wikimedia/discovery/analytics@bc2f7f1]: test: Add wikidata ttl import (duration: 01m 52s) [17:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:12] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:32:52] 10Operations, 10DC-Ops, 10fundraising-tech-ops: RAID controller failing on frdb1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T261221 (10Jgreen) [17:33:18] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/622395 (owner: 10Herron) [17:33:36] (03CR) 10Herron: [C: 03+2] logstash: set elk7 cluster elasticsearch version to 7.9 [puppet] - 10https://gerrit.wikimedia.org/r/622395 (owner: 10Herron) [17:34:57] 10Operations, 10DC-Ops, 10fundraising-tech-ops: RAID controller failing on frdb1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T261221 (10Jgreen) [17:35:31] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10RLazarus) Done: https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Schedule_for_2020_switch [17:37:45] (03PS1) 10Herron: profile::elasticserach: add version 7.9 to enum [puppet] - 10https://gerrit.wikimedia.org/r/622402 [17:40:03] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1001/24653/" [puppet] - 10https://gerrit.wikimedia.org/r/622402 (owner: 10Herron) [17:40:33] 10Operations, 10DC-Ops, 10fundraising-tech-ops: RAID controller failing on frdb1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T261221 (10Jgreen) [17:40:51] (03CR) 10Herron: [C: 03+2] 
profile::elasticserach: add version 7.9 to enum [puppet] - 10https://gerrit.wikimedia.org/r/622402 (owner: 10Herron) [17:41:44] (03CR) 10Dzahn: [V: 03+1 C: 03+2] zuul: add data types, replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/621758 (owner: 10Dzahn) [17:47:59] (03Abandoned) 10Dzahn: mediawiki::fonts: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/621374 (owner: 10Dzahn) [17:48:15] (03PS2) 10Dzahn: webperf: add data types to profiles [puppet] - 10https://gerrit.wikimedia.org/r/621756 [17:49:12] (03CR) 10Dzahn: [C: 03+2] webperf: add data types to profiles [puppet] - 10https://gerrit.wikimedia.org/r/621756 (owner: 10Dzahn) [17:58:58] !log dduvall@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.6 (duration: 41m 58s) [17:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:31] 10Operations, 10DC-Ops, 10fundraising-tech-ops: RAID controller failing on frdb1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T261221 (10Jgreen) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200825T1800) [18:00:57] 10Operations, 10DC-Ops, 10fundraising-tech-ops: RAID controller failing on frdb1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T261221 (10Jgreen) Note that fr-tech-ops intents to schedule a firmware upgrade, but our concern is that the upgrade is likely surface an underlying hardware issue rather... 
[18:05:44] (03CR) 10Dzahn: ldap: remove jessie support (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/621372 (owner: 10Dzahn) [18:06:05] (03PS2) 10Dzahn: ldap: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/621372 [18:08:22] (03Abandoned) 10Dzahn: service::catalog: switch ORES to encryption: true [puppet] - 10https://gerrit.wikimedia.org/r/621564 (owner: 10Dzahn) [18:08:24] (03CR) 10Ryan Kemper: [C: 03+2] "Shipping this" [puppet] - 10https://gerrit.wikimedia.org/r/618954 (owner: 10DCausse) [18:10:42] (03CR) 10Ryan Kemper: "`sudo puppet-merge` successful" [puppet] - 10https://gerrit.wikimedia.org/r/618954 (owner: 10DCausse) [18:11:45] (03PS1) 10Ahmon Dancy: Improve error message if wikiversions.php has wrong format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622408 [18:11:58] (03CR) 10Dzahn: prometheus: hiera() -> lookup(), add data type for prometheus_nodes (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/621759 (owner: 10Dzahn) [18:12:37] (03CR) 10jerkins-bot: [V: 04-1] Improve error message if wikiversions.php has wrong format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622408 (owner: 10Ahmon Dancy) [18:13:57] (03PS2) 10Ahmon Dancy: Improve error message if wikiversions.php has wrong format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622408 [18:15:43] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10Trizek-WMF) >>! In T244808#6409998, @RLazarus wrote: > Done: https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Schedule_for_2020_switc... 
[18:16:10] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10Trizek-WMF) [18:16:24] (03CR) 10Ryan Kemper: [C: 03+2] "Newest PCC looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/619289 (https://phabricator.wikimedia.org/T251515) (owner: 10ZPapierski) [18:19:34] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload [18:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:42] (03CR) 10Dzahn: "@hashar How long should we wait before releases1001 is deleted?" [puppet] - 10https://gerrit.wikimedia.org/r/621090 (https://phabricator.wikimedia.org/T260742) (owner: 10Dzahn) [18:34:23] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:34:34] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 57 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:34:36] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:34:38] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [18:40:16] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:40:28] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:40:32] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:43:38] (03PS3) 10CDanis: package_builder: add support for 'sloppy' backports [puppet] - 10https://gerrit.wikimedia.org/r/622190 [18:45:07] (03CR) 10CDanis: "https://puppet-compiler.wmflabs.org/compiler1001/24655/deneb.codfw.wmnet/index.html" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/622190 (owner: 10CDanis) [18:46:00] that is an odd set of alerts to be coincident [18:50:08] (03PS4) 10CDanis: package_builder: add support for 'sloppy' backports [puppet] - 10https://gerrit.wikimedia.org/r/622190 (https://phabricator.wikimedia.org/T261193) [18:50:40] (03PS6) 10Dzahn: prometheus: hiera() -> lookup(), add data type for prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/621759 [18:53:29] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/621372 (owner: 10Dzahn) [18:56:24] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 53 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:00:04] marxarelli and longma: (Dis)respected human, 
time to deploy Mediawiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200825T1900). Please do the needful. [19:01:05] (03CR) 10Muehlenhoff: "The patch is technically correct, but stretch-backports will be removed from the Debian mirrors soon (along with the sloppy- counterpart)" [puppet] - 10https://gerrit.wikimedia.org/r/622190 (https://phabricator.wikimedia.org/T261193) (owner: 10CDanis) [19:02:22] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:02:22] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:02:39] !log installing Java security updates on elastic* hosts [19:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:09] !log installing Java security updates on cloudelastic* hosts [19:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:12] o/ deploying to group0 shortly [19:05:43] 10Operations, 10Discovery-Search, 10Datacenter-Switchover-2018: Warn when CirrusSearch is not configured to use local DC for an extended time - https://phabricator.wikimedia.org/T204135 (10Gehel) p:05High→03Low [19:05:55] (03CR) 10CDanis: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/622190 (https://phabricator.wikimedia.org/T261193) (owner: 10CDanis) [19:07:02] (03PS1) 10Dduvall: group0 wikis to 1.36.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622414 [19:07:04] (03CR) 10Dduvall: [C: 03+2] group0 wikis to 1.36.0-wmf.6 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/622414 (owner: 10Dduvall) [19:07:50] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622414 (owner: 10Dduvall) [19:08:07] (03CR) 10Muehlenhoff: "I have built a stretch-wikimedia bpfcc backport on deneb, only needs to be imported to apt.wikimedia.org along with the wrapper discussed " [puppet] - 10https://gerrit.wikimedia.org/r/622190 (https://phabricator.wikimedia.org/T261193) (owner: 10CDanis) [19:08:18] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:09:42] (03CR) 10CDanis: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/622190 (https://phabricator.wikimedia.org/T261193) (owner: 10CDanis) [19:09:43] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.6 [19:09:44] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Gehel) [19:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:14] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:15:20] !log 1.36.0-wmf.6 promoted to group0 (T257974). 
no new errors [19:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:24] T257974: 1.36.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T257974 [19:16:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:17:31] bah [19:18:14] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 52 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:18:59] 10Operations, 10video2commons: video-redis-buster.video.eqiad.wmflabs:6379. Connection refused. - https://phabricator.wikimedia.org/T261245 (10Jidanni) [19:20:24] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:27:39] (03Abandoned) 10CDanis: package_builder: add support for 'sloppy' backports [puppet] - 10https://gerrit.wikimedia.org/r/622190 (https://phabricator.wikimedia.org/T261193) (owner: 10CDanis) [19:30:10] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:39:18] !log dcausse@deploy1001 Started deploy [wikimedia/discovery/analytics@125cb6d]: test: Add wikidata ttl import [19:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:12] !log dcausse@deploy1001 Finished deploy [wikimedia/discovery/analytics@125cb6d]: test: Add wikidata ttl import (duration: 00m 54s) [19:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:04] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:52:02] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:58:09] (03PS2) 10Ahmon Dancy: Updated some cross references in comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621589 [20:05:56] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 557 (alerts on 50) - 
https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:15:52] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:19:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install kubernetes1017.eqiad.wmnet - https://phabricator.wikimedia.org/T258747 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [20:19:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install kubernetes1017.eqiad.wmnet - https://phabricator.wikimedia.org/T258747 (10Jclark-ctr) Racked and cabled host kubernetes1017 A5. U31. Port 31 [20:20:57] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10Jclark-ctr) [20:21:50] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:22:05] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10Jclark-ctr) [20:53:38] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 53 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:55:30] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 55 probes 
of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:55:41] (03PS1) 10Andrew Bogott: cloudvirts: add ceph config to non-ceph-enabled cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/622427 (https://phabricator.wikimedia.org/T261252) [20:59:12] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:05:10] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 50 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:05:35] the IPv6 internet seems kind of generally unhappy the last several hours / last day :/ [21:10:16] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:10:34] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:13:50] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 51 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:16:28] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, 
unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:18:06] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:19:46] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 50 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:25:14] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:41:26] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:46:35] !log importing xhgui 0.12.0-2-wmf1 to buster-wikimedia APT repo (T260397) [21:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:39] T260397: XHGui is returning all results, with wrong sort order - https://phabricator.wikimedia.org/T260397 [21:49:00] dpifke: on xhgui1001 i can now see that it _would_ install 0.12.0-2-wmf1 when i simulate the install with -s. would you like me to install it for real or do it yourself [21:49:49] Go for it. [21:50:28] !log xhgui1001 - Unpacking xhgui (0.12.0-2-wmf1) over (0.9.0-1-wmf1) ... 
[21:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:31] !log xhgui1001/xhgui2001 - Unpacking xhgui (0.12.0-2-wmf1) over (0.9.0-1-wmf1) (T260397) [21:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:45] done on both [21:52:05] Looks good, thanks! [21:52:12] great. yw [21:53:53] (03PS2) 10Andrew Bogott: cloudvirts: add ceph config to non-ceph-enabled cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/622427 (https://phabricator.wikimedia.org/T261252) [21:58:15] (03PS3) 10Andrew Bogott: cloudvirts: add ceph config to non-ceph-enabled cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/622427 (https://phabricator.wikimedia.org/T261252) [21:59:06] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:59:22] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:01:08] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [22:05:20] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:07:37] (03PS4) 10Andrew Bogott: cloudvirts: add ceph config to non-ceph-enabled cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/622427 (https://phabricator.wikimedia.org/T261252) [22:11:00] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:14:37] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirts: add ceph config to non-ceph-enabled cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/622427 (https://phabricator.wikimedia.org/T261252) (owner: 10Andrew Bogott) [22:16:07] (03PS3) 10BryanDavis: dynamicproxy: serve default /robots.txt and /favicon.ico for Toolforge [puppet] - 10https://gerrit.wikimedia.org/r/622237 (https://phabricator.wikimedia.org/T251628) [22:16:09] (03PS3) 10BryanDavis: dynamicproxy: allow service workers in Toolforge [puppet] - 10https://gerrit.wikimedia.org/r/622238 (https://phabricator.wikimedia.org/T158216) [22:16:11] (03PS1) 10BryanDavis: dynamicproxy: Drop temporary file cleanup blocks [puppet] - 10https://gerrit.wikimedia.org/r/622434 [22:16:13] (03PS1) 10BryanDavis: dynamicproxy: update Content-Security-Policy-Report-Only header [puppet] - 10https://gerrit.wikimedia.org/r/622435 [22:16:15] (03PS1) 10BryanDavis: dynamicproxy: Remove 
X-Wikimedia-Debug error page overrides [puppet] - 10https://gerrit.wikimedia.org/r/622436 [22:16:17] (03PS1) 10BryanDavis: dynamicproxy: Update proxy_redirect to use $host to limit scheme rewrites [puppet] - 10https://gerrit.wikimedia.org/r/622437 [22:20:16] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [22:22:58] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:29:01] (03PS1) 10Andrew Bogott: wmcs admin scripts: add wmcs-ceph-migrate [puppet] - 10https://gerrit.wikimedia.org/r/622440 (https://phabricator.wikimedia.org/T261252) [22:29:40] (03CR) 10Andrew Bogott: [C: 03+2] wmcs admin scripts: add wmcs-ceph-migrate [puppet] - 10https://gerrit.wikimedia.org/r/622440 (https://phabricator.wikimedia.org/T261252) (owner: 10Andrew Bogott) [22:30:28] 10Operations, 10video2commons: video-redis-buster.video.eqiad.wmflabs:6379. Connection refused. 
- https://phabricator.wikimedia.org/T261245 (10Aklapper) 05Open→03Invalid In my understanding this needs to be reported on Github and not here; see https://phabricator.wikimedia.org/project/profile/2141/ [22:32:52] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:38:50] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:55:00] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200825T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. 
[23:00:36] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:06:58] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:20:56] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4030 is CRITICAL: cluster=cache_text instance=cp4030 job=purged site=ulsfo topic=codfw.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4030 [23:22:56] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 557 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:22:56] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4030 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4030 [23:34:28] (03PS1) 10Bstorm: wikireplicas: create multiinstance roles and profiles [puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) [23:35:27] (03CR) 10jerkins-bot: [V: 04-1] wikireplicas: create multiinstance roles and profiles [puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [23:36:24] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:43:00] (03PS2) 10Bstorm: wikireplicas: create multiinstance roles and profiles [puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) [23:43:59] (03CR) 10jerkins-bot: [V: 04-1] wikireplicas: create multiinstance roles and profiles [puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [23:46:32] (03CR) 10Bstorm: "This is obviously adapted from dbstore_multiinstance with some updates to be more keyed off hiera and create all the stuff needed for labs" [puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [23:53:42] (03CR) 10Bstorm: [C: 03+2] dynamicproxy: Drop temporary file cleanup blocks [puppet] - 10https://gerrit.wikimedia.org/r/622434 (owner: 10BryanDavis) [23:56:06] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5009 is CRITICAL: cluster=cache_text instance=cp5009 job=purged site=eqsin topic=codfw.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [23:56:10] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4028 is CRITICAL: cluster=cache_text instance=cp4028 job=purged site=ulsfo topic=codfw.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028 [23:57:42] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3058 is CRITICAL: cluster=cache_text instance=cp3058 job=purged site=esams topic=codfw.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3058 [23:58:06] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5009 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [23:58:10] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4028 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028 [23:59:40] RECOVERY - Time elapsed since the last kafka event processed by purged on cp3058 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3058