[00:01:41] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (Andrew)
[00:01:48] !log reinstalling testvm[345]001 to confirm OS installs work as normal after switching DHCP servers in POPs (T252526)
[00:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:01:54] T252526: serve tftpboot environment from the install servers and create one in each edge POP - https://phabricator.wikimedia.org/T252526
[00:03:11] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T243414 (Andrew) p:Low→Medium a:Jclark-ctr→Cmjohnson @cmjohnson , This host is empty again; you can power it down any time....
[00:03:42] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 57.22 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:04:11] also expected because of how the esams drain works
[00:04:35] lol, codfw has a 500% increase... poor thing is under-used
[00:04:43] thanks, yes, the eqiad part raised an eyebrow
[00:04:58] there is a long-backburnered task to rebalance the steady state of that a bit
[00:15:12] (CR) Ryan Kemper: [C: +2] envoy: Set appropriate service names for three level wikimedia.org domains [puppet] - https://gerrit.wikimedia.org/r/631503 (https://phabricator.wikimedia.org/T263073) (owner: Ebernhardson)
[00:15:14] !log T259621 cdanis@re0.cr2-esams> request system software add re1 no-validate /var/tmp/junos-install-mx-x86-64-17.3R3-S8.1.tgz
[00:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:46] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 136 probes of 564 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:16:09] that's... interesting
[00:16:52] that's definitely real wtf
[00:17:04] the router is still in service, it just has its external BGP sessions turned off
[00:17:37] I mean it doesn't matter since the site is depooled, but in theory you can run this procedure without depooling the site (in which case we'd be in trouble rn)
[00:18:32] *nod* good you did not trust it heh
[00:19:20] so strange
[00:19:30] I wonder if BGP graceful shutdown would have helped
[00:19:38] Hey guys, on alswiki varnish is giving me 429 error
[00:19:42] ok, seems to have recovered
[00:20:09] And most of the API requests result in NULL responses.
[00:20:28] Cyberpower678: can you give a sample API request?
[00:21:06] cdanis: it's just an allpages list query.
[00:21:28] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 37 probes of 564 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:21:57] AFAIK, it only affects alswiki
[00:22:23] !log T259621 cdanis@re0.cr2-esams> request system reboot other-routing-engine
[00:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:42] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:25:25] Cyberpower678: I don't think there are any known issues at present, can you open a task?
[00:25:36] Not at the moment.
[00:26:14] !log T259621 cdanis@re0.cr2-esams> request chassis routing-engine master switch
[00:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:18] Cyberpower678, maybe you should provide the URL :)
[00:29:00] Krenair: once the VM frees up, I can do that. :-)
[00:29:12] if it's https://als.wikipedia.org/w/api.php?action=query&list=allpages then it works fine for me
[00:30:05] 429 is Too Many Requests
[00:30:14] where are you running it from?
[00:30:28] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 76.39 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:31:47] (PS1) Clarakosi: Fix OAuthRateLimiter rate limit configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/632589
[00:34:58] if you're logged out and your request is not originating from a wikimedia network (e.g. wmcs), your public IP should get 1000/10s (100/s long term, with 1000 burst) requests to the API
[00:36:44] !log T259621 cdanis@re1.cr2-esams> request system software add /var/tmp/junos-install-mx-x86-64-17.3R3-S8.1.tgz re0 no-validate
[00:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:34] Operations, Patch-For-Review: serve tftpboot environment from the install servers and create one in each edge POP - https://phabricator.wikimedia.org/T252526 (Dzahn) ☑️ eqsin tested - reinstalled testvm5001 and confirmed in syslog that install5001 is serving both DHCP and TFTP now; not bast5001 (tftp) o...
[00:40:39] !log T259621 cdanis@re1.cr2-esams> request system reboot other-routing-engine
[00:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:28] (PS1) Dzahn: install_server: remove testvm[345]001 [puppet] - https://gerrit.wikimedia.org/r/632590
[00:41:57] Operations, Patch-For-Review: serve tftpboot environment from the install servers and create one in each edge POP - https://phabricator.wikimedia.org/T252526 (Dzahn) Open→Resolved
[00:42:00] Operations, Patch-For-Review: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 (Dzahn)
[00:42:42] Operations, Patch-For-Review: serve tftpboot environment from the install servers and create one in each edge POP - https://phabricator.wikimedia.org/T252526 (Dzahn)
[00:43:27] Operations: serve tftpboot environment from the install servers and create one in each edge POP - https://phabricator.wikimedia.org/T252526 (Dzahn)
[00:43:50] !log T259621 cdanis@re1.cr2-esams> request chassis routing-engine master switch
[00:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:20] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:47:20] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:47:37] expected
[00:49:04] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:49:06] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:50:41] Operations, Patch-For-Review: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 (Dzahn) quoting from the original task description and replying inline: > when moving to buster and into the edge sites we should split the apt.wikimedia.org repository to a separat...
[00:58:07] (PS1) CDanis: Revert "depool esams for router upgrade" [dns] - https://gerrit.wikimedia.org/r/632561
[00:59:48] (CR) CDanis: [C: +2] Revert "depool esams for router upgrade" [dns] - https://gerrit.wikimedia.org/r/632561 (owner: CDanis)
[01:00:15] repooling esams, expect a codfw traffic drop alert in ~10 mins
[01:00:28] !log repool esams; cr2-esams router upgrade complete
[01:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:52] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:05:00] (PS1) Dzahn: switch webproxy for esams/ulsfo/eqsin to their local install server [dns] - https://gerrit.wikimedia.org/r/632591 (https://phabricator.wikimedia.org/T242602)
[01:10:32] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on alert1001 is CRITICAL: 55.37 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[01:10:42] 10 minutes and 17 seconds ahah
[01:11:22] (CR) Dzahn: "squid is running on all 3, listening on 8080, iptables rules are there" [dns] - https://gerrit.wikimedia.org/r/632591 (https://phabricator.wikimedia.org/T242602) (owner: Dzahn)
[01:14:42] ok, things look good, normal background rate of NELs -- aside from the fact that we get `dns.address_changed` reports which I hadn't realized we would!
[01:14:53] very interesting, I'll have to look at that codepath in Chromium later
[01:15:43] https://logstash.wikimedia.org/goto/acc8c9cfc164d44e12df5e6d3b2cdb84
[01:15:47] anyway, logging off for now :)
[01:16:12] same here. DHCP servers have been switched. things still working, confirmed.. and logging off in a minute. cya
[01:16:32] aaahhhhhh
[01:16:42] https://www.w3.org/TR/network-error-logging/#generate-a-network-error-report fascinating
[01:17:00] I had forgotten about the 'downgrade' step
[01:18:09] (CR) Dzahn: "[prometheus5001:~] $ http_proxy='install5001.wikimedia.org:8080' curl example.com | grep h1" [dns] - https://gerrit.wikimedia.org/r/632591 (https://phabricator.wikimedia.org/T242602) (owner: Dzahn)
[01:18:15] (PS2) Dzahn: switch webproxy for esams/ulsfo/eqsin to their local install server [dns] - https://gerrit.wikimedia.org/r/632591 (https://phabricator.wikimedia.org/T242602)
[01:28:32] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:41:32] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[01:58:00] (Abandoned) Jeena Huneidi: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/629758 (owner: PipelineBot)
[02:27:19] (CR) Ppchelko: "Wait a second. This means this whole entire thing didn't work before??" [mediawiki-config] - https://gerrit.wikimedia.org/r/632589 (owner: Clarakosi)
[02:41:43] (CR) Clarakosi: "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/632589 (owner: Clarakosi)
[02:44:07] (CR) Ppchelko: [C: +1] "I'm surprised because we all have been testing it quite a lot. I guess nobody actually ever got to the high-tier clients.. Will deploy fir" [mediawiki-config] - https://gerrit.wikimedia.org/r/632589 (owner: Clarakosi)
[03:11:38] (PS1) Jforrester: [DNM] loginwiki: Allow users to mark Notifications as read [mediawiki-config] - https://gerrit.wikimedia.org/r/632598
[03:19:15] (CR) Andrew Bogott: [C: +2] cloudvirt1023: to Ceph and Buster [puppet] - https://gerrit.wikimedia.org/r/632545 (https://phabricator.wikimedia.org/T259399) (owner: Andrew Bogott)
[03:43:28] (PS2) Jforrester: [DNM] loginwiki: Allow users to mark Notifications as read [mediawiki-config] - https://gerrit.wikimedia.org/r/632598 (https://phabricator.wikimedia.org/T264834)
[03:43:42] (CR) Jforrester: [C: -2] "Needs Security sign-off." [mediawiki-config] - https://gerrit.wikimedia.org/r/632598 (https://phabricator.wikimedia.org/T264834) (owner: Jforrester)
[03:48:50] PROBLEM - Check systemd state on ms-be2044 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:38:40] RECOVERY - Check systemd state on ms-be2044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:03:28] Operations, DBA, Wikidata, serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (Marostegui) Adding @Addshore and @Ladsgroup as they have lots of cool dashboards (that I am unable to find) where they mayb...
[05:46:28] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[05:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:28] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[05:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:31] (PS1) Ayounsi: Pmacct add standard BGP community to flows [puppet] - https://gerrit.wikimedia.org/r/632603 (https://phabricator.wikimedia.org/T254332)
[06:27:28] Operations, Analytics, Analytics-Kanban, netops, Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (ayounsi) >>! In T254332#6523115, @mforns wrote: > This, I assume, needs to be added to the pmacct producer? That's correct,...
[06:55:24] (CR) Elukey: "Fran let me know if you are ready to merge, and I'll do it :)" [puppet] - https://gerrit.wikimedia.org/r/629409 (https://phabricator.wikimedia.org/T251777) (owner: Fdans)
[06:57:54] RECOVERY - snapshot of s3 in eqiad on alert1001 is OK: Last snapshot for s3 at eqiad (db1095.eqiad.wmnet:3313) taken on 2020-10-07 05:26:45 (957 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[06:58:12] Operations, ops-eqiad: Move eqiad WIFI subnet to 10.66.1.0/24 - https://phabricator.wikimedia.org/T264839 (ayounsi) p:Triage→High
[07:01:48] Operations, ops-eqiad: Move eqiad WIFI subnet to 10.66.1.0/24 - https://phabricator.wikimedia.org/T264839 (ayounsi) Created/renamed: https://netbox.wikimedia.org/ipam/prefixes/357/ and https://netbox.wikimedia.org/ipam/prefixes/348/
[07:05:25] (PS1) Ayounsi: Renumber eqiad wifi prefix [homer/public] - https://gerrit.wikimedia.org/r/632642 (https://phabricator.wikimedia.org/T264839)
[07:05:39] (CR) Muehlenhoff: [C: +2] Don't add diamond to new images [puppet] - https://gerrit.wikimedia.org/r/632477 (https://phabricator.wikimedia.org/T210993) (owner: Muehlenhoff)
[07:05:48] (PS2) Muehlenhoff: Don't add diamond to new images [puppet] - https://gerrit.wikimedia.org/r/632477 (https://phabricator.wikimedia.org/T210993)
[07:06:07] (CR) Ayounsi: [C: +2] Renumber eqiad wifi prefix [homer/public] - https://gerrit.wikimedia.org/r/632642 (https://phabricator.wikimedia.org/T264839) (owner: Ayounsi)
[07:08:02] Operations, ops-eqiad, Patch-For-Review: Move eqiad WIFI subnet to 10.66.1.0/24 - https://phabricator.wikimedia.org/T264839 (ayounsi) `name=homer diff after above changes,lang=diff [edit system services dhcp] + pool 10.66.1.0/24 { + address-range low 10.66.1.64 high 10.66.1.127; +...
[07:10:31] Operations, ops-eqiad: Move eqiad WIFI subnet to 10.66.1.0/24 - https://phabricator.wikimedia.org/T264839 (ayounsi) Open→Resolved Done and should be fully transparent for onsite. Let me know if any issues.
[07:13:10] (PS1) Marostegui: es2015: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/632643 (https://phabricator.wikimedia.org/T264700)
[07:13:20] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) (owner: Dzahn)
[07:14:11] (CR) Marostegui: [C: +2] es2015: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/632643 (https://phabricator.wikimedia.org/T264700) (owner: Marostegui)
[07:14:34] !log Stop MySQL es2015 for decommissioning T264700
[07:14:38] (CR) Muehlenhoff: [C: -1] "Dupe of https://gerrit.wikimedia.org/r/c/operations/puppet/+/632477" [puppet] - https://gerrit.wikimedia.org/r/632569 (https://phabricator.wikimedia.org/T210993) (owner: Dzahn)
[07:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:41] Operations, ops-codfw: ps1-b3-codfw AB feed current > 12A - https://phabricator.wikimedia.org/T262809 (ayounsi) Resolved→Open This has been alerting again.
[07:14:42] T264700: decommission es2015.codfw.wmnet - https://phabricator.wikimedia.org/T264700
[07:15:24] (CR) Ayounsi: dns: consolidate reverse zone files (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/632574 (https://phabricator.wikimedia.org/T264273) (owner: Volans)
[07:16:48] (CR) Muehlenhoff: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/632591 (https://phabricator.wikimedia.org/T242602) (owner: Dzahn)
[07:27:43] Operations, ops-codfw, DBA, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Marostegui) The host crashed again yesterday while loading a backup, same CPU error as always ` --------------------------------------------------------------------...
[07:33:29] (PS1) Volans: Move eqiad WIFI subnet to 10.66.1.0/24 [dns] - https://gerrit.wikimedia.org/r/632645 (https://phabricator.wikimedia.org/T264839)
[07:40:36] (PS2) Volans: Move eqiad WIFI subnet to 10.66.1.0/24 [dns] - https://gerrit.wikimedia.org/r/632645 (https://phabricator.wikimedia.org/T264839)
[07:45:27] (PS1) Marostegui: instances.yaml: Remove es2015 from dbctl [puppet] - https://gerrit.wikimedia.org/r/632647 (https://phabricator.wikimedia.org/T264700)
[07:48:16] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[07:48:39] (CR) Marostegui: [C: +2] instances.yaml: Remove es2015 from dbctl [puppet] - https://gerrit.wikimedia.org/r/632647 (https://phabricator.wikimedia.org/T264700) (owner: Marostegui)
[07:49:04] yes that's me, the prefix change requires some manual action, it's the first time we're taking time to get it done properly :)
[07:49:42] PROBLEM - Check systemd state on ms-be2043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:49:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove es2015 from dbctl T264700', diff saved to https://phabricator.wikimedia.org/P12941 and previous config saved to /var/cache/conftool/dbconfig/20201007-074951-marostegui.json
[07:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:58] T264700: decommission es2015.codfw.wmnet - https://phabricator.wikimedia.org/T264700
[07:55:05] (CR) Volans: [C: +2] Move eqiad WIFI subnet to 10.66.1.0/24 [dns] - https://gerrit.wikimedia.org/r/632645 (https://phabricator.wikimedia.org/T264839) (owner: Volans)
[07:56:11] (PS1) Muehlenhoff: Install ldap-replica200[34] as additional LDAP replicas [puppet] - https://gerrit.wikimedia.org/r/632648 (https://phabricator.wikimedia.org/T264388)
[07:58:04] !log volans@cumin1001 START - Cookbook sre.dns.netbox
[07:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:20] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:25] Operations, DBA, Wikidata, serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (Marostegui) I forgot to paste: ` root@mwmaint2001:~# crontab -l -uwww-data | grep -w wikidatawiki */3 * * * * echo "$$: Sta...
[08:08:37] (PS3) Elukey: Import the action module from spicerack [software/pywmflib] - https://gerrit.wikimedia.org/r/631792 (https://phabricator.wikimedia.org/T257905)
[08:08:40] (PS3) Elukey: Import the config module from Spicerack [software/pywmflib] - https://gerrit.wikimedia.org/r/631909 (https://phabricator.wikimedia.org/T257905)
[08:08:44] (PS2) Elukey: Import the phabricator module from Spicerack [software/pywmflib] - https://gerrit.wikimedia.org/r/631910 (https://phabricator.wikimedia.org/T257905)
[08:09:16] !log updated envoyproxy to 1.15.1-2 on all non mw and restbase hosts
[08:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:03] (CR) Elukey: Import the config module from Spicerack (2 comments) [software/pywmflib] - https://gerrit.wikimedia.org/r/631909 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[08:11:12] (CR) Elukey: Import the phabricator module from Spicerack (1 comment) [software/pywmflib] - https://gerrit.wikimedia.org/r/631910 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[08:13:44] (PS4) Elukey: Import the config module from Spicerack [software/pywmflib] - https://gerrit.wikimedia.org/r/631909 (https://phabricator.wikimedia.org/T257905)
[08:13:47] (PS3) Elukey: Import the phabricator module from Spicerack [software/pywmflib] - https://gerrit.wikimedia.org/r/631910 (https://phabricator.wikimedia.org/T257905)
[08:14:18] (CR) Filippo Giunchedi: [C: +2] hieradata: enable rsyslog queues for kafka in esams/eqsin/eqiad [puppet] - https://gerrit.wikimedia.org/r/632512 (https://phabricator.wikimedia.org/T226703) (owner: Filippo Giunchedi)
[08:14:43] (CR) Elukey: Import the config module from Spicerack (2 comments) [software/pywmflib] - https://gerrit.wikimedia.org/r/631909 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[08:16:57] Operations, DBA, Wikidata, serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (Marostegui) The only common query I have found that executes on all hosts is: ` SELECT /* Wikibase\Lib\Store\Sql\Terms\Data...
[08:20:32] (CR) Filippo Giunchedi: [C: +2] hieradata: expand swift object server statsd mappings [puppet] - https://gerrit.wikimedia.org/r/632205 (https://phabricator.wikimedia.org/T264588) (owner: Filippo Giunchedi)
[08:21:35] (PS1) Elukey: Decommission analytics1042 from Analytics Hadoop [puppet] - https://gerrit.wikimedia.org/r/632650 (https://phabricator.wikimedia.org/T255140)
[08:22:06] (CR) jerkins-bot: [V: -1] Decommission analytics1042 from Analytics Hadoop [puppet] - https://gerrit.wikimedia.org/r/632650 (https://phabricator.wikimedia.org/T255140) (owner: Elukey)
[08:24:45] (PS2) Elukey: Decommission analytics1042 from Analytics Hadoop [puppet] - https://gerrit.wikimedia.org/r/632650 (https://phabricator.wikimedia.org/T255140)
[08:25:20] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[08:31:54] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (ayounsi) Note that now racks `C8` and `D5` are dedicated to WMCS servers (including cloudvirt). So please move servers there when able.
[08:32:50] RECOVERY - Check systemd state on ms-be2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:32:50] !log roll-restart statsd-exporter across ms-be* after puppet run - T264588
[08:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:56] T264588: Report swift-object server per-method latencies - https://phabricator.wikimedia.org/T264588
[08:33:17] (PS3) Elukey: Decommission analytics1042 from Analytics Hadoop [puppet] - https://gerrit.wikimedia.org/r/632650 (https://phabricator.wikimedia.org/T255140)
[08:33:19] (PS1) Elukey: profile::hadoop::master::standby: improve hiera lookups [puppet] - https://gerrit.wikimedia.org/r/632653 (https://phabricator.wikimedia.org/T255140)
[08:34:43] (PS1) Muehlenhoff: Remove LDAP access for cgauthier [puppet] - https://gerrit.wikimedia.org/r/632654
[08:36:20] (PS2) Elukey: profile::hadoop::master::standby: improve hiera lookups [puppet] - https://gerrit.wikimedia.org/r/632653 (https://phabricator.wikimedia.org/T255140)
[08:38:11] (PS3) Elukey: profile::hadoop::master::standby: improve hiera lookups [puppet] - https://gerrit.wikimedia.org/r/632653 (https://phabricator.wikimedia.org/T255140)
[08:38:50] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[08:38:50] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:03] !log kormat@cumin1001 dbctl commit (dc=all): 'db2138:3314 depooling: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12942 and previous config saved to /var/cache/conftool/dbconfig/20201007-083903-kormat.json
[08:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:09] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[08:39:16] (CR) Muehlenhoff: [C: +2] Remove LDAP access for cgauthier [puppet] - https://gerrit.wikimedia.org/r/632654 (owner: Muehlenhoff)
[08:39:54] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/25755/" [puppet] - https://gerrit.wikimedia.org/r/632650 (https://phabricator.wikimedia.org/T255140) (owner: Elukey)
[08:39:57] (CR) Elukey: [C: +2] Decommission analytics1042 from Analytics Hadoop [puppet] - https://gerrit.wikimedia.org/r/632650 (https://phabricator.wikimedia.org/T255140) (owner: Elukey)
[08:40:39] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/25755/" [puppet] - https://gerrit.wikimedia.org/r/632653 (https://phabricator.wikimedia.org/T255140) (owner: Elukey)
[08:40:42] (CR) Elukey: [C: +2] profile::hadoop::master::standby: improve hiera lookups [puppet] - https://gerrit.wikimedia.org/r/632653 (https://phabricator.wikimedia.org/T255140) (owner: Elukey)
[08:40:49] (PS4) Elukey: profile::hadoop::master::standby: improve hiera lookups [puppet] - https://gerrit.wikimedia.org/r/632653 (https://phabricator.wikimedia.org/T255140)
[08:40:55] Operations, SRE-swift-storage, Patch-For-Review, User-fgiunchedi: Report swift-object server per-method latencies - https://phabricator.wikimedia.org/T264588 (fgiunchedi) Open→Resolved a:fgiunchedi This is complete!
[08:40:58] Operations, SRE-swift-storage, Patch-For-Review, User-fgiunchedi: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 (fgiunchedi)
[08:42:07] (CR) Muehlenhoff: "The patch itself is fine, but you'll need to rebase it on top of https://gerrit.wikimedia.org/r/c/operations/puppet/+/632471 (once merged)" [puppet] - https://gerrit.wikimedia.org/r/632570 (https://phabricator.wikimedia.org/T210993) (owner: Dzahn)
[08:51:08] PROBLEM - Hadoop NodeManager on analytics1042 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[08:52:25] elukey: FYI ^^
[08:53:43] yes yes it is me, I am decomming it, I didn't expect the daemon to die though :D
[08:54:31] ah yes it gracefully shuts down, will add downtime for the next ones
[08:55:20] lol thx
[09:04:08] (PS1) Filippo Giunchedi: hieradata: enable rsyslog queues for kafka in codfw [puppet] - https://gerrit.wikimedia.org/r/632655 (https://phabricator.wikimedia.org/T226703)
[09:04:20] (CR) jerkins-bot: [V: -1] hieradata: enable rsyslog queues for kafka in codfw [puppet] - https://gerrit.wikimedia.org/r/632655 (https://phabricator.wikimedia.org/T226703) (owner: Filippo Giunchedi)
[09:05:04] (PS2) Filippo Giunchedi: hieradata: enable rsyslog queues for kafka in codfw [puppet] - https://gerrit.wikimedia.org/r/632655 (https://phabricator.wikimedia.org/T226703)
[09:07:31] looking for volunteers for a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/632655 should be safe, just codfw left to roll out
[09:09:08] Operations, Traffic, Performance-Team (Radar): Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (ema) It seems that this task is starting to follow the communication pattern that emerged in T238494, which wasn't great. Flagging it early on to try to avoid the unpleasan...
[09:09:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1076 T264755 ', diff saved to https://phabricator.wikimedia.org/P12943 and previous config saved to /var/cache/conftool/dbconfig/20201007-090943-marostegui.json
[09:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:50] T264755: db1076 crashed - BBU failure - https://phabricator.wikimedia.org/T264755
[09:11:40] Operations, DBA, Data-Persistence: db1076 crashed - BBU failure - https://phabricator.wikimedia.org/T264755 (Marostegui) Open→Resolved a:Marostegui The table comparison came back clean. I have repooled the host. This host is a candidate master for s2, so it runs stretch and 10.1. It will...
[09:11:42] Operations, DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Marostegui)
[09:21:02] !log imported icu63 63.1-6+deb10u1~wmf1 to component/icu63 for stretch-wikimedia
[09:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:05] Operations, Analytics-Clusters, Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (Vgutierrez) yeah.. I'll handle the backport :)
[09:33:00] Operations, Analytics-Clusters, Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (Vgutierrez) @elukey double checking https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/varnish4/+/refs/heads/debian-wmf/lib/libvarnishapi/vut.c#424 it looks like ed1...
[09:37:08] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/624237 (https://phabricator.wikimedia.org/T220625) (owner: Ryan Kemper)
[09:38:07] (CR) Gehel: [V: +2 C: +2] "LGTM" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/623819 (owner: Ryan Kemper)
[09:39:00] Operations, Traffic, Performance-Team (Radar): Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles) It seems normal to me to paste screenshots of findings, create tasks when a significant regression is witnessed, verify the cause via a targeted rollback. Performan...
[09:44:12] !log kormat@cumin1001 dbctl commit (dc=all): 'db2138:3314 (re)pooling @ 100%: 75', diff saved to https://phabricator.wikimedia.org/P12944 and previous config saved to /var/cache/conftool/dbconfig/20201007-094412-kormat.json
[09:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:31] (PS1) JMeybohm: deployment_server::helmfile: Allow to define general values in hiera [puppet] - https://gerrit.wikimedia.org/r/632658
[09:48:28] (PS1) Arturo Borrero Gonzalez: openstack: cloudgw: refresh network setup [puppet] - https://gerrit.wikimedia.org/r/632659 (https://phabricator.wikimedia.org/T263622)
[09:49:07] Operations, Analytics-Clusters, Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (elukey) ah snap I didn't check, will do the next time using the gerrit repo. It is not clear to me why we have all those fstat calls though..
[09:50:05] (03PS2) 10JMeybohm: deployment_server::helmfile: Allow to define general values in hiera [puppet] - 10https://gerrit.wikimedia.org/r/632658 [09:53:55] !log kormat@cumin1001 dbctl commit (dc=all): 'Remove db2119 from mw load groups T259831', diff saved to https://phabricator.wikimedia.org/P12945 and previous config saved to /var/cache/conftool/dbconfig/20201007-095355-kormat.json [09:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:01] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [09:56:36] (03CR) 10Jbond: "See inline can follow up with a further path but not sure anything can be Stdlib::Unixpath" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/616758 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [09:56:38] (03CR) 10Jbond: [C: 03+2] service: drop legacy validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616758 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [09:58:05] (03PS3) 10JMeybohm: deployment_server::helmfile: Allow to define general values in hiera [puppet] - 10https://gerrit.wikimedia.org/r/632658 (https://phabricator.wikimedia.org/T264157) [09:58:47] (03CR) 10JMeybohm: "PCC https://puppet-compiler.wmflabs.org/compiler1002/25757/" [puppet] - 10https://gerrit.wikimedia.org/r/632658 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [09:59:07] any volunteers/kind sounds for a +1 on simple https://gerrit.wikimedia.org/r/c/operations/puppet/+/632655 ? [09:59:14] s/sounds/souls/ [09:59:20] I like the sounds version too tho [10:03:04] I'll have a look in 5m [10:03:22] thanks! 
appreciate it [10:04:22] (03CR) 10Elukey: [C: 03+1] hieradata: enable rsyslog queues for kafka in codfw [puppet] - 10https://gerrit.wikimedia.org/r/632655 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [10:07:50] (03CR) 10Muehlenhoff: [C: 03+1] hieradata: enable rsyslog queues for kafka in codfw [puppet] - 10https://gerrit.wikimedia.org/r/632655 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [10:08:06] \o/ thanks folks, appreciate it [10:08:17] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: enable rsyslog queues for kafka in codfw [puppet] - 10https://gerrit.wikimedia.org/r/632655 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [10:10:31] (03PS1) 10Giuseppe Lavagetto: redis::instance: switch to use systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/632661 [10:10:33] (03PS1) 10Giuseppe Lavagetto: redis::instance: raise the file limit to match maxclients [puppet] - 10https://gerrit.wikimedia.org/r/632662 (https://phabricator.wikimedia.org/T263910) [10:12:13] (03CR) 10Jbond: [C: 03+2] validate_$type: add checks to prevent legacy stdlib functions [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/616895 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [10:12:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, although I'm starting to feel we're reimplementing hiera here 😉" [puppet] - 10https://gerrit.wikimedia.org/r/632658 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [10:12:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: cloudgw: refresh network setup [puppet] - 10https://gerrit.wikimedia.org/r/632659 (https://phabricator.wikimedia.org/T263622) (owner: 10Arturo Borrero Gonzalez) [10:14:42] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: move deployment-prep swift settings off Horizon [puppet] - 10https://gerrit.wikimedia.org/r/631758 (owner: 10Filippo Giunchedi) [10:14:49] (03PS2) 10Filippo Giunchedi: hieradata: move 
deployment-prep swift settings off Horizon [puppet] - 10https://gerrit.wikimedia.org/r/631758 [10:15:39] 10Operations, 10Traffic, 10Performance-Team (Radar): Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (10Gilles) On the topic of reliable metrics, our RUM monitoring is reliable, battle-tested and the ultimate truth about the performance users really experience. While it can't... [10:22:10] (03PS1) 10Jbond: 1.0.6: create new release [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/632664 [10:23:33] (03CR) 10Jbond: [C: 03+2] 1.0.6: create new release [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/632664 (owner: 10Jbond) [10:28:56] (03PS1) 10Jbond: Add publishing instructions and add gem file to gitignore file [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/632665 [10:29:21] (03CR) 10Hnowlan: [C: 03+1] Fix OAuthRateLimiter rate limit configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632589 (owner: 10Clarakosi) [10:29:23] (03CR) 10Jbond: [C: 03+2] Add publishing instructions and add gem file to gitignore file [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/632665 (owner: 10Jbond) [10:33:53] (03CR) 10Giuseppe Lavagetto: [C: 03+2] service::configuration: connect to restbase via TLS [puppet] - 10https://gerrit.wikimedia.org/r/630562 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [10:39:51] (03PS1) 10Jbond: wmf_styleguide: bump stylgude gem to 1.0.6 [puppet] - 10https://gerrit.wikimedia.org/r/632687 [10:40:16] (03CR) 10jerkins-bot: [V: 04-1] wmf_styleguide: bump stylgude gem to 1.0.6 [puppet] - 10https://gerrit.wikimedia.org/r/632687 (owner: 10Jbond) [10:45:54] (03PS2) 10Jbond: wmf_styleguide: bump stylgude gem to 1.0.6 [puppet] - 10https://gerrit.wikimedia.org/r/632687 [10:46:17] (03CR) 10jerkins-bot: [V: 04-1] wmf_styleguide: bump stylgude gem to 1.0.6 [puppet] - 10https://gerrit.wikimedia.org/r/632687 
(owner: 10Jbond) [10:47:45] (03PS1) 10Filippo Giunchedi: hieradata: add missing swift private_container_list to deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/632691 [10:48:55] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add missing swift private_container_list to deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/632691 (owner: 10Filippo Giunchedi) [10:49:12] 10Operations, 10SRE-OnFire, 10Sustainability (Incident Followup): Investigate whether we can automatically share incident status docs with WMDE - https://phabricator.wikimedia.org/T244395 (10Lucas_Werkmeister_WMDE) Works for me, thanks! I added a description to the folder so we (in WMDE) hopefully don’t forg... [10:53:03] (03PS3) 10Jbond: wmf_styleguide: bump stylgude gem to 1.0.7 [puppet] - 10https://gerrit.wikimedia.org/r/632687 [10:53:26] (03CR) 10jerkins-bot: [V: 04-1] wmf_styleguide: bump stylgude gem to 1.0.7 [puppet] - 10https://gerrit.wikimedia.org/r/632687 (owner: 10Jbond) [10:54:41] (03PS1) 10Jbond: 1.0.7: update puppet-lint dependency [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/632692 [10:54:48] (03CR) 10MSantos: [WIP] maps: block 3rd parties with 403, even hits (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570156 (https://phabricator.wikimedia.org/T244278) (owner: 10BBlack) [10:54:58] (03CR) 10jerkins-bot: [V: 04-1] 1.0.7: update puppet-lint dependency [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/632692 (owner: 10Jbond) [10:57:52] (03PS4) 10Jbond: wmf_styleguide: bump stylgude gem to 1.0.7 [puppet] - 10https://gerrit.wikimedia.org/r/632687 [10:58:00] !log Set innodb_change_buffering = inserts on pc2009 T263443 [10:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:06] T263443: Evaluate the impact of changing innodb_change_buffering to inserts - https://phabricator.wikimedia.org/T263443 [10:58:23] (03CR) 10jerkins-bot: [V: 04-1] wmf_styleguide: bump stylgude gem to 1.0.7 
[puppet] - 10https://gerrit.wikimedia.org/r/632687 (owner: 10Jbond) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201007T1100). [11:00:04] kart_ and hnowlan: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:16] I can deploy today [11:00:21] * kart_ is here. [11:01:11] (03PS2) 10Urbanecm: Set CXMTThresholdForPublish to 95% for Vietnamese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632214 (https://phabricator.wikimedia.org/T264161) (owner: 10KartikMistry) [11:01:22] (03CR) 10Urbanecm: [C: 03+2] Set CXMTThresholdForPublish to 95% for Vietnamese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632214 (https://phabricator.wikimedia.org/T264161) (owner: 10KartikMistry) [11:01:31] kart_: are you able to test it? [11:02:00] Urbanecm: Yeah. Can check value is set. [11:02:13] (03Merged) 10jenkins-bot: Set CXMTThresholdForPublish to 95% for Vietnamese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632214 (https://phabricator.wikimedia.org/T264161) (owner: 10KartikMistry) [11:03:53] kart_: pulled onto mwdebug2001 [11:04:16] Testing.. [11:06:22] Urbanecm: looks good. Please go ahead. [11:06:40] syncing [11:07:05] hnowlan: hello, are you around? [11:07:11] hi! [11:07:20] cool! 
[11:07:46] (03CR) 10Urbanecm: [C: 03+2] Fix OAuthRateLimiter rate limit configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632589 (owner: 10Clarakosi) [11:07:58] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 6cdeea2c4c15780a641722157584f12febedab2a: Set CXMTThresholdForPublish to 95% for Vietnamese Wikipedia (T264161) (duration: 00m 59s) [11:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:04] T264161: Adjust the threshold for Vietnamese to prevent publishing when overall unmodified content is higher than 95% - https://phabricator.wikimedia.org/T264161 [11:08:13] kart_: synced [11:08:42] (03Merged) 10jenkins-bot: Fix OAuthRateLimiter rate limit configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632589 (owner: 10Clarakosi) [11:09:01] hnowlan: are you able to test the patch? [11:09:20] Urbanecm: Thanks! [11:10:11] (03CR) 10Lucas Werkmeister (WMDE): "The task says that this should also affect Test Wikidata, but this change only sets the config for Wikidata." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631809 (https://phabricator.wikimedia.org/T254280) (owner: 10Abián) [11:10:12] Urbanecm: yep, should be possible to a reasonable degree [11:10:27] hnowlan: in that case, please test it via mwdebug2001 :) [11:10:36] will do, thanks! 
[11:12:22] just one more check [11:12:32] hnowlan: sure, I'm waiting [11:13:42] Urbanecm: looks good, please proceed [11:13:46] syncing, thanks [11:14:04] !log urbanecm@deploy1001 sync-file aborted: 57297362c0a22ecf16648b7be4a73c4cb80d53ef: Fix OAuthRateLimiter rate limit configuration (duration: 00m 02s) [11:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:42] (03PS2) 10Volans: dns: consolidate reverse zone files (part 1) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632574 (https://phabricator.wikimedia.org/T264273) [11:15:15] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 57297362c0a22ecf16648b7be4a73c4cb80d53ef: Fix OAuthRateLimiter rate limit configuration (duration: 00m 59s) [11:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:39] hnowlan: should be live [11:15:58] great, thank you! [11:16:01] no problem [11:17:04] (03PS3) 10Urbanecm: Enable bot passwords at all fishbowl and private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631423 (https://phabricator.wikimedia.org/T258356) [11:17:07] (03CR) 10JMeybohm: [C: 03+2] deployment_server::helmfile: Allow to define general values in hiera [puppet] - 10https://gerrit.wikimedia.org/r/632658 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [11:17:28] (03CR) 10Urbanecm: [C: 03+2] Enable bot passwords at all fishbowl and private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631423 (https://phabricator.wikimedia.org/T258356) (owner: 10Urbanecm) [11:17:33] (03CR) 10Filippo Giunchedi: [C: 03+1] conftool-data: add new restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/632497 (https://phabricator.wikimedia.org/T261512) (owner: 10Hnowlan) [11:18:22] (03Merged) 10jenkins-bot: Enable bot passwords at all fishbowl and private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631423 (https://phabricator.wikimedia.org/T258356) (owner: 10Urbanecm) 
[11:19:10] (03CR) 10Volans: "After having tried various approaches I couldn't fine one that was safely and reliably generating the new records, keeping the old one and" (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632574 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans) [11:21:48] (03PS1) 10Volans: sre.dns.netbox: add --skip-authdns-update option [cookbooks] - 10https://gerrit.wikimedia.org/r/632697 (https://phabricator.wikimedia.org/T264846) [11:21:53] (03PS5) 10JMeybohm: deployment_server::helmfile: Allow default secrets per environment [puppet] - 10https://gerrit.wikimedia.org/r/631720 (https://phabricator.wikimedia.org/T260917) [11:22:12] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: f85bc3056f809910c0487fb0b0559b3de92b1992: Enable bot passwords at all fishbowl and private wikis (T258356) (duration: 00m 58s) [11:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:18] T258356: Allow users at all private/fishbowl wikis to use botpasswords - https://phabricator.wikimedia.org/T258356 [11:22:25] !log EU B&C window done [11:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:45] (03PS1) 10JMeybohm: Move to use tls.image_version from general values [deployment-charts] - 10https://gerrit.wikimedia.org/r/632699 [11:32:06] 04Critical Alert for device cr3-knams.wikimedia.org - Traffic bill over quota [11:32:10] 04Critical Alert for device cr2-eqdfw.wikimedia.org - Traffic bill over quota [11:33:05] 04Critical Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota [11:33:39] XioNoX: ^^^ ( librenms-wmf ) [11:33:54] yep [11:33:54] thx [11:34:22] (03CR) 10JMeybohm: [C: 03+2] Move to use tls.image_version from general values [deployment-charts] - 10https://gerrit.wikimedia.org/r/632699 (owner: 10JMeybohm) [11:36:36] (03Merged) 10jenkins-bot: Move to use tls.image_version from general values [deployment-charts] - 
10https://gerrit.wikimedia.org/r/632699 (owner: 10JMeybohm) [11:37:46] 10Operations, 10DBA, 10Wikidata, 10serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10Addshore) One of the places to look that should track all access to this term related storage is https://grafana.wikimedia.... [11:38:41] (03PS1) 10Muehlenhoff: Add a pbuilder hook to include apt sources for icu63-linked packages [puppet] - 10https://gerrit.wikimedia.org/r/632700 [11:42:34] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/631270 (owner: 10PipelineBot) [11:43:06] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/631782 (owner: 10PipelineBot) [11:43:48] (03PS2) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/631783 (owner: 10PipelineBot) [11:46:11] (03PS3) 10Abián: Require autoconfirmed status to edit Wikidata Properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631809 (https://phabricator.wikimedia.org/T254280) [11:47:46] (03CR) 10Abián: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631809 (https://phabricator.wikimedia.org/T254280) (owner: 10Abián) [11:51:39] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add a pbuilder hook to include apt sources for icu63-linked packages [puppet] - 10https://gerrit.wikimedia.org/r/632700 (owner: 10Muehlenhoff) [11:51:49] <_joe_> thanks moritzm :) [11:53:06] 04Critical Alert for device cr3-knams.wikimedia.org - Traffic bill over quota got acknowledged [11:53:06] I'm making progress, BTW: I have a co-installable icu63 ready backport and currently working on rebuilding PHP 7.2 against it [11:53:10] 04Critical Alert for device cr2-eqdfw.wikimedia.org - Traffic bill over quota got acknowledged [11:53:15] 04Critical Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota 
got acknowledged [11:54:36] <_joe_> moritzm: ohhh nice :) [11:55:51] <_joe_> !log rolling restart of restbase due to running puppet with changed config-vars (a noop for the actual configuration) [11:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:10] (03PS5) 10Jbond: wmf_styleguide: bump stylgude gem to 1.0.7 [puppet] - 10https://gerrit.wikimedia.org/r/632687 [11:56:12] (03PS1) 10Jbond: rubocop: fix current ribocop issues [puppet] - 10https://gerrit.wikimedia.org/r/632707 [12:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201007T1200) [12:01:22] (03PS2) 10Jbond: rubocop: fix current ribocop issues [puppet] - 10https://gerrit.wikimedia.org/r/632707 [12:04:18] (03PS3) 10Jbond: rubocop: fix current ribocop issues [puppet] - 10https://gerrit.wikimedia.org/r/632707 [12:05:17] (03PS6) 10Jbond: wmf_styleguide: bump stylgude gem to 1.0.7 [puppet] - 10https://gerrit.wikimedia.org/r/632687 [12:05:42] (03CR) 10Muehlenhoff: [C: 03+2] Add a pbuilder hook to include apt sources for icu63-linked packages [puppet] - 10https://gerrit.wikimedia.org/r/632700 (owner: 10Muehlenhoff) [12:05:44] (03CR) 10Jbond: [C: 03+2] rubocop: fix current ribocop issues [puppet] - 10https://gerrit.wikimedia.org/r/632707 (owner: 10Jbond) [12:05:58] (03PS4) 10Jbond: rubocop: fix current rubocop issues [puppet] - 10https://gerrit.wikimedia.org/r/632707 [12:06:05] (03CR) 10Jbond: [V: 03+2 C: 03+2] rubocop: fix current rubocop issues [puppet] - 10https://gerrit.wikimedia.org/r/632707 (owner: 10Jbond) [12:06:26] (03PS1) 10Giuseppe Lavagetto: restbase: remove monitoring calls to the http endpoint [puppet] - 10https://gerrit.wikimedia.org/r/632708 [12:06:28] (03PS7) 10Jbond: wmf_styleguide: bump stylgude gem to 1.0.7 [puppet] - 10https://gerrit.wikimedia.org/r/632687 [12:07:01] (03PS8) 10Jbond: wmf_styleguide: bump stylgude gem to 1.0.7 [puppet] - 10https://gerrit.wikimedia.org/r/632687 [12:08:14] 
(03PS1) 10Alexandros Kosiaris: Remove restrouter [labs/private] - 10https://gerrit.wikimedia.org/r/632709 [12:08:18] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10mforns) Thanks @ayounsi > `"comms": "14907:0_14907:2_14907:3"` > To mean that the flow has the 3 communities 14907:0 14907... [12:09:33] (03PS1) 10JMeybohm: blubberoid: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632710 (https://phabricator.wikimedia.org/T264157) [12:09:35] (03PS1) 10JMeybohm: cxserver: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632711 (https://phabricator.wikimedia.org/T264157) [12:09:37] (03PS1) 10JMeybohm: eventgate-analytics: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632712 (https://phabricator.wikimedia.org/T264157) [12:09:39] (03PS1) 10JMeybohm: eventgate-main: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632713 (https://phabricator.wikimedia.org/T264157) [12:09:41] (03PS1) 10JMeybohm: mathoid: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632714 (https://phabricator.wikimedia.org/T264157) [12:09:44] (03PS1) 10JMeybohm: mobileapps: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632715 (https://phabricator.wikimedia.org/T264157) [12:09:45] (03PS1) 10JMeybohm: proton: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632716 (https://phabricator.wikimedia.org/T264157) [12:09:48] (03PS1) 
10JMeybohm: push-notifications: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632717 (https://phabricator.wikimedia.org/T264157) [12:09:51] (03PS1) 10JMeybohm: termbox: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632718 (https://phabricator.wikimedia.org/T264157) [12:09:53] (03PS1) 10JMeybohm: wikifeeds: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632719 (https://phabricator.wikimedia.org/T264157) [12:09:55] (03PS1) 10JMeybohm: zotero: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632720 (https://phabricator.wikimedia.org/T264157) [12:12:16] (03CR) 10jerkins-bot: [V: 04-1] proton: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632716 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [12:17:28] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/632716 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [12:18:42] (03CR) 10JMeybohm: [C: 03+2] blubberoid: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632710 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [12:19:31] 10Operations, 10DBA, 10Wikidata, 10serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10Addshore) Flagging {T263999} up here as it could be related (not a currently prooven link) [12:21:16] (03Merged) 10jenkins-bot: blubberoid: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632710 
(https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [12:22:39] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [12:22:40] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [12:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:53] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [12:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:09] 10Operations, 10DBA, 10Wikidata, 10serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10Addshore) ` 1:23 PM I also notice this? https://grafana.wikimedia.org/d/000000202/api-frontend-summary?orgId=1&f... [12:32:31] (03CR) 10Elukey: [C: 03+2] dumps::web::fetches::stat_dumps: add rsync job for pageview complete [puppet] - 10https://gerrit.wikimedia.org/r/629409 (https://phabricator.wikimedia.org/T251777) (owner: 10Fdans) [12:33:03] jbond42: are you still merging? [12:33:17] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [12:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:51] elukey: yes sorry its going now [12:34:14] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [12:34:17] super :) [12:44:10] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10ayounsi) > Awesome. 
Yes, as you said, Druid allows for multi-value dimensions. Either the Refine job or a subsequent job can... [12:45:04] 10Operations, 10DBA, 10Wikidata, 10serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10Ladsgroup) An update from IRC, it seems the culprit is wikifeeds: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry... [12:46:48] (03PS1) 10Elukey: dumps::web::fetches::stats: fix pageview's hdfs path [puppet] - 10https://gerrit.wikimedia.org/r/632724 [12:47:56] (03CR) 10Fdans: [C: 03+1] "This looks perfect elukey, thank you so much for the speedy correction" [puppet] - 10https://gerrit.wikimedia.org/r/632724 (owner: 10Elukey) [12:48:01] (03CR) 10Elukey: [C: 03+2] dumps::web::fetches::stats: fix pageview's hdfs path [puppet] - 10https://gerrit.wikimedia.org/r/632724 (owner: 10Elukey) [12:48:28] (03CR) 10Jbond: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/632591 (https://phabricator.wikimedia.org/T242602) (owner: 10Dzahn) [12:51:13] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10mforns) OK, after a very interesting chat with Joseph, here's our conclusions: * It would be cool to have the core of the r... 
[12:52:48] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/631315 (owner: 10Dzahn) [12:53:43] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/630694 (owner: 10Dzahn) [12:54:55] (03CR) 10Jbond: [C: 03+1] "> Patch Set 7: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/630694 (owner: 10Dzahn) [12:55:44] * Pchelolo will deploy a little config if nobody minds [12:56:21] oh, no, never mind [12:56:58] (03PS1) 10Kormat: admin: Replace leila with leizi [puppet] - 10https://gerrit.wikimedia.org/r/632726 (https://phabricator.wikimedia.org/T264472) [12:58:08] (03Abandoned) 10Alexandros Kosiaris: lvs: Remove unused eventgate-main-http service [puppet] - 10https://gerrit.wikimedia.org/r/562810 (https://phabricator.wikimedia.org/T241073) (owner: 10Alexandros Kosiaris) [13:00:04] hashar and marxarelli: It is that lovely time of the day again! You are hereby commanded to deploy Mediawiki train - European+American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201007T1300). [13:03:15] 10Operations, 10DBA, 10Wikidata, 10serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10Addshore) ` 1:57 PM While i was browsing around I also saw spikes in action=query in grafana, but couldn't dive... [13:09:28] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Ah, tech debt removal. Nice, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/632532 (owner: 10Jbond) [13:11:48] 10Operations, 10DBA, 10Wikidata, 10serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10Joe) To clarify a bit - restbase has hourly spikes of requests for the `feed` endpoint, which go back to wikifeeds, which c... 
[13:13:05] (03PS3) 10Alexandros Kosiaris: termbox: use k8s stdout/stderr logging only [deployment-charts] - 10https://gerrit.wikimedia.org/r/632432 (https://phabricator.wikimedia.org/T245603) (owner: 10Filippo Giunchedi) [13:13:51] (03CR) 10Alexandros Kosiaris: "Bundling the named_levels: true change in this to get somewhat more meaningful log levels in logstash." [deployment-charts] - 10https://gerrit.wikimedia.org/r/632432 (https://phabricator.wikimedia.org/T245603) (owner: 10Filippo Giunchedi) [13:14:35] (03CR) 10Jbond: [C: 03+2] wmflib: drop apply_format function [puppet] - 10https://gerrit.wikimedia.org/r/632532 (owner: 10Jbond) [13:15:54] (03PS4) 10Alexandros Kosiaris: termbox: use k8s stdout/stderr logging only [deployment-charts] - 10https://gerrit.wikimedia.org/r/632432 (https://phabricator.wikimedia.org/T245603) (owner: 10Filippo Giunchedi) [13:18:31] !log volker-e@deploy1001 Started deploy [design/style-guide@e3fda83]: Deploy design/style-guide: [13:18:32] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Needs a manual rebase." [puppet] - 10https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) (owner: 10Hashar) [13:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:36] !log volker-e@deploy1001 Finished deploy [design/style-guide@e3fda83]: Deploy design/style-guide: (duration: 00m 04s) [13:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:07] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10mforns) @ayounsi > Ok to merge anytime or should I sync up with you? I believe it's OK to merge, and that Refine should id... [13:19:45] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/632432 (https://phabricator.wikimedia.org/T245603) (owner: 10Filippo Giunchedi) [13:20:24] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (sans the not-yet-confirmed SSH key)" [puppet] - 10https://gerrit.wikimedia.org/r/632726 (https://phabricator.wikimedia.org/T264472) (owner: 10Kormat) [13:22:05] (03Merged) 10jenkins-bot: termbox: use k8s stdout/stderr logging only [deployment-charts] - 10https://gerrit.wikimedia.org/r/632432 (https://phabricator.wikimedia.org/T245603) (owner: 10Filippo Giunchedi) [13:27:31] (03CR) 10Jbond: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/628459 (owner: 10Dzahn) [13:29:25] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (10Kormat) @leila: Ok, we're almost ready to go. The only remaining thing is to confirm your ssh key over a medium we have more c... [13:30:45] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (10Kormat) [13:31:12] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog: Fix node vs nodejs dependency issue - https://phabricator.wikimedia.org/T214153 (10Jgiannelos) Should be fixed after: ` commit 2c85e694e0bf2e4fa5b4489b8cd7a01bef196786 Author: Guillaume Lederrey Date: Wed May 8... 
[13:31:47] (03CR) 10Elukey: [C: 03+1] "Thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/631302 (owner: 10Dzahn) [13:33:15] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog: Fix node vs nodejs dependency issue - https://phabricator.wikimedia.org/T214153 (10Jgiannelos) 05Open→03Resolved a:03Jgiannelos [13:36:53] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10CDanis) [13:37:00] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10CDanis) p:05Triage→03High [13:37:16] (03CR) 10Elukey: [C: 03+1] "Little nit but looks good! Can you also file a code change to add admin_groups_no_ssh to the coordinator role hiera configs? You can have " (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi) [13:37:18] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . [13:37:18] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' . [13:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:29] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' . 
[13:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:37] (03CR) 10Elukey: "Really great work, left some comments if you want to follow up but don't consider them a blocker :)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) (owner: 10Hnowlan) [13:42:09] !log updated envoyproxy to 1.15.1-2 on all eqiad hosts [13:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:17] (03CR) 10Muehlenhoff: oozie: use admin groups to determine admin access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi) [13:45:42] (03CR) 10Elukey: "Left some comments, lemme know your thoughts!" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/632248 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [13:48:32] (03CR) 10Elukey: [C: 03+1] oozie: use admin groups to determine admin access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi) [13:51:46] (03CR) 10JMeybohm: [C: 03+2] cxserver: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632711 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [13:58:57] (03CR) 10Elukey: [C: 03+1] "Please check pcc on an-conf1001,conf1004,conf2001.codfw.wmnet,druid1004,an-druid1001 before merging, I tried but it seems failing due to t" [puppet] - 10https://gerrit.wikimedia.org/r/631295 (owner: 10Dzahn) [13:59:02] (03Merged) 10jenkins-bot: cxserver: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632711 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [13:59:03] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Deprecate all non-Kafka 
logstash inputs - https://phabricator.wikimedia.org/T227080 (10fgiunchedi) [14:00:32] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [14:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikidata, and 2 others: Check for errors on wdqs1009 disks - https://phabricator.wikimedia.org/T263125 (10Cmjohnson) 05Open→03Resolved @RobH @gehel the SSD has been replaced. [14:03:15] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'production' . [14:03:17] 10Operations, 10netops, 10Security, 10User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (10jbond) [14:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps10[05-10].eqiad.wmnet - https://phabricator.wikimedia.org/T260269 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` maps1005.eqiad.wmnet ` The log can be found in... [14:04:10] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' . 
[14:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:15] !log Ran "mwscript extensions/Wikibase/repo/maintenance/changePropertyDataType.php --wiki=wikidatawiki --property-id P1820 --new-data-type external-id" on mwmaint2001 (T263986) [14:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:20] T263986: Convert P1820 from string to external ID - https://phabricator.wikimedia.org/T263986 [14:05:25] (03PS3) 10Jbond: (WIP) firewall: change to default reject instead of drop [puppet] - 10https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) [14:06:21] PROBLEM - MD RAID on wdqs1009 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:06:22] ACKNOWLEDGEMENT - MD RAID on wdqs1009 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T264889 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:06:26] 10Operations, 10ops-eqiad: Degraded RAID on wdqs1009 - https://phabricator.wikimedia.org/T264889 (10ops-monitoring-bot) [14:08:56] 10Operations, 10netops, 10Patch-For-Review, 10Security, 10User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (10jbond) [14:09:25] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1033 psu redundancy alert - https://phabricator.wikimedia.org/T263145 (10Cmjohnson) The part has not arrived [14:09:36] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10Cmjohnson) a:05Cmjohnson→03wiki_willy assigning this to @wiki_willy to figure out whether we want to upgrade the 
power supplies [14:14:11] 10Operations, 10ops-eqiad, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps10[05-10].eqiad.wmnet - https://phabricator.wikimedia.org/T260269 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps1005.eqiad.wmnet'] ` Of which those **FAILED**: ` ['maps1005.eqiad.wmnet'] ` [14:22:32] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' . [14:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:16] ACKNOWLEDGEMENT - Device not healthy -SMART- on mw2279 is CRITICAL: cluster=jobrunner device=sdb instance=mw2279 job=node site=codfw Muehlenhoff T264698 https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mw2279&var-datasource=codfw+prometheus/ops [14:26:16] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2279 is CRITICAL: Host mw2279 is not in mediawiki-installation dsh group Muehlenhoff T264698 https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [14:33:38] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:42] 10Operations, 10netops, 10Patch-For-Review, 10Security, 10User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (10CDanis) I don't think the reflection is a concern; everything from NTP to memcache to OpenVPN to CLDAP is a better reflector by orders of magnitude. Apply... [14:36:31] 10Operations, 10netops, 10Patch-For-Review, 10Security, 10User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (10jbond) >>! In T264888#6525436, @CDanis wrote: > I don't think the reflection is a concern; everything from NTP to memcache to OpenVPN to CLDAP is a better... 
[14:37:16] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:27] 10Operations, 10Release-Engineering-Team, 10Wikimedia Design Style Guide: Deployment of latest Design Style Guide Gerrit clone doesn't seem to succeed - https://phabricator.wikimedia.org/T264894 (10Volker_E) [14:38:59] 10Operations, 10netops, 10Patch-For-Review, 10Security, 10User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (10CDanis) +1 from me then :) [14:41:02] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:21] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:29] 10Operations, 10netops, 10Patch-For-Review, 10Security, 10User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (10ayounsi) As discussed over IRC, that LGTM. Would it be easy to rollback if there are any issues or the result is not as fast as expected? [14:50:54] 10Operations, 10netops, 10Patch-For-Review, 10Security, 10User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (10jbond) > Would it be easy to rollback if there are any issues or the result is not as fast as expected? Yes and although the current CR doesn't include th... [14:52:37] 10Operations, 10netops, 10Patch-For-Review, 10Security, 10User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (10MoritzMuehlenhoff) Agreed, the service enumation/information disclosure angle is moot for us, so let's give this a shot. If we make it configurable via Hie... 
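Note on the T264888 discussion above (default reject instead of drop): a rejected connection attempt fails immediately on the client side, while a dropped one leaves the client waiting for its own timeout. The snippet below is a minimal sketch of how that difference can be observed from a client; the function name `classify_port` and the 2-second timeout are illustrative assumptions, not part of ferm or of the patch under review.

```python
import socket


def classify_port(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify how a TCP connection attempt to host:port ends.

    Hypothetical helper for illustration only:
      "open"      -> the TCP handshake completed
      "rejected"  -> the peer answered with a reset (what a REJECT-style
                     policy or a closed port looks like to the client)
      "no-answer" -> nothing came back before the timeout (what a
                     DROP-style policy looks like to the client)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "rejected"
    except socket.timeout:
        return "no-answer"
```

With a drop policy the caller pays the full timeout before learning anything; with reject the answer arrives in roughly one round trip, which is the speed-up the task is hoping for.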
[14:53:03] 10Operations, 10ops-eqiad, 10decommission-hardware: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (10Cmjohnson) [14:57:09] (03CR) 10CDanis: [C: 03+1] Pmacct add standard BGP community to flows [puppet] - 10https://gerrit.wikimedia.org/r/632603 (https://phabricator.wikimedia.org/T254332) (owner: 10Ayounsi) [14:58:33] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10Nuria) >Maybe leaning towards using a transform function, because code would be shorter and less moving pieces? I think havi... [14:58:44] 10Operations, 10ops-eqiad, 10decommission-hardware: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (10Cmjohnson) 05Open→03Resolved I ran the script in netbox to remove all of these hosts and then ran the cookbook in cumin, removed all from the racks, cleaned up the netw... [14:59:05] 10Operations, 10ops-eqiad: Degraded RAID on wdqs1009 - https://phabricator.wikimedia.org/T264889 (10Cmjohnson) 05Open→03Declined ticket open for this already [15:07:31] (03PS1) 10Ssingh: wikidough: enable OCSP stapling in dnsdist [puppet] - 10https://gerrit.wikimedia.org/r/632735 (https://phabricator.wikimedia.org/T252132) [15:08:31] 10Operations, 10ops-eqiad, 10decommission-hardware: Decommission asw2-a5-eqiad - https://phabricator.wikimedia.org/T217383 (10Cmjohnson) [15:08:40] 10Operations, 10ops-eqiad, 10decommission-hardware: Decommission asw2-a5-eqiad - https://phabricator.wikimedia.org/T217383 (10Cmjohnson) 05Open→03Resolved [15:08:50] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10Cmjohnson) [15:10:07] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/25770/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/632735 
(https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:12:46] !log upgrade rsyslog to 8.2008.0-1~bpo10+1 on centrallog1001 - T259780 [15:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:52] T259780: rsyslog occasional segfault on centrallog hosts - https://phabricator.wikimedia.org/T259780 [15:13:22] (03PS1) 10Cmjohnson: Removing dns entries for analytics1028-31,33-41 [dns] - 10https://gerrit.wikimedia.org/r/632736 (https://phabricator.wikimedia.org/T227485) [15:13:32] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi) [15:13:38] 10Operations, 10ops-eqiad, 10decommission-hardware: Decommission asw2-a5-eqiad - https://phabricator.wikimedia.org/T217383 (10ayounsi) 05Resolved→03Open Console still connected in Netbox... :) [15:14:51] (03PS4) 10Abián: Require autoconfirmed status to edit Wikidata Properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631809 (https://phabricator.wikimedia.org/T254280) [15:15:05] (03CR) 10Cmjohnson: [C: 03+2] Removing dns entries for analytics1028-31,33-41 [dns] - 10https://gerrit.wikimedia.org/r/632736 (https://phabricator.wikimedia.org/T227485) (owner: 10Cmjohnson) [15:16:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps10[05-10].eqiad.wmnet - https://phabricator.wikimedia.org/T260269 (10Cmjohnson) @RobH Can you look into these please, I get them to do the initial install but I am getting an error I haven't seen before. IPMI Password: 14:... [15:18:23] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10Tsevener) Hi @CDanis! This must be related to our new widgets, which request... 
[15:18:56] (03PS1) 10Cwhite: profile: apply ipsec monitoring where enabled with ipsec_exporter [puppet] - 10https://gerrit.wikimedia.org/r/632738 (https://phabricator.wikimedia.org/T148976) [15:18:58] (03PS1) 10Cwhite: profile: clean up ipsec aggregate check [puppet] - 10https://gerrit.wikimedia.org/r/632739 (https://phabricator.wikimedia.org/T148976) [15:19:57] (03CR) 10jerkins-bot: [V: 04-1] profile: apply ipsec monitoring where enabled with ipsec_exporter [puppet] - 10https://gerrit.wikimedia.org/r/632738 (https://phabricator.wikimedia.org/T148976) (owner: 10Cwhite) [15:20:09] 10Operations, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (10Cmjohnson) All the dns records have been manually removed as well [15:23:16] 10Operations, 10ops-eqiad, 10netops, 10User-Kormat, 10User-jijiki: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10Cmjohnson) [15:24:07] 10Operations, 10ops-eqiad, 10netops, 10User-Kormat, 10User-jijiki: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10Cmjohnson) 05Open→03Resolved ran script for the old asw2-d4 and changed name to old-asw2-d4. Changed name in netbox from asw3-d4 to asw2-d4 [15:35:45] 10Operations, 10observability: Grafana error: "parse error at char 1: unexpected character: '\\ufeff'" when copy-pasting metric names - https://phabricator.wikimedia.org/T263624 (10colewhite) I followed the replication steps and did not see the `\\ufeff` or `` artifacts appear in either the Grafana explo... 
[15:36:46] (03CR) 10Jbond: [C: 03+2] wmf_styleguide: bump stylgude gem to 1.0.7 [puppet] - 10https://gerrit.wikimedia.org/r/632687 (owner: 10Jbond) [15:48:37] (03CR) 10Hashar: "The CI entry point (rake test) has optimizations which leads to puppet-lint NOT being run when simply changing the Gemfile, and it would o" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/632687 (owner: 10Jbond) [15:49:53] (03CR) 10Andrew Bogott: [C: 03+1] "This will probably break a few things but I don't think we should stand in the way of progress." [puppet] - 10https://gerrit.wikimedia.org/r/632471 (https://phabricator.wikimedia.org/T210993) (owner: 10Muehlenhoff) [15:50:34] (03PS1) 10Volans: dns: add --keep-files option [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632745 (https://phabricator.wikimedia.org/T264846) [15:53:01] (03CR) 10Herron: "I'm not sure this would work as expected since the metric is affected by availability of the remote side of the tunnel too." [puppet] - 10https://gerrit.wikimedia.org/r/632738 (https://phabricator.wikimedia.org/T148976) (owner: 10Cwhite) [15:53:28] (03PS2) 10Volans: sre.dns.netbox: add --skip-authdns-update option [cookbooks] - 10https://gerrit.wikimedia.org/r/632697 (https://phabricator.wikimedia.org/T264846) [15:53:30] (03PS1) 10Volans: sre.dns.netbox: add --emergency-manual-edit option [cookbooks] - 10https://gerrit.wikimedia.org/r/632746 (https://phabricator.wikimedia.org/T264846) [15:53:44] (03PS1) 10Gerrit maintenance bot: Add smn to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/632747 (https://phabricator.wikimedia.org/T264859) [15:53:51] (03CR) 10Volans: "To be used in the cookbook here: I939b2d9219760efb9c4cfe8137258ee68772737d" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632745 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans) [15:54:23] (03CR) 10Volans: "> Patch Set 1:" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632745 
(https://phabricator.wikimedia.org/T264846) (owner: 10Volans) [15:54:27] (03CR) 10Urbanecm: [C: 03+1] Add smn to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/632747 (https://phabricator.wikimedia.org/T264859) (owner: 10Gerrit maintenance bot) [15:55:22] (03CR) 10Volans: "See the Depends-On for the generation script side of it." [cookbooks] - 10https://gerrit.wikimedia.org/r/632746 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans) [16:00:58] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Gilles) [16:03:09] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Gilles) @ema @bblack before I build it, I want to confirm that some complimentary information you're looking for is the ability to break down... [16:07:01] 10Operations, 10ops-eqiad, 10decommission-hardware: Decommission asw2-a5-eqiad - https://phabricator.wikimedia.org/T217383 (10Cmjohnson) It’s not [16:08:25] 10Operations, 10ops-eqiad, 10decommission-hardware: Decommission asw2-a5-eqiad - https://phabricator.wikimedia.org/T217383 (10ayounsi) 05Open→03Resolved Looks like a cache issue? It was for me, but not anymore. Thanks! [16:08:39] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi) [16:16:08] 10Operations, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T264630 (10ahemmer) Hi there, @Kormat Approving as @CGlenn manager. Thank you! 
[16:19:24] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Gilles) [16:24:14] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10Tsevener) a:03Dmantena [16:24:45] (03PS1) 10Jforrester: Preload class used in HeaderCallback [core] (wmf/1.36.0-wmf.12) - 10https://gerrit.wikimedia.org/r/632679 (https://phabricator.wikimedia.org/T261260) [16:24:53] (03PS1) 10Hashar: Preload class used in HeaderCallback [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/632680 (https://phabricator.wikimedia.org/T261260) [16:25:04] (03PS1) 10Jforrester: Preload class used in HeaderCallback [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/632681 (https://phabricator.wikimedia.org/T261260) [16:25:21] (03PS2) 10Hashar: Preload class used in HeaderCallback [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/632681 (https://phabricator.wikimedia.org/T261260) (owner: 10Jforrester) [16:26:18] (03CR) 10Hashar: [C: 03+2] "In case we want to deplo wmf.12, given wmf.11 has not be rolled yet." [core] (wmf/1.36.0-wmf.12) - 10https://gerrit.wikimedia.org/r/632357 (https://phabricator.wikimedia.org/T263178) (owner: 10TrainBranchBot) [16:26:45] (03CR) 10Hashar: "The wmf.12 branches does not have yet modules registered. 
So we need that to be done first: https://gerrit.wikimedia.org/r/c/mediawiki/cor" [core] (wmf/1.36.0-wmf.12) - 10https://gerrit.wikimedia.org/r/632679 (https://phabricator.wikimedia.org/T261260) (owner: 10Jforrester) [16:26:54] (03CR) 10Hashar: [C: 04-1] Preload class used in HeaderCallback [core] (wmf/1.36.0-wmf.12) - 10https://gerrit.wikimedia.org/r/632679 (https://phabricator.wikimedia.org/T261260) (owner: 10Jforrester) [16:27:45] (03CR) 10Jforrester: "> Patch Set 1:" [core] (wmf/1.36.0-wmf.12) - 10https://gerrit.wikimedia.org/r/632679 (https://phabricator.wikimedia.org/T261260) (owner: 10Jforrester) [16:30:13] (03PS3) 10Dzahn: switch webproxy for esams/ulsfo/eqsin to their local install server [dns] - 10https://gerrit.wikimedia.org/r/632591 (https://phabricator.wikimedia.org/T242602) [16:32:31] (03CR) 10Hashar: [C: 03+2] "We can deploying it and test it on mwdebug when that merges." [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/632680 (https://phabricator.wikimedia.org/T261260) (owner: 10Hashar) [16:32:35] (03PS1) 10Jbond: ceph::osd: fix puppet-lint issues [puppet] - 10https://gerrit.wikimedia.org/r/632756 [16:33:55] (03CR) 10Dzahn: [C: 03+2] switch webproxy for esams/ulsfo/eqsin to their local install server [dns] - 10https://gerrit.wikimedia.org/r/632591 (https://phabricator.wikimedia.org/T242602) (owner: 10Dzahn) [16:35:38] !log switching webproxy service names to the new local install servers in esams/eqsin/ulsfo T242602 [16:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:46] T242602: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 [16:42:41] (03PS1) 10Dwisehaupt: Flip payments to codfw for extended test [dns] - 10https://gerrit.wikimedia.org/r/632758 (https://phabricator.wikimedia.org/T254298) [16:43:23] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - 
https://phabricator.wikimedia.org/T260448 (10wiki_willy) Weird, I've never seen PSU's being affected like that, after a memory before, so I'll take it as an action item to reach out to t... [16:49:16] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10BBlack) >>! In T264398#6525968, @Gilles wrote: > @ema @bblack before I build it, I want to confirm that some complimentary information you're... [16:49:52] 10Operations, 10netops, 10Patch-For-Review, 10Security, 10User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (10Volans) @jbond FYI as a side note once this is deployed we can probably revisit a bit the firewall rules of the failoid hosts, that were designed already w... [16:56:02] (03PS1) 10Jbond: P:netbox: fix puppet-lint complaints [puppet] - 10https://gerrit.wikimedia.org/r/632760 [16:56:57] (03CR) 10Jbond: [C: 03+2] ceph::osd: fix puppet-lint issues [puppet] - 10https://gerrit.wikimedia.org/r/632756 (owner: 10Jbond) [16:57:24] (03CR) 10Jbond: [C: 03+2] P:netbox: fix puppet-lint complaints [puppet] - 10https://gerrit.wikimedia.org/r/632760 (owner: 10Jbond) [16:59:45] (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.12 [core] (wmf/1.36.0-wmf.12) - 10https://gerrit.wikimedia.org/r/632357 (https://phabricator.wikimedia.org/T263178) (owner: 10TrainBranchBot) [16:59:53] (03Merged) 10jenkins-bot: Preload class used in HeaderCallback [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/632680 (https://phabricator.wikimedia.org/T261260) (owner: 10Hashar) [17:00:15] 10Operations, 10Technical-blog-posts, 10Traffic: Blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T264729 (10srodlund) @ema Overall, the post is well-written and interesting! I made some minor grammar suggestions. Can you accept / reject them, and I'... 
[17:00:32] (03PS1) 10Jbond: P:logstash/apifeatureusage: fix puppet-lint issues [puppet] - 10https://gerrit.wikimedia.org/r/632761 [17:02:29] (03CR) 10Jbond: [C: 03+2] P:logstash/apifeatureusage: fix puppet-lint issues [puppet] - 10https://gerrit.wikimedia.org/r/632761 (owner: 10Jbond) [17:06:05] (03CR) 10Jbond: "> Patch Set 9:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/632687 (owner: 10Jbond) [17:07:20] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10CDanis) Thanks so much @Tsevener ! Another thing I wanted to ask, while we'... [17:14:13] (03PS1) 10Elukey: Set Debian Stretch for an-launcher1001 [puppet] - 10https://gerrit.wikimedia.org/r/632762 [17:15:19] (03CR) 10Jgreen: [C: 03+2] Flip payments to codfw for extended test [dns] - 10https://gerrit.wikimedia.org/r/632758 (https://phabricator.wikimedia.org/T254298) (owner: 10Dwisehaupt) [17:16:20] (03CR) 10Elukey: [C: 03+2] Set Debian Stretch for an-launcher1001 [puppet] - 10https://gerrit.wikimedia.org/r/632762 (owner: 10Elukey) [17:19:05] 10Operations, 10Patch-For-Review: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 (10Dzahn) 05Open→03Resolved a:03Dzahn I'm declaring this resolved. Everything is done according to the plan. Let me know if you think otherwise. [17:19:35] 10Operations: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 (10Dzahn) [17:25:03] I am going to deploy a hotfix for mediawiki [17:31:13] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10Tsevener) @CDanis we can definitely smear the requests - is there a minimum... 
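Note on the T264881 discussion above (smearing requests that are synchronized to the top of the hour): one common way to do this is to add a random delay to each client's scheduled refresh, so that requests spread over a window instead of arriving together. The snippet below is a minimal sketch of that idea in Python; the function names and the 15-minute window in the usage note are assumptions for illustration, not what the iOS app does or will do.

```python
import random
from typing import Optional


def smeared_delay(window_seconds: float,
                  rng: Optional[random.Random] = None) -> float:
    """Return a uniformly random delay in [0, window_seconds].

    Hypothetical helper: each client draws its own delay, so a fleet of
    clients that would all fire at the same instant is spread across
    the window instead.
    """
    if window_seconds < 0:
        raise ValueError("window_seconds must be non-negative")
    if rng is None:
        rng = random.Random()
    return rng.uniform(0.0, window_seconds)


def scheduled_refresh(top_of_hour_epoch: float,
                      window_seconds: float,
                      rng: Optional[random.Random] = None) -> float:
    """Return the epoch time at which this client should actually refresh."""
    return top_of_hour_epoch + smeared_delay(window_seconds, rng)
```

For example, `scheduled_refresh(hour_start, 900)` would place each client's refresh somewhere in the 15 minutes after `hour_start`; the right window size depends on how fresh the data needs to be.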
[17:33:01] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10mforns) @ayounsi Confirmed that you can merge the changes that add BGP communities to pmacct! We'll be monitoring the kafka... [17:33:16] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [17:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:27] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631809 (https://phabricator.wikimedia.org/T254280) (owner: 10Abián) [17:38:58] (03PS12) 10Ahmon Dancy: Factor out datacenters lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 [17:39:51] (03CR) 10jerkins-bot: [V: 04-1] Factor out datacenters lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 (owner: 10Ahmon Dancy) [17:40:48] (03PS13) 10Ahmon Dancy: Factor out datacenters lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 [17:40:52] (03PS1) 10Ryan Kemper: Revert "Revert "cloudelastic: envoy sits in front now"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632683 [17:42:28] (03CR) 10jerkins-bot: [V: 04-1] Factor out datacenters lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 (owner: 10Ahmon Dancy) [17:44:11] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10Tsevener) @CDanis If the widgets don't have what they need when we request a... 
[17:44:17] (03PS2) 10Ryan Kemper: Revert "Revert "cloudelastic: envoy sits in front now"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632683 [17:45:00] (03CR) 10Ryan Kemper: [C: 03+1] "We've already rolled this out before, so this is safe to merge during today's backport window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632683 (owner: 10Ryan Kemper) [17:47:34] (03CR) 10CRusnov: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/632697 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans) [17:48:12] (03CR) 10CRusnov: [C: 03+1] "LGTM, as discussed" [cookbooks] - 10https://gerrit.wikimedia.org/r/632746 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans) [17:49:58] (03CR) 10CRusnov: [C: 03+1] "Looks good." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632745 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans) [17:50:16] (03PS14) 10Ahmon Dancy: Factor out datacenters lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 [17:55:40] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10CDanis) >>! In T264881#6526357, @Tsevener wrote: > @CDanis If the widgets do... [17:56:21] jouncebot: now [17:56:21] No deployments scheduled for the next 0 hour(s) and 3 minute(s) [17:56:26] well doing the hotfix [17:58:14] (03CR) 10Bstorm: "> Patch Set 2:" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/630589 (owner: 10Jbond) [17:58:16] (03CR) 10CRusnov: [C: 03+1] "Output looks good. Thoroughly discussed." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632574 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans) [17:58:44] !log Pulled https://gerrit.wikimedia.org/r/c/mediawiki/core/+/632680 on deployment staging area and mw2001 [17:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] hashar and marxarelli: (Dis)respected human, time to deploy Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201007T1800). Please do the needful. [18:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201007T1800). Please do the needful. [18:00:04] ryankemper: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:01:44] ryankemper: I hadn't seen your backport [18:02:06] we are rolling one for mediawiki/core , it is being tested [18:02:10] hashar: no worries (and I only added it ~20 mins ago) [18:02:15] take care of the mediawiki stuff first [18:02:16] ahh [18:02:20] and I haven't added mine :/ [18:06:54] tgr_: mwdebug2002 has a bunch of PHP Fatal error: Uncaught Exception in /srv/mediawiki/php-1.36.0-wmf.10/includes/Setup.php:129 in /srv/mediawiki/php-1.36.0-wmf.10/includes/Setup.php on line 129 [18:07:18] guess that is related to the testing going on ;] [18:07:22] yeah [18:08:19] ryankemper: will poke you when we are done [18:08:28] thanks [18:08:45] PROBLEM - Apache HTTP on mwdebug2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2431 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:09:44] hashar: done [18:10:07] couldn't create a HTTP 500 with Cache-Control: public, FWIW [18:10:17] RECOVERY - Apache HTTP on mwdebug2002 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes 
in 0.151 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:10:21] tgr_: so can i sync it everywhere? [18:10:25] probably some Apache setting overriding it? [18:10:26] yeah [18:10:40] I'm not confident it will do anything at all, but it can't hurt [18:11:16] syncing [18:11:58] canaries passed [18:12:00] I couldn't trigger the bug that Reedy has reported in beta with WebRequest not found. Maybe it's CLI only, or depends on where exactly the exception happens. I can test that later. [18:12:06] !log hashar@deploy1001 Synchronized php-1.36.0-wmf.10/includes/HeaderCallback.php: Preload class used in HeaderCallback - T261260 (duration: 01m 01s) [18:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:12] T261260: Strange secondary error "Class 'WebRequest' not found" in logs after errors like "extension.json is not a valid JSON file" - https://phabricator.wikimedia.org/T261260 [18:13:13] (03CR) 10Hashar: [C: 03+2] "wmf.11 is not deployed yet but the patch has been rolled to wmf.10 a minute ago." [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/632681 (https://phabricator.wikimedia.org/T261260) (owner: 10Jforrester) [18:13:49] (03CR) 10Hashar: [C: 03+2] "rolled to wmf.10." [core] (wmf/1.36.0-wmf.12) - 10https://gerrit.wikimedia.org/r/632679 (https://phabricator.wikimedia.org/T261260) (owner: 10Jforrester) [18:13:51] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10Tsevener) It's not difficult but it does take a couple of days before it'll... [18:14:45] tgr_: nothing surprising on the log side. Thank you! [18:15:05] ryankemper: we rolled our hotfix. You can now do the config change you have scheduled :] [18:15:19] cool, getting ready [18:15:21] thank you for your wait! 
[18:16:13] (03CR) 10Bstorm: [C: 03+1] "When I applied via cherry-pick to toolsbeta, this was a noop." [puppet] - 10https://gerrit.wikimedia.org/r/630589 (owner: 10Jbond) [18:18:07] (03CR) 10Ryan Kemper: [C: 03+2] Revert "Revert "cloudelastic: envoy sits in front now"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632683 (owner: 10Ryan Kemper) [18:19:01] (03Merged) 10jenkins-bot: Revert "Revert "cloudelastic: envoy sits in front now"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632683 (owner: 10Ryan Kemper) [18:20:20] !log (backport) HEAD set to 834b4571f978674162fa805906e665e35ac68e27 as expected [18:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:40] !log `scap pull`ed onto `mwdebug1002`. Talking to cloudelastic on localhost (which routes thru envoy), 6105 is `cloudelastic-chi-eqiad`, 6106 is `cloudelastic-omega-eqiad`, and 6107 is `cloudelastic-psi-eqiad` as expected [18:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:03] !log `scap pull`ed onto `mwdebug2001`; talking to cloudelastic via mediawiki from codfw has the expected decrease in latency due to the tls connection pooling [18:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:53] !log Above tests are as expected, syncing changes everywhere: `scap sync-file wmf-config/ProductionServices.php 'Config: [[gerrit:632683|cloudelastic: envoy sits in front now (T263073)]]'` [18:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:59] T263073: Large, steady increase in unprocessed cloudelastic job.cirrusSearchElasticaWrite messages - https://phabricator.wikimedia.org/T263073 [18:30:12] !log ryankemper@deploy1001 Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:632683|cloudelastic: envoy sits in front now (T263073)]] (duration: 00m 58s) [18:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:38] 
!log search team's backport deploy is complete [18:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:46] (03PS1) 10Ahmon Dancy: wmf-config/env.php: Add dcs and servicesFile info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632769 [18:36:27] (03Merged) 10jenkins-bot: Preload class used in HeaderCallback [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/632681 (https://phabricator.wikimedia.org/T261260) (owner: 10Jforrester) [18:36:34] (03Merged) 10jenkins-bot: Preload class used in HeaderCallback [core] (wmf/1.36.0-wmf.12) - 10https://gerrit.wikimedia.org/r/632679 (https://phabricator.wikimedia.org/T261260) (owner: 10Jforrester) [18:46:55] (03PS2) 10CRusnov: diffscan.py: Port to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/630703 (https://phabricator.wikimedia.org/T247364) [18:49:41] (03PS1) 10Joal: Add analytics data purge for webrequest sequence stats [puppet] - 10https://gerrit.wikimedia.org/r/632773 (https://phabricator.wikimedia.org/T262826) [18:52:56] (03CR) 10Krinkle: [C: 03+1] wmf-config/env.php: Add dcs and servicesFile info (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632769 (owner: 10Ahmon Dancy) [18:56:09] (03CR) 10Bstorm: [C: 04-1] "Actually, Bryan and I have decided that we were wrong. 
We did a check around, and it seems that we still need the base diamond collectors " [puppet] - 10https://gerrit.wikimedia.org/r/632471 (https://phabricator.wikimedia.org/T210993) (owner: 10Muehlenhoff) [18:56:15] (03CR) 10Joal: "This should represent a few hundred thousand files drop :)" [puppet] - 10https://gerrit.wikimedia.org/r/632773 (https://phabricator.wikimedia.org/T262826) (owner: 10Joal) [18:56:58] (03CR) 10Bstorm: [C: 04-1] "https://phabricator.wikimedia.org/T264920" [puppet] - 10https://gerrit.wikimedia.org/r/632471 (https://phabricator.wikimedia.org/T210993) (owner: 10Muehlenhoff) [19:00:04] hashar and marxarelli: Dear deployers, time to do the Mediawiki train - European+American Version (secondary timeslot) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201007T1900). [19:10:51] (03PS1) 10HMonroy: Enable watchlist expiry feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632779 (https://phabricator.wikimedia.org/T264780) [19:16:55] (03PS1) 10RobH: updating with more skus [software] - 10https://gerrit.wikimedia.org/r/632781 [19:18:01] (03CR) 10RobH: [C: 03+2] updating with more skus [software] - 10https://gerrit.wikimedia.org/r/632781 (owner: 10RobH) [19:20:27] (03Abandoned) 10Dzahn: labs_bootstrapvz: remove diamond from lists of installed packages [puppet] - 10https://gerrit.wikimedia.org/r/632569 (https://phabricator.wikimedia.org/T210993) (owner: 10Dzahn) [19:22:42] (03CR) 10Dzahn: "ACK. well.. the parent task for this has had some comments and a -1 from Brooke, so adding her here as well. 
Seems to me we can already rem" [puppet] - 10https://gerrit.wikimedia.org/r/632570 (https://phabricator.wikimedia.org/T210993) (owner: 10Dzahn) [19:24:56] (03PS2) 10Dzahn: Add Inari Sami (smn) language to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/632747 (https://phabricator.wikimedia.org/T264859) (owner: 10Gerrit maintenance bot) [19:25:58] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "Gerrit trivia: smn is a Sami language spoken by the Inari Sami of Finland. approved by langcom." [dns] - 10https://gerrit.wikimedia.org/r/632747 (https://phabricator.wikimedia.org/T264859) (owner: 10Gerrit maintenance bot) [19:26:53] (03CR) 10Dzahn: [V: 03+1 C: 03+2] Add Inari Sami (smn) language to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/632747 (https://phabricator.wikimedia.org/T264859) (owner: 10Gerrit maintenance bot) [19:27:18] (03PS3) 10Dzahn: Add Inari Sami (smn) language to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/632747 (https://phabricator.wikimedia.org/T264859) (owner: 10Gerrit maintenance bot) [19:36:13] !log blog post: The latest addition to our family of Wikimedia languages is "Inari Sami" with language code "smn". It is a Sami language spoken by the Inari Sami of Finland and has about 400 native speakers. It's in the Uralic language family. Wikipedia will be created in T264859. https://en.wikipedia.org/wiki/Inari_Sami | https://iso639-3.sil.org/code/smn | [19:36:18] https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Inari_Sami_2 [19:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:21] T264859: Create Inari Sámi Wikipedia - https://phabricator.wikimedia.org/T264859 [20:00:02] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10wiki_willy) Spoke to our technical Dell rep today, and followed up with an email. Hopefully there's an easy way to get it working. 
If not,... [20:00:04] chrisalbon and accraze: May I have your attention please! Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201007T2000) [20:05:13] (03PS22) 10Ryan Kemper: elasticsearch: Let spicerack handle wait for all write queues to clear [cookbooks] - 10https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239) [20:05:39] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@7fa787e]: airflow: update mjolnir configuration to reduce max training dataset [20:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:43] (03PS7) 10Ryan Kemper: cloudelastic: remove temporarily increased timeout [puppet] - 10https://gerrit.wikimedia.org/r/624237 (https://phabricator.wikimedia.org/T220625) [20:05:58] 10Operations, 10SRE-Access-Requests: Access to the Logstash for John Bolorinos - https://phabricator.wikimedia.org/T264918 (10Dzahn) [20:09:02] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@7fa787e]: airflow: update mjolnir configuration to reduce max training dataset (duration: 03m 23s) [20:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:34] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Gilles) I'm happy to work with the Traffic team to increase knowledge on web performance, which metrics matter and why, help you define your... 
[20:13:47] (03CR) 10Razzi: oozie: use admin groups to determine admin access (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi) [20:13:57] (03PS5) 10Razzi: oozie: use admin groups to determine admin access [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) [20:14:45] (03CR) 10Ryan Kemper: [C: 03+2] elasticsearch: Let spicerack handle wait for all write queues to clear [cookbooks] - 10https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239) (owner: 10Ryan Kemper) [20:15:23] (03CR) 10Ryan Kemper: [C: 03+2] cloudelastic: remove temporarily increased timeout [puppet] - 10https://gerrit.wikimedia.org/r/624237 (https://phabricator.wikimedia.org/T220625) (owner: 10Ryan Kemper) [20:34:11] (03PS1) 10Gergő Tisza: Log when SessionManager is emitting cookies [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/632685 (https://phabricator.wikimedia.org/T264793) [20:34:51] (03PS1) 10Gergő Tisza: Log when SessionManager is emitting cookies [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/632806 (https://phabricator.wikimedia.org/T264793) [20:35:19] (03PS1) 10Gergő Tisza: Log when SessionManager is emitting cookies [core] (wmf/1.36.0-wmf.12) - 10https://gerrit.wikimedia.org/r/632807 (https://phabricator.wikimedia.org/T264793) [20:48:59] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:52:44] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (10leila) @Kormat thanks. I just confirmed the ssh key through the slack message I had received about it. 
[20:56:48] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-reboot [20:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:13] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 48 threshold =0.15 breach: delayed_unassigned_shards: 0, active_shards_percent_as_number: 52.0, cluster_name: relforge-eqiad, number_of_nodes: 1, number_of_data_nodes: 1, timed_out: False, active_shards: 52, active_primary_shards: 52, relocating_shards: 0, task_max_waiting_in_queue_millis: 0, number_of_pen [21:02:13] itializing_shards: 0, status: red, number_of_in_flight_fetch: 0, unassigned_shards: 48 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:05:37] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: number_of_pending_tasks: 0, unassigned_shards: 0, number_of_nodes: 2, active_shards: 104, relocating_shards: 0, number_of_in_flight_fetch: 0, timed_out: False, cluster_name: relforge-eqiad, status: green, initializing_shards: 0, delayed_unassigned_shards: 0, number_of_data_nodes: 2, active_primary_shards: 83, [21:05:37] cent_as_number: 100.0, task_max_waiting_in_queue_millis: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:10:53] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1001 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:14:54] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=0) [21:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:57] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, number_of_data_nodes: 2, active_shards_percent_as_number: 100.0, initializing_shards: 0, timed_out: False, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, active_primary_shards: 83, status: green, unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, active_shards: [21:15:57] hards: 0, number_of_in_flight_fetch: 0, number_of_nodes: 2 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:37:48] (03CR) 10Dmaza: [C: 03+1] Enable watchlist expiry feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632779 (https://phabricator.wikimedia.org/T264780) (owner: 10HMonroy) [21:39:16] (03PS1) 10Gergő Tisza: Enable logging of session cookie changes in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632795 (https://phabricator.wikimedia.org/T264793) [21:39:20] (03PS1) 10Gergő Tisza: Enable logging of session cookie changes in group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632796 (https://phabricator.wikimedia.org/T264793) [21:39:23] (03PS1) 10Gergő Tisza: Enable logging of session cookie changes everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632797 (https://phabricator.wikimedia.org/T264793) [21:48:37] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-reboot [21:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:35] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=99) [21:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:21] PROBLEM - 
CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [22:03:41] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:03:45] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:03:55] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:03:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_restbase_cluster_eqiad,swagger_check_wikifeeds_codfw} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:04:21] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image dat [22:04:21] 016) timed out before a response was received: 
/{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/news (get In the News content) timed out before a respo [22:04:21] https://wikitech.wikimedia.org/wiki/Wikifeeds [22:04:29] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:04:31] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:04:39] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:05:19] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:05:23] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:05:39] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:05:55] uhhh [22:06:09] RECOVERY - restbase 
endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:06:11] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [22:06:29] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:06:33] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikifeeds_4101: Servers kubernetes2007.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:07:17] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:07:41] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [22:07:53] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [22:08:01] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:08:07] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:08:15] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:08:37] looks like T264821 again [22:08:37] T264821: Hourly read spikes against s8 resulting in 
occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 [22:08:55] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:09:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:09:11] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [22:09:33] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:10:10] that is, T264881 [22:10:10] T264881: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 [22:28:05] (03CR) 10DannyS712: Enable watchlist expiry feature (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632779 (https://phabricator.wikimedia.org/T264780) (owner: 10HMonroy) [22:28:11] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:00:01] (03PS2) 10Ahmon Dancy: wmf-config/env.php: Add dcs and servicesFile info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632769 [23:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201007T2300). 
[23:00:04] tgr: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:42] what is the status of wmf.12? should we backport (foreport?) patches there? [23:00:50] (03CR) 10Ahmon Dancy: wmf-config/env.php: Add dcs and servicesFile info (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632769 (owner: 10Ahmon Dancy) [23:03:03] (03CR) 10Gergő Tisza: [C: 03+2] Log when SessionManager is emitting cookies [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/632685 (https://phabricator.wikimedia.org/T264793) (owner: 10Gergő Tisza) [23:03:38] (03CR) 10Gergő Tisza: [C: 03+2] Log when SessionManager is emitting cookies [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/632806 (https://phabricator.wikimedia.org/T264793) (owner: 10Gergő Tisza) [23:03:45] (03PS2) 10HMonroy: Enable watchlist expiry feature on four wikis from group 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632779 (https://phabricator.wikimedia.org/T264780) [23:05:09] (03CR) 10HMonroy: Enable watchlist expiry feature on four wikis from group 2 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632779 (https://phabricator.wikimedia.org/T264780) (owner: 10HMonroy) [23:08:02] I guess it's OK to merge at least since that was done for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/632679 [23:08:18] (03CR) 10Gergő Tisza: [C: 03+2] Log when SessionManager is emitting cookies [core] (wmf/1.36.0-wmf.12) - 10https://gerrit.wikimedia.org/r/632807 (https://phabricator.wikimedia.org/T264793) (owner: 10Gergő Tisza) [23:09:46] (03CR) 10Cwhite: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/632738 (https://phabricator.wikimedia.org/T148976) (owner: 10Cwhite) [23:32:46] (03CR) 10jenkins-bot: [V: 04-1] Log when SessionManager is emitting cookies [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/632685 
(https://phabricator.wikimedia.org/T264793) (owner: 10Gergő Tisza) [23:40:01] (03CR) 10Gergő Tisza: [C: 03+2] "Force-merge due to flaky selenium test" [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/632685 (https://phabricator.wikimedia.org/T264793) (owner: 10Gergő Tisza) [23:40:25] (03CR) 10Gergő Tisza: [V: 03+2 C: 03+2] Log when SessionManager is emitting cookies [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/632685 (https://phabricator.wikimedia.org/T264793) (owner: 10Gergő Tisza) [23:47:56] (03CR) 10Gergő Tisza: [C: 03+2] Enable logging of session cookie changes in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632795 (https://phabricator.wikimedia.org/T264793) (owner: 10Gergő Tisza) [23:48:55] (03Merged) 10jenkins-bot: Enable logging of session cookie changes in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632795 (https://phabricator.wikimedia.org/T264793) (owner: 10Gergő Tisza) [23:50:31] (03CR) 10BryanDavis: [C: 03+1] "Untested, but the changes LGTM" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/632582 (https://phabricator.wikimedia.org/T263339) (owner: 10Bstorm) [23:55:57] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-restart [23:55:57] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-restart (exit_code=0) [23:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:11] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-reboot [23:56:11] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=0) [23:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:51] !log tgr@deploy1001 Synchronized php-1.36.0-wmf.10/includes/session: Backport: 
[[gerrit:632685|Log when SessionManager is emitting cookies (T264793)]] (duration: 01m 00s) [23:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:57] T264793: Make sure SessionManager emitting Set-Cookie headers gets logged - https://phabricator.wikimedia.org/T264793