[00:00:04] <jouncebot>	 twentyafterfour: It is that lovely time of the day again! You are hereby commanded to deploy Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200625T0000).
[00:00:41] <wikibugs>	 (CR) Volans: [C: +2] scripts: unset the face too in the offline script [software/netbox-extras] - https://gerrit.wikimedia.org/r/607644 (owner: Volans)
[00:02:13] <wikibugs>	 (PS6) Dzahn: jenkins: replace system user/group with systemd-sysuser [puppet] - https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591)
[00:02:38] <wikibugs>	 (CR) Dzahn: jenkins: replace system user/group with systemd-sysuser (2 comments) [puppet] - https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn)
[00:03:44] <wikibugs>	 (CR) DannyS712: "Can the changes (or at least the highlights) be noted in the commit message?" [mediawiki-config] - https://gerrit.wikimedia.org/r/607605 (owner: Reedy)
[00:03:46] <wikibugs>	 (PS7) Dzahn: jenkins: replace system user/group with systemd-sysuser [puppet] - https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591)
[00:05:25] <wikibugs>	 (CR) Dzahn: [C: -1] "just like https://gerrit.wikimedia.org/r/c/operations/puppet/+/606287  this still fails in general and needs Chris Danis' fix at https://g" [puppet] - https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn)
[00:05:32] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[00:07:12] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[00:07:39] <mutante>	 these sometimes happen for a very short time, i just rescheduled the icinga check
[00:09:59] <wikibugs>	 Operations, Analytics, netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (faidon) To add to the above, I'm also wondering how difficult it would be to also include AS *names*, e.g. coming from the MaxMind GeoIP ASN database. I think we've use...
[00:10:47] <wikibugs>	 (CR) Dzahn: Add initial puppetization for libraryupgrader (1 comment) [puppet] - https://gerrit.wikimedia.org/r/607452 (https://phabricator.wikimedia.org/T173478) (owner: Legoktm)
[00:11:20] <wikibugs>	 (CR) Dzahn: Add initial puppetization for libraryupgrader (1 comment) [puppet] - https://gerrit.wikimedia.org/r/607452 (https://phabricator.wikimedia.org/T173478) (owner: Legoktm)
[00:11:27] <twentyafterfour>	 !log updating phabricator to release/2020-06-25/1, momentary (<1 minute) downtime expected.
[00:11:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:18] <twentyafterfour>	 !log phabricator updated, all seems normal 
[00:12:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:42] <mutante>	 thanks twentyafterfour, looks normal indeed
[00:12:49] <legoktm>	 mutante: if you think the systemd-sysuser thing will be fixed soon I don't mind waiting a bit, otherwise I can switch it back
[00:14:44] <mutante>	 legoktm: uhm.. switch it back and i will replace it again later
[00:14:55] <mutante>	 let's merge that
[00:15:13] <legoktm>	 ok
[00:18:32] <mutante>	 legoktm: is it possible to define a source for connections to port 3002 or does it need to be from anywhere?
[00:18:33] <wikibugs>	 (PS4) Legoktm: Add initial puppetization for libraryupgrader [puppet] - https://gerrit.wikimedia.org/r/607452 (https://phabricator.wikimedia.org/T173478)
[00:18:35] <wikibugs>	 (PS1) Legoktm: libraryupgrader: Switch to systemd-sysusers [puppet] - https://gerrit.wikimedia.org/r/607646
[00:19:06] <legoktm>	 mutante: the source would be the cloud dynamic-proxy
[00:20:29] <mutante>	 legoktm: ah, we just did the same for phab in cloud. we can use $CACHES
[00:20:43] <mutante>	 i can add that in another change
[00:22:23] <legoktm>	 ok
[00:22:33] <legoktm>	 we can do the same for codesearch too then
[00:25:30] <mutante>	 legoktm: first i meant this kind of thing: https://gerrit.wikimedia.org/r/c/operations/puppet/+/606255  but it only becomes interesting once we are actually in both prod and cloud.. so it's easier..amending to yours
[00:27:09] <wikibugs>	 (PS5) Dzahn: Add initial puppetization for libraryupgrader [puppet] - https://gerrit.wikimedia.org/r/607452 (https://phabricator.wikimedia.org/T173478) (owner: Legoktm)
[00:27:17] <legoktm>	 mutante: ah, gotcha. fwiw port 3002 is arbitrary, we can switch to something more standard if that's easier. I just picked that because it's unprivileged and I already kept it open on my laptop
[00:29:04] <wikibugs>	 (PS6) Dzahn: Add initial puppetization for libraryupgrader [puppet] - https://gerrit.wikimedia.org/r/607452 (https://phabricator.wikimedia.org/T173478) (owner: Legoktm)
[00:29:58] <mutante>	 legoktm: 3002 is fine. it's not anything well-known and doesn't really matter for the code
[00:31:07] <mutante>	 so i am doing simply    srange => '$CACHES' and it should work
[00:31:33] <wikibugs>	 (CR) Dzahn: [C: +2] Add initial puppetization for libraryupgrader [puppet] - https://gerrit.wikimedia.org/r/607452 (https://phabricator.wikimedia.org/T173478) (owner: Legoktm)
[00:32:35] <mutante>	 legoktm: do you want to add it on codesearch6, should I? later?
[00:33:00] <legoktm>	 you mean upgrader07? I can do that in a few minutes
[00:33:43] <mutante>	 heh, yea:)
[00:33:55] <mutante>	 ok
[00:35:19] <wikibugs>	 (CR) Dzahn: "PS6: limited source range for connections to port 3002 to $CACHES." [puppet] - https://gerrit.wikimedia.org/r/607452 (https://phabricator.wikimedia.org/T173478) (owner: Legoktm)
[00:35:44] <mutante>	 ACCEPT     tcp  --  proxy-01.project-proxy.eqiad1.wikimedia.cloud  anywhere             tcp dpt:http
[00:35:47] <mutante>	 ACCEPT     tcp  --  proxy-02.project-proxy.eqiad1.wikimedia.cloud  anywhere             tcp dpt:http
[00:36:09] <mutante>	 legoktm: ^ this is what i am expecting you should see in the end in iptables -L
[00:36:29] <mutante>	 that is looking at a phab-in-cloud instance where we used $CACHES as well
[00:37:50] <wikibugs>	 (PS1) Dzahn: codesearch: limit connections to port3002 to $CACHES [puppet] - https://gerrit.wikimedia.org/r/607647
[00:38:28] <wikibugs>	 (PS2) Dzahn: codesearch: limit connections to port 3002 to $CACHES [puppet] - https://gerrit.wikimedia.org/r/607647
[00:39:21] <wikibugs>	 (PS2) Ssingh: prometheus: update scheme for wikidough (improves ab8a948a) [puppet] - https://gerrit.wikimedia.org/r/607570
[00:41:36] <wikibugs>	 (CR) Ssingh: "No code change for patch set 2; I have just updated the commit message." [puppet] - https://gerrit.wikimedia.org/r/607570 (owner: Ssingh)
[00:43:12] <legoktm>	 mutante: Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'profile::libraryupgrader::base_dir' (file: /etc/puppet/modules/profile/manifests/libraryupgrader.pp, line: 1) on node upgrader-07.library-upgrader.eqiad.wmflabs
[00:43:42] <Reedy>	 twentyafterfour: https://phabricator.wikimedia.org/T256343 is that after the update?
[00:43:57] <Reedy>	 Source file "/srv/deployment/phabricator/deployment-cache/revs/4547f31de8f69854e0cd9d3e0a802ce517360ee0/phabricator/src/applications/legalpad/conduit/LegalpadSignatureSearchConduitAPIMethod.php" failed to load.
[00:44:31] <twentyafterfour>	 Reedy: looks like it
[00:44:34] <twentyafterfour>	 fixing...
[00:45:10] <mutante>	 legoktm: uhm.. is the name of the project in horizon actually "libraryupgrader" ?
[00:45:21] <legoktm>	 oh uh, no
[00:45:24] <legoktm>	 it's library-upgrader
[00:45:44] <mutante>	 we gotta move the yaml file around then
[00:46:10] <mutante>	 hieradata/cloud/eqiad1/$project_name/common.yaml
[00:46:25] <legoktm>	 ohhh, my bad. I'll submit a patch
[00:48:23] <wikibugs>	 (PS1) Legoktm: Fix location of library-upgrader hieradata [puppet] - https://gerrit.wikimedia.org/r/607648
[00:48:29] <legoktm>	 mutante: ^
[00:48:40] <twentyafterfour>	 !log restart php-fpm on phab1001 to fix T256343
[00:48:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:45] <stashbot>	 T256343: Unable to Access the Repository - https://phabricator.wikimedia.org/source/tool-commons-android-app/ - https://phabricator.wikimedia.org/T256343
[00:48:56] <wikibugs>	 (CR) Dzahn: [C: +2] Fix location of library-upgrader hieradata [puppet] - https://gerrit.wikimedia.org/r/607648 (owner: Legoktm)
[00:49:45] <mutante>	 legoktm: synced on prod puppetmasters
[00:52:30] <legoktm>	 puppet is running!
[00:54:53] <mutante>	 great
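An aside on the hieradata mix-up above: a minimal Python sketch, assuming (from the path template mutante pasted at 00:46) that per-project settings are read from a directory named exactly after the Horizon project. The helper name `project_hiera_path` is hypothetical, and the real Puppet lookup hierarchy is more involved than a single path; this only illustrates why a file stored under `libraryupgrader` is not found for the project `library-upgrader`.

```python
from pathlib import PurePosixPath


def project_hiera_path(project_name: str) -> PurePosixPath:
    """Build the per-project common.yaml path from the Horizon project name.

    Hypothetical helper: mirrors the template
    hieradata/cloud/eqiad1/$project_name/common.yaml quoted in the log.
    """
    return PurePosixPath("hieradata/cloud/eqiad1") / project_name / "common.yaml"


# Where the file was originally committed versus where the lookup looks.
committed = project_hiera_path("libraryupgrader")
expected = project_hiera_path("library-upgrader")

# The two paths differ by one hyphen, so the key
# profile::libraryupgrader::base_dir is never found for the real project.
print(committed == expected)
```

The project name has to match character for character, which is why the fix was to move the YAML file rather than change any key inside it.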
[00:55:22] <mutante>	 legoktm: when it's done, do an 'iptables -L | grep http' or so
[00:56:19] <legoktm>	 legoktm@upgrader-07:~$ iptables -L | grep http
[00:56:19] <legoktm>	 -bash: iptables: command not found
[00:56:32] <legoktm>	 there's an iptables-xml ?
[00:57:45] <legoktm>	 oh
[00:57:47] <legoktm>	 it's in sbin
[00:57:48] <mutante>	 legoktm: as root
[00:57:58] <mutante>	 well, or that
[00:58:29] <legoktm>	 uh, no output under http but
[00:58:36] <legoktm>	 root@upgrader-07:~# iptables -L | grep 3002
[00:58:36] <legoktm>	 ACCEPT     tcp  --  proxy-01.project-proxy.eqiad1.wikimedia.cloud  anywhere             tcp dpt:3002
[00:58:36] <legoktm>	 ACCEPT     tcp  --  proxy-02.project-proxy.eqiad1.wikimedia.cloud  anywhere             tcp dpt:3002
[00:58:36] <legoktm>	 ACCEPT     tcp  --  deployment-cache-text06.deployment-prep.eqiad1.wikimedia.cloud  anywhere             tcp dpt:3002
[00:58:39] <mutante>	 sorry, 3002
[00:59:08] <mutante>	 there are the 2 proxies and that one deployment-cache server for some reason
[00:59:15] <mutante>	 but this is what i expected, yep
[00:59:19] <mutante>	 it should work 
[00:59:56] <mutante>	 that is what $CACHES means in cloud, while it means "all the cp* servers" in prod
[01:00:21] <mutante>	 there doesn't have to be a $realm check or anything this way
[01:00:23] <legoktm>	 awesome :D next step for libraryupgrader is to migrate the systemd units over to puppet, I'll probably spend some time tonight on that
[01:00:33] <wikibugs>	 (CR) Legoktm: [C: +1] codesearch: limit connections to port 3002 to $CACHES [puppet] - https://gerrit.wikimedia.org/r/607647 (owner: Dzahn)
[01:00:49] <wikibugs>	 (CR) Dzahn: [C: +2] codesearch: limit connections to port 3002 to $CACHES [puppet] - https://gerrit.wikimedia.org/r/607647 (owner: Dzahn)
[01:02:50] <legoktm>	 mutante: thank you for all the help so far :)
[01:03:43] <mutante>	 legoktm: you're welcome, talk to you later then
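An aside on the firewall check above: the manual `iptables -L | grep 3002` step could be sketched in Python roughly as follows. This assumes the column layout seen in the output pasted at 00:58 (target, protocol, options, source, destination, then match details); the function name `sources_for_port` is hypothetical and not part of any tool mentioned in the log.

```python
def sources_for_port(iptables_output: str, port: str) -> list[str]:
    """Return the source hosts of ACCEPT rules matching a destination port.

    `port` is compared against the `dpt:` token exactly as iptables prints
    it, so it can be a number ("3002") or a service name ("http").
    """
    sources = []
    for line in iptables_output.splitlines():
        fields = line.split()
        # Rule lines look like: target prot opt source destination match...
        if len(fields) >= 6 and fields[0] == "ACCEPT" and f"dpt:{port}" in fields[5:]:
            sources.append(fields[3])
    return sources


# Sample taken from the output pasted in the log.
sample = """\
ACCEPT     tcp  --  proxy-01.project-proxy.eqiad1.wikimedia.cloud  anywhere             tcp dpt:3002
ACCEPT     tcp  --  proxy-02.project-proxy.eqiad1.wikimedia.cloud  anywhere             tcp dpt:3002
"""

for host in sources_for_port(sample, "3002"):
    print(host)
```

Matching on the `dpt:` token rather than a bare substring avoids the false start seen in the log, where grepping for `http` returned nothing because the rule was printed with the numeric port.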
[01:26:30] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 24038976 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:28:16] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 783376 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:33:01] <wikibugs>	 Operations, Commons, MediaWiki-File-management, Traffic: Cached thumbnails and originals are sometimes not being purged correctly/quickly - https://phabricator.wikimedia.org/T256313 (King_of_Hearts) Yet another one, currently still broken as of time of writing: https://upload.wikimedia.org/wikipe...
[01:41:26] <wikibugs>	 Operations, Commons, MediaWiki-File-management, SRE-swift-storage, Traffic: Cached thumbnails and originals are sometimes not being purged correctly/quickly - https://phabricator.wikimedia.org/T256313 (Reedy)
[03:08:33] <wikibugs>	 (CR) Bmansurov: "> May I ask what the configuration changes were? We probably need to amend the chart to account for those." [deployment-charts] - https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: Bmansurov)
[04:17:29] <wikibugs>	 (PS1) Marostegui: db2120: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/607660
[04:22:13] <wikibugs>	 (CR) Marostegui: [C: +2] db2120: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/607660 (owner: Marostegui)
[04:25:23] <marostegui>	 !log Deploy schema change on s2 codfw - T238966
[04:25:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:25:28] <stashbot>	 T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966
[04:26:04] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 2 (releases1002, ...), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:26:29] <marostegui>	 !log Remove triggers from db2095:3312 - T238966
[04:26:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:37:12] <wikibugs>	 (PS4) Marostegui: mariadb: Promote db1097 to m1 master [puppet] - https://gerrit.wikimedia.org/r/606953 (https://phabricator.wikimedia.org/T254556)
[04:57:59] <wikibugs>	 Operations, DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Marostegui) Anything else left here after the 100% repool or we can close this? Thank you!
[05:25:32] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 52 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:29:52] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: 
[05:29:52] <icinga-wm>	  most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[05:30:52] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:35:24] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[05:45:56] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:51:14] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:56:01] <wikibugs>	 (CR) Elukey: hadoop - Add change-distro.py and stop-cluster.py (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) (owner: Elukey)
[05:59:05] <wikibugs>	 (PS11) Elukey: hadoop - Add change-distro.py and stop-cluster.py [cookbooks] - https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499)
[05:59:14] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: 
[05:59:15] <icinga-wm>	  most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:03:47] <elukey>	 !log reboot an-airflow1001 for kernel upgrades 
[06:03:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:42] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:12:26] <icinga-wm>	 PROBLEM - Check systemd state on an-airflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:14:52] <elukey>	 ifup for ens5 fails - RTNETLINK answers: File exists
[06:15:28] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: 
[06:15:28] <icinga-wm>	  most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:17:16] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:19:34] <icinga-wm>	 RECOVERY - Check systemd state on an-airflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:19:55] <elukey>	 !log execute ip addr flush ens5 on an-airflow1001 to clear RTNETLINK answers: File exists (error from ifup@ens5.service)
[06:19:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:28] <elukey>	 !log reboot analytics-tool1001 for kernel upgrades
[06:22:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:54] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Scheduled maintenance TTN-0004068701 - The acknowledgement expires at: 2020-06-25 10:22:27. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:22:54] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Scheduled maintenance TTN-0004068701 - The acknowledgement expires at: 2020-06-25 10:22:27. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:23:16] <elukey>	 !log reboot analytics-tool1004 for kernel upgrades (Superset host)
[06:23:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:40] <elukey>	 !log reboot an-tool* vms for kernel upgrades
[06:24:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:08] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: 
[06:28:08] <icinga-wm>	  most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:28:14] <icinga-wm>	 PROBLEM - Check systemd state on an-tool1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:29:14] <icinga-wm>	 PROBLEM - Check size of conntrack table on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:29:26] <icinga-wm>	 PROBLEM - Check size of conntrack table on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:29:40] <icinga-wm>	 PROBLEM - ores uWSGI web app on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores
[06:30:08] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[06:30:08] <icinga-wm>	 PROBLEM - ores uWSGI web app on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores
[06:30:14] <icinga-wm>	 PROBLEM - Check systemd state on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:30:31] <elukey>	 ah this is ores dying for the logrotate
[06:30:44] <icinga-wm>	 PROBLEM - Check systemd state on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:31:21] <elukey>	 !log force puppet run on ores1003/1005 to restore celery (killed by the oom)
[06:31:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:40] <icinga-wm>	 PROBLEM - puppet last run on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:32:00] <icinga-wm>	 RECOVERY - Check systemd state on ores1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:32:32] <icinga-wm>	 RECOVERY - Check systemd state on ores1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:32:50] <icinga-wm>	 RECOVERY - Check size of conntrack table on ores1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:33:00] <icinga-wm>	 RECOVERY - Check size of conntrack table on ores1005 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:34:32] <elukey>	 !log reboot archiva for kernel upgrades
[06:34:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:04] <elukey>	 !log reboot archiva1002 (new vm, not yet in service) for kernel upgrades
[06:35:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:08] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:37:28] <icinga-wm>	 RECOVERY - puppet last run on ores1003 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:40:44] <elukey>	 !log reboot matomo1002 for kernel upgrades
[06:40:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:10] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: 
[06:46:10] <icinga-wm>	  most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:49:44] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:49:50] <wikibugs>	 (CR) Elukey: [C: -1] hadoop - Add change-distro.py and stop-cluster.py (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) (owner: Elukey)
[06:55:24] <wikibugs>	 (PS12) Elukey: hadoop - Add change-distro.py and stop-cluster.py [cookbooks] - https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499)
[06:56:18] <wikibugs>	 (CR) jerkins-bot: [V: -1] hadoop - Add change-distro.py and stop-cluster.py [cookbooks] - https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) (owner: Elukey)
[06:57:02] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: 
[06:57:02] <icinga-wm>	  most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[07:00:54] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ores1005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:02:51] <wikibugs>	 (PS13) Elukey: hadoop - Add change-distro.py and stop-cluster.py [cookbooks] - https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499)
[07:03:16] <wikibugs>	 Operations, Commons, MediaWiki-File-management, SRE-swift-storage, Traffic: Cached thumbnails and originals are sometimes not being purged correctly/quickly - https://phabricator.wikimedia.org/T256313 (Aklapper)
[07:07:58] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[07:08:50] <marostegui>	 !log Start pre switchover steps on m1 T254556
[07:08:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:55] <stashbot>	 T254556: Upgrade m1 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254556
[07:14:25] <wikibugs>	 (CR) Elukey: [C: +2] "Time to test!" [cookbooks] - https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) (owner: Elukey)
[07:15:12] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: 
[07:15:12] <icinga-wm>	  most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[07:17:16] <wikibugs>	 Operations, Continuous-Integration-Infrastructure, serviceops, Patch-For-Review: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (hashar) The server with `/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java` has been started by our systemd unit a...
[07:18:10] <elukey>	 !log reboot kafkamon* vms for kernel upgrades
[07:18:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:21] <elukey>	 this may generate some kafka lag alerts, hopefully not --^
[07:30:25] <wikibugs>	 (CR) Marostegui: mariadb: Promote db1097 to m1 master [puppet] - https://gerrit.wikimedia.org/r/606953 (https://phabricator.wikimedia.org/T254556) (owner: Marostegui)
[07:30:40] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db1097 to m1 master [puppet] - https://gerrit.wikimedia.org/r/606953 (https://phabricator.wikimedia.org/T254556) (owner: Marostegui)
[07:32:37] <wikibugs>	 (PS1) Elukey: camus: use refinery-camus-0.128 [puppet] - https://gerrit.wikimedia.org/r/607717
[07:32:49] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] Introduce kubernetes[12]01[56] [dns] - https://gerrit.wikimedia.org/r/607495 (https://phabricator.wikimedia.org/T256236) (owner: Alexandros Kosiaris)
[07:32:54] <wikibugs>	 (PS4) Alexandros Kosiaris: Introduce kubernetes[12]01[56] [dns] - https://gerrit.wikimedia.org/r/607495 (https://phabricator.wikimedia.org/T256236)
[07:33:36] <wikibugs>	 (CR) Joal: [C: +1] "LGTM! Thanks elukey" [puppet] - https://gerrit.wikimedia.org/r/607717 (owner: Elukey)
[07:36:25] <elukey>	 !log reboot an-launcher1001 for kernel upgrades
[07:36:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:50] <wikibugs>	 (CR) Elukey: [C: +2] camus: use refinery-camus-0.128 [puppet] - https://gerrit.wikimedia.org/r/607717 (owner: Elukey)
[07:37:06] <icinga-wm>	 RECOVERY - WDQS high update lag on wdqs1007 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.158e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:38:48] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001 job=burrow partition={0,1,2,4,5} site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+
[07:38:48] <icinga-wm>	 r-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[07:39:53] <wikibugs>	 Operations, DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat) Open→Resolved Nope, all done!
[07:41:22] <elukey>	 so the lag seems to have started before I rebooted kafkamon
[07:41:42] <elukey>	 Cc godog
[07:41:54] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:42:56] <wikibugs>	 (CR) Ayounsi: cumin: backup all of /srv where a lot of deployment state may live (1 comment) [puppet] - https://gerrit.wikimedia.org/r/607258 (owner: Jcrespo)
[07:47:26] <wikibugs>	 (CR) Hashar: "We should remove the PHP packages from the releases hosts." [puppet] - https://gerrit.wikimedia.org/r/607641 (https://phabricator.wikimedia.org/T247652) (owner: Dzahn)
[07:47:44] <wikibugs>	 (CR) Jcrespo: "So this is my plan based on the feedback:" [puppet] - https://gerrit.wikimedia.org/r/607258 (owner: Jcrespo)
[07:47:58] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm
[07:48:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:27] <logmsgbot>	 !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[07:48:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:36] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm
[07:48:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:01] <logmsgbot>	 !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[07:49:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:09] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm
[07:49:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:24] <logmsgbot>	 !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[07:49:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:32] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm
[07:49:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:40] <logmsgbot>	 !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[07:49:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:59] <kormat>	 how do i page traffic to say that akosiaris is DoSing this channel? ;)
[07:51:04] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[07:51:09] <volans>	 kormat: lol
[07:51:20] <volans>	 akosiaris: anything spicerack-related?
[07:52:08] <jynus>	 !log stop bacula-director on backup1001 for db maintenance T254556
[07:52:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:12] <stashbot>	 T254556: Upgrade m1 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254556
[07:54:44] <icinga-wm>	 PROBLEM - bacula director process on backup1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (bacula), command name bacula-dir https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:56:28] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[07:56:44] <icinga-wm>	 RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[07:57:23] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (10jcrespo) This is bacula when trying to backup releases2002: `lines=10 25-Jun 04:05 backup1001.eqiad.wmnet JobId...
[07:58:34] <jynus>	 the bacula alert is me, see log
[07:58:37] <jynus>	 will ack it
[08:00:04] <jouncebot>	 marostegui, jynus, and akosiaris: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for m1 database master failover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200625T0800).
[08:00:04] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[08:00:05] <wikibugs>	 (03PS1) 10Muehlenhoff: Extend access for PHPCC people [puppet] - 10https://gerrit.wikimedia.org/r/607720
[08:00:27] <marostegui>	 jynus akosiaris let's go?
[08:00:34] <jynus>	 I am here
[08:00:49] <ema>	 jynus: is the "Prometheus jobs reduced availability" alert for job={bacula,wikidough} site={codfw,eqiad} also related?
[08:00:55] <jynus>	 yeah
[08:01:00] <ema>	 ack, thanks
[08:01:03] <jynus>	 will check it later
[08:01:29] <marostegui>	 akosiaris ok from your side to go ahead?
[08:02:15] <jynus>	 ema: although that has been happening for 15 hours so maybe not
[08:02:57] <akosiaris>	 marostegui: yes
[08:03:03] <marostegui>	 ok, let's start then
[08:03:05] <marostegui>	 !log Failover m1 from db1135 to db1097 - T254556 
[08:03:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:09] <stashbot>	 T254556: Upgrade m1 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254556
[08:03:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Extend access for PHPCC people [puppet] - 10https://gerrit.wikimedia.org/r/607720 (owner: 10Muehlenhoff)
[08:03:52] <jynus>	 etherpad: upstream connect error or disconnect/reset before headers. reset reason: connection failure
[08:03:59] <marostegui>	 all done
[08:04:17] <akosiaris>	 etherpad logs show that things proceed as normal...
[08:04:23] <akosiaris>	 ah no
[08:04:27] <akosiaris>	 it just started logging exceptions
[08:04:28] <marostegui>	 mmm, one sec
[08:04:31] <jynus>	 An error occurred Please press and hold Ctrl and press F5 to reload this page
[08:04:38] <marostegui>	 yes, something failed
[08:04:39] <marostegui>	 checking
[08:04:54] <akosiaris>	 I 'll wait it out 30s before restarting it
[08:05:01] <marostegui>	 should be good now
[08:05:18] <jynus>	 etherpad still says upstream connect error or disconnect/reset before headers. reset reason: connection failure
[08:05:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: update scheme for wikidough (improves ab8a948a) [puppet] - 10https://gerrit.wikimedia.org/r/607570 (owner: 10Ssingh)
[08:05:25] <akosiaris>	 [2020-06-25 08:04:23.327] [ERROR] console - ERROR: Problem while initalizing the database
[08:05:26] <ema>	 jynus: I think it says it's been happening for 15 hours because that's when the wikidough issues started, while bacula started having issues only at 7:53ish
[08:05:39] <jynus>	 ema: makes sense
[08:05:51] <akosiaris>	 no activity on the logs
[08:05:55] <jynus>	 marostegui: what is the status mysql-wise
[08:06:00] <marostegui>	 it is all done
[08:06:02] <jynus>	 everything looking good on db and proxy?
[08:06:06] <marostegui>	 yeo
[08:06:08] <marostegui>	 yep
[08:06:14] <jynus>	 ok, so it is app side now, akosiaris
[08:06:36] <jynus>	 most likely the persistent connections
[08:06:38] <akosiaris>	 ok that means a restart is needed. It did restart on its own, but that for some reason did not help
[08:06:46] <akosiaris>	 2m17s ago
[08:06:49] <marostegui>	 jynus: I have killed connections on db1135
[08:07:05] <jynus>	 yeah, but app logic is sometimes strange
[08:07:11] <marostegui>	 yep
[08:07:14] <akosiaris>	 oh it's definitely in the app
[08:07:20] <jynus>	 let's confirm it works after app restart
[08:07:23] <akosiaris>	 I would be surprised if it wasn't
[08:07:26] <jynus>	 it could be something unexpected
[08:07:30] <icinga-wm>	 PROBLEM - etherpad_up reduced availability on icinga1001 is CRITICAL: 0 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:07:30] * akosiaris investigating a bit before restarting
[08:07:31] <marostegui>	 akosiaris: XD
[08:07:45] <jynus>	 marostegui: get screen captures of everything done
[08:08:11] <jynus>	 what other things are on m1?
[08:08:12] <akosiaris>	 hmm, it's not the app
[08:08:17] <akosiaris>	 it's dead alright
[08:08:20] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove access for nathante [puppet] - 10https://gerrit.wikimedia.org/r/607722
[08:08:21] <akosiaris>	 systemd isn't restarting it
[08:08:24] <marostegui>	 akosiaris: what does the error say?
[08:08:26] <icinga-wm>	 PROBLEM - Check systemd state on etherpad1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:08:27] <akosiaris>	 ok we found something
[08:08:35] <akosiaris>	    Active: failed (Result: exit-code) since Thu 2020-06-25 08:04:23 UTC; 3min 17s ago
[08:08:40] <marostegui>	 librenms does work
[08:08:54] <icinga-wm>	 PROBLEM - etherpad_lite_process_running on etherpad1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
[08:09:12] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Remove access for nathante [puppet] - 10https://gerrit.wikimedia.org/r/607722 (owner: 10Muehlenhoff)
[08:09:15] <akosiaris>	 for some reason systemd restart policy failed
[08:09:18] <icinga-wm>	 PROBLEM - etherpad.wikimedia.org HTTP on etherpad1002 is CRITICAL: connect to address 10.64.32.178 and port 9001: Connection refused https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
[08:09:22] <akosiaris>	 anyway I 'll restart and read up on it
[08:09:24] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: Degraded RAID on pc2007 - https://phabricator.wikimedia.org/T255904 (10Kormat) 05Open→03Resolved Array rebuild has completed, and is back in "optimal" state.
[08:09:31] <marostegui>	 rt also works
[08:09:31] <jynus>	 that is ok, that is why we gather info
[08:09:52] <icinga-wm>	 RECOVERY - MegaRAID on pc2007 is OK: OK: optimal, 1 logical, 4 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:09:54] <akosiaris>	 !log restart etherpad-lite on etherpad1002
[08:09:56] <marostegui>	 so it looks only etherpad specific
[08:09:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:12] <icinga-wm>	 RECOVERY - Check systemd state on etherpad1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:10:12] <marostegui>	 etherpad is back for me now
[08:10:15] <jynus>	 lets see what happens after restart
[08:10:26] <jynus>	 yep
[08:10:26] <marostegui>	 I can write fine on etherpad
[08:10:27] <akosiaris>	 yeah, something with the systemd unit. It should have tried to restart it again, but it didn't
[08:10:40] <icinga-wm>	 RECOVERY - etherpad_lite_process_running on etherpad1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
[08:10:41] <akosiaris>	 I think I need to tweak a bit the restart policy
[08:10:44] <jynus>	 to be fair, we had documented to restart etherpad, we agree for further testing
[08:10:50] <akosiaris>	 it does was Restart=always
[08:10:52] <jynus>	 *ed
[08:10:56] <akosiaris>	 s/was/have/
[08:11:00] <jynus>	 because it is nice to understand why
[08:11:06] <icinga-wm>	 RECOVERY - etherpad.wikimedia.org HTTP on etherpad1002 is OK: HTTP OK: HTTP/1.1 200 OK - 9000 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
[08:11:09] <wikibugs>	 (03CR) 10Volans: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/607258 (owner: 10Jcrespo)
[08:11:11] <akosiaris>	 but I think it died too much and systemd stopped trying
[08:11:23] <jynus>	 it could be
[08:11:39] <jynus>	 the problem with app logic is that how it reacts is complicated
[08:11:44] <akosiaris>	 Jun 25 08:04:23 etherpad1002 systemd[1]: etherpad-lite.service: Service RestartSec=100ms expired, scheduling restart.
[08:11:44] <akosiaris>	 Jun 25 08:04:23 etherpad1002 systemd[1]: etherpad-lite.service: Scheduled restart job, restart counter is at 7.
[08:11:44] <akosiaris>	 Jun 25 08:04:23 etherpad1002 systemd[1]: Stopped Etherpad-lite daemon.
[08:11:44] <akosiaris>	 Jun 25 08:04:23 etherpad1002 systemd[1]: etherpad-lite.service: Start request repeated too quickly.
[08:11:48] <jynus>	 specially with multiple threads
[08:11:49] <akosiaris>	 yup, that's it
[08:11:50] <jynus>	 and read only
[08:11:57] <akosiaris>	 threads?
[08:12:05] <akosiaris>	 this is a nodejs app we are talking about
[08:12:06] <jynus>	 e.g. db connections
[08:12:09] <akosiaris>	 no threads really here
[08:12:13] <jynus>	 at some poing
[08:12:16] <marostegui>	 hahaha
[08:12:16] <akosiaris>	 no multiple connections either
[08:12:18] <jynus>	 *point
[08:12:26] <jynus>	 one connection could see everything is all right
[08:12:33] <jynus>	 and another see it is in read only or down
[08:12:40] <jynus>	 and not react accordingly
[08:12:42] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[08:12:48] <moritzm>	 we can tweak RestartSec, the default is very low, 0.1 seconds or so
[08:12:51] <jynus>	 if things are stable
[08:12:52] <icinga-wm>	 RECOVERY - etherpad_up reduced availability on icinga1001 is OK: (C)0.8 le (W)0.9 le 1 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:12:59] <akosiaris>	 moritzm: yup, that's what I 'll do
[08:12:59] <jynus>	 I will start bacula
[08:13:07] <marostegui>	 jynus: go ahead
[08:13:09] <akosiaris>	 wikifeeds being rate limited?
[08:13:09] <jynus>	 ok
[08:13:15] <akosiaris>	 429? what's up with that?
[08:14:01] <jynus>	 !log restarting bacula-dir on backup1001
[08:14:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:23] <jynus>	 Info: /Stage[main]/Bacula::Director/Service[bacula-director]: Unscheduling refresh on Service[bacula-director]
[08:14:28] <marostegui>	 so, librenms, rt and racktables are working fine too
[08:14:32] <icinga-wm>	 RECOVERY - bacula director process on backup1001 is OK: PROCS OK: 1 process with UID = 112 (bacula), command name bacula-dir https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:14:42] <akosiaris>	 https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?orgId=1&refresh=1m&from=now-2d&to=now
[08:14:48] <akosiaris>	 interesting. requests doubled on the 24th
[08:14:51] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove access for nathante [puppet] - 10https://gerrit.wikimedia.org/r/607722
[08:15:09] <jynus>	 lets run a backup just to be sure
[08:15:39] <jynus>	 gerrit1001.wikimedia.org-Hourly-Sun-production-gerrit-repo-data is running
[08:15:41] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Remove access for nathante [puppet] - 10https://gerrit.wikimedia.org/r/607722 (owner: 10Muehlenhoff)
[08:16:01] <jynus>	 239104  Incr       2,153    98.53 M  OK       25-Jun-20 08:15 gerrit1001.wikimedia.org-Hourly-Sun-production-gerrit-repo-data
[08:16:04] <jynus>	 ran ok
[08:16:32] <akosiaris>	 https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?panelId=28&fullscreen&orgId=1&refresh=1m&from=now-2d&to=now
[08:16:34] <jynus>	 marostegui: anything weird on db connections / traffic to old servers?
[08:16:36] <akosiaris>	 wow, that's not good
[08:16:45] * akosiaris opening task
[08:17:14] <marostegui>	 jynus: nope it is empty and actually replication 10.4 -> 10.1 is not broken yet :)
[08:17:30] <jynus>	 I was about to ask
[08:17:55] <jynus>	 Query Throughput seems lower
[08:17:57] <jynus>	 on
[08:18:08] <jynus>	 https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=misc&var-shard=m1&var-role=All&from=1593051483625&to=1593073083625
[08:18:14] <wikibugs>	 (03PS3) 10Muehlenhoff: Remove access for nathante [puppet] - 10https://gerrit.wikimedia.org/r/607722
[08:18:57] <jynus>	 should we double check zarcillo or refresh prometheus?
[08:19:26] <marostegui>	 https://grafana.wikimedia.org/d/000000273/mysql?panelId=37&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1135&var-port=9104
[08:19:39] <marostegui>	 https://grafana.wikimedia.org/d/000000273/mysql?panelId=37&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1097&var-port=9104
[08:19:45] <jynus>	 open connections went from 30-41 to 21
[08:19:46] <marostegui>	 that looks good
[08:20:04] <jynus>	 so it could be prometheus?
[08:20:12] <wikibugs>	 (03PS15) 10Kormat: Add native mysql spicerack module. [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409)
[08:20:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove access for nathante [puppet] - 10https://gerrit.wikimedia.org/r/607722 (owner: 10Muehlenhoff)
[08:20:37] <jynus>	 let me refresh prometheus config
[08:20:43] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[08:20:52] <marostegui>	 so basically the connections have shifted from db1135 to db1097 with more or less the same number
[08:21:25] <jynus>	 master: db1135
[08:21:35] <jynus>	 replica only: db1117
[08:21:45] <marostegui>	 how come?
[08:21:50] <jynus>	 so there may be something missing on zarcillo
[08:21:55] <wikibugs>	 (03CR) 10Kormat: "Now that the code itself is pretty settled, i'll start working on tests." (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat)
[08:22:05] <marostegui>	 root@cumin1001:/home/marostegui# ./section m1 | grep db1097
[08:22:05] <marostegui>	 db1097.eqiad.wmnet	3306
[08:22:08] <jynus>	 what should be on replica?
[08:22:22] <marostegui>	 replica should be db1117 and master db1097
[08:22:25] <jynus>	 checking
[08:22:38] <kormat>	 marostegui: do you need an extra pair of eyes for anything?
[08:22:39] <marostegui>	 Updating zarcillo...
[08:22:39] <marostegui>	 [WARNING] Old master not found on zarcillo master list
[08:23:08] <marostegui>	 kormat: no, not needed at this point, thank you 
[08:23:25] <jynus>	 there you have it
[08:23:35] <jynus>	 let me see what was missing
[08:23:52] <jynus>	 db1097 : core
[08:24:03] <jynus>	 I think it should be misc
[08:24:03] <marostegui>	 ha
[08:24:06] <marostegui>	 yep
[08:24:11] <jynus>	 I update
[08:24:27] <jynus>	 so metrics were happening
[08:24:30] <marostegui>	 | m1      | eqiad | db1135     |
[08:24:34] <jynus>	 but were being sent to m1 on core
[08:24:37] <marostegui>	 Do you update that or I do?
[08:24:39] <jynus>	 I do
[08:24:42] <marostegui>	 ok
[08:25:34] <jynus>	 I will refresh prometheus now
[08:25:51] <jynus>	 now both are there
[08:26:16] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/607645 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn)
[08:26:19] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[08:26:22] <marostegui>	 goood
[08:26:34] <jynus>	 I wonder, however, why the script failed
[08:26:42] <jynus>	 because that should not be a cause
[08:26:47] <wikibugs>	 10Puppet, 10Toolforge, 10Documentation, 10User-srodlund: Document our GridEngine set up - https://phabricator.wikimedia.org/T88733 (10Aklapper) https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#GridEngine_Master seems to be closest here; may want to link from https://wikitech.wikimedia.org/wiki/He...
[08:27:26] <marostegui>	 jynus: I know why
[08:27:28] <marostegui>	 ZARCILLO_INSTANCE = 'db1115'  # instance_name:port format
[08:27:45] <marostegui>	 that needs to be db2093
[08:27:56] <jynus>	 ah
[08:28:08] <jynus>	 master right now is: m1      | eqiad | db1135
[08:28:17] <jynus>	 what should it say?
[08:28:21] <marostegui>	 db1097
[08:29:06] <wikibugs>	 (03PS1) 10Marostegui: switchover.py: Change zarcillo instance [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/607725 (https://phabricator.wikimedia.org/T254556)
[08:29:18] <marostegui>	 jynus: ^
[08:29:47] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] switchover.py: Change zarcillo instance [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/607725 (https://phabricator.wikimedia.org/T254556) (owner: 10Marostegui)
[08:29:57] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] switchover.py: Change zarcillo instance [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/607725 (https://phabricator.wikimedia.org/T254556) (owner: 10Marostegui)
[08:30:02] <jynus>	 we should have a better discovery method than a global constant :-D
[08:30:12] <marostegui>	 or a cname :)
[08:30:44] <jynus>	 master: db1097
[08:30:58] <marostegui>	 coool
[08:30:58] <jynus>	 replicas: db1117:13321, db1135:9104
[08:31:07] <marostegui>	 and codfw?
[08:32:03] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[08:32:30] <jynus>	 it only shows replica: db2078:13321
[08:32:32] <jynus>	 no master
[08:32:39] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[08:32:52] <marostegui>	 that's strange
[08:32:57] <marostegui>	 db2132 is codfw's master for m1
[08:33:18] <jynus>	 on zarcillo it says: m1      | codfw | db2132
[08:33:30] <marostegui>	 yeah, that is the one
[08:33:57] <jynus>	 it is on core too
[08:34:13] <jynus>	 maybe you can take care of reviewing the hosts that are on core that should be on misc?
[08:34:21] <marostegui>	 yeah
[08:34:23] <marostegui>	 I can do that
[08:34:28] <jynus>	 this will fail until we have it automated
[08:34:36] <jynus>	 so it is a best effort
[08:34:52] <marostegui>	 Can you fix db2132 and I will take care of the rest of hosts?
[08:34:55] <jynus>	 metrics are not lost, but they are added to the wrong group
[08:34:59] <icinga-wm>	 RECOVERY - Check systemd state on an-tool1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:35:06] <jynus>	 I can update  db2132 yes
[08:35:13] <marostegui>	 cool, I will check the rest of misc sections
[08:35:43] <jynus>	 [zarcillo]> update instances set `group` = 'misc' where name ='db2132';
[08:36:06] <marostegui>	 ha, good timing, now I don't have to do a describe instances;
[08:36:19] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:37:20] <jynus>	 let's move convo to databases
[08:37:25] <jynus>	 as maintenance seems done
[08:37:41] <marostegui>	 +1
[08:40:03] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[08:40:14] <wikibugs>	 10Operations, 10DBA, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10jbond) >>! In T256120#6252975, @Marostegui wrote: > Should be fixed now.  Thanks although I'm now getting  "Error message: CREATE command denied...
[08:41:42] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (10hashar) There are ferm rules: ` iptables --list -v|grep bacula    36  2160 ACCEPT     tcp  --  any    any     b...
[08:41:57] <wikibugs>	 10Operations, 10DBA, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10Marostegui) Fixed
[08:42:34] <hashar>	 !log releases2002: restarted bacula-fd to take in account the puppet provided configuration  # T247652
[08:42:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:39] <stashbot>	 T247652: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652
[08:43:12] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (10hashar) @jcrespo should be good now: ` # netstat -tlnp|grep bacula tcp        0      0 0.0.0.0:9102...
[08:45:28] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (10jcrespo) Thanks, it ran successfully now:  ` 239105  Full          21    22.89 K  OK       25-Jun-20 08:44 rele...
[08:46:20] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to PROD for lmata (SRE) - https://phabricator.wikimedia.org/T254818 (10jbond) 05Open→03Resolved >>! In T254818#6253018, @Dzahn wrote: > "Membership of ops group in LDAP and YAML are not identical: ['lmata']"  This is fixed now thanks @Dzahn
[08:46:59] <icinga-wm>	 ACKNOWLEDGEMENT - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) alexandros kosiaris https://phabricator.wikimedia.org/T256358 - The acknowledgement ex
[08:46:59] <icinga-wm>	 -26 18:45:49. https://wikitech.wikimedia.org/wiki/Wikifeeds
[08:50:48] <wikibugs>	 (03PS1) 10Marostegui: db1135: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/607728 (https://phabricator.wikimedia.org/T253217)
[08:51:30] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1135: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/607728 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui)
[08:51:52] <marostegui>	 moritzm: is your change ok to merge?
[08:54:38] <jynus>	 ema: indeed it went away, but job=wikidough site=codfw is still ongoing
[08:54:38] <wikibugs>	 (03CR) 10Hashar: ci: switch integration.wikimedia.org to scap DocumentRoot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607076 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar)
[08:55:51] <ema>	 jynus: right, probably due to ab8a948a. I'll investigate further, thanks!
[08:56:41] <logmsgbot>	 !log joal@deploy1001 Started deploy [analytics/refinery@4aba370]: Analytics fix over weekly train [analytics/refinery@4aba370]
[08:56:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:06] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix status check for Kerberos principal deletion [puppet] - 10https://gerrit.wikimedia.org/r/607729
[08:58:05] <jynus>	 elukey: with your permission I will ack analytics1030 alerts (scheduled for decom) to also remove them from the unack list
[08:58:29] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[08:59:03] <jynus>	 I will also ack services for netbox-dev2001 which is being built and it is WIP
[09:01:25] <vgutierrez>	 !log restarting acme-chief instances to catch up on kernel updates
[09:01:27] <jynus>	 I think that will help with ongoing issues discoverability, I will revert if that impacts any of your work
[09:01:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:06] <jynus>	 idem for analytics1039, scheduled for decom
[09:02:17] <elukey>	 jynus: sure no problem, where did you find the alarms? I thought I had everything acked in icinga, the host is part of the test cluster (we'll replace it with proper nodes not OOW soon)
[09:02:29] <jynus>	 it is disabled
[09:02:33] <jynus>	 so no issue
[09:02:43] <jynus>	 but if you search for ongoing alerts, it lists it
[09:02:50] <elukey>	 ahh all of them
[09:02:51] <jynus>	 so it doesn't hurt to ack it
[09:02:53] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[09:03:01] <jynus>	 if you don't mind, not really needed
[09:03:12] <elukey>	 +1 thanks a lot for the cleanup
[09:03:23] <jynus>	 so you did nothing wrong
[09:03:39] <jynus>	 but it helps me tracking other ongoing issues
[09:04:37] <jynus>	 for example, https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=16&hoststatustypes=3&serviceprops=2097162 only shows 2 alerts
[09:04:57] <jynus>	 and someone was working on one
[09:05:51] <jynus>	 or the wikifeeds thing that alex mentioned stands out more
[09:06:16] <jynus>	 just a preference of mine, but hopefully it is helpful for others too
[09:08:20] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[09:13:08] <logmsgbot>	 !log joal@deploy1001 Finished deploy [analytics/refinery@4aba370]: Analytics fix over weekly train [analytics/refinery@4aba370] (duration: 16m 27s)
[09:13:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:30] <logmsgbot>	 !log joal@deploy1001 Started deploy [analytics/refinery@4aba370] (thin): Analytics fix over weekly train THIN [analytics/refinery@4aba370]
[09:13:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:34] <akosiaris>	 I wish I could ack a flapping service somehow in icinga
[09:13:40] <logmsgbot>	 !log joal@deploy1001 Finished deploy [analytics/refinery@4aba370] (thin): Analytics fix over weekly train THIN [analytics/refinery@4aba370] (duration: 00m 10s)
[09:13:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:21] <godog>	 "catch the service while it is flapping (away)"
[09:15:39] <icinga-wm>	 PROBLEM - Thanos compact is halted on icinga1001 is CRITICAL: cluster=thanos instance=thanos-fe2001 job=thanos-compact prometheus=ops site=codfw https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[09:16:48] <akosiaris>	 wikifeeds seems weird up to now
[09:17:10] <akosiaris>	 it's like suddenly users from 2 countries in the world decided to use the app more
[09:17:52] <akosiaris>	 the 2 countries part is not exactly right of course, it's just that those are really large countries
[09:18:41] <wikibugs>	 (03CR) 10Jcrespo: "> Patch Set 1:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607258 (owner: 10Jcrespo)
[09:19:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: update scheme for wikidough (improves ab8a948a) [puppet] - 10https://gerrit.wikimedia.org/r/607570 (owner: 10Ssingh)
[09:21:19] <vgutierrez>	 !log rolling restart of  ncredir instances to catch up on kernel updates
[09:21:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:38] <godog>	 I'm looking at the thanos compact alert
[09:25:15] <godog>	 and it ran out of space on the local host while compacting -.-
[09:25:17] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove cas-logstash from caches [puppet] - 10https://gerrit.wikimedia.org/r/607508 (https://phabricator.wikimedia.org/T246998)
[09:26:46] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/607645 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn)
[09:28:05] <godog>	 !log extend lv on thanos-fe2001 and restart thanos-compact
[09:28:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:44] <akosiaris>	 !log schedule downtime for eqiad wikifeeds as it's flapping too much without yet knowing why. T256358
[09:28:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:48] <stashbot>	 T256358: wikifeeds usage increased by 3x on 2020-06-24 - https://phabricator.wikimedia.org/T256358
[09:28:51] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[09:29:03] <wikibugs>	 (03CR) 10Muehlenhoff: releases::mediawiki:: support buster / PHP 7.3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607641 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn)
[09:29:59] <wikibugs>	 (03PS2) 10Jcrespo: cumin: backup all of /srv where a lot of deployment state may live [puppet] - 10https://gerrit.wikimedia.org/r/607258
[09:30:05] <icinga-wm>	 RECOVERY - Thanos compact is halted on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[09:33:29] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/607609 (owner: 10CDanis)
[09:34:15] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[09:34:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:22] <wikibugs>	 (03CR) 10Jcrespo: "Let me know if this version is ok:" [puppet] - 10https://gerrit.wikimedia.org/r/607258 (owner: 10Jcrespo)
[09:36:55] <icinga-wm>	 PROBLEM - Thanos compact has not run on icinga1001 is CRITICAL: 4.425e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[09:37:19] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:37:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:23] <wikibugs>	 (03PS1) 10Jforrester: ExtensionDistribution: Drop REL1_33, EOL'ed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607740 (https://phabricator.wikimedia.org/T256087)
[09:38:27] <akosiaris>	 found it I think. There seems to be a restbase deploy right before the wikifeeds issues start
[09:38:40] <akosiaris>	 I wonder whether I should roll back or leave it to the devs
[09:41:28] <wikibugs>	 (03PS1) 10Volans: mgmt: netbox-generated data for frack mgmt codfw [dns] - 10https://gerrit.wikimedia.org/r/607741 (https://phabricator.wikimedia.org/T233183)
[09:44:21] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:49:29] <icinga-wm>	 RECOVERY - Thanos compact has not run on icinga1001 is OK: (C)24 ge (W)12 ge 0.003265 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[09:51:02] <wikibugs>	 (03PS1) 10Mvolz: Update citoid to dcc45a42 [deployment-charts] - 10https://gerrit.wikimedia.org/r/607745
[09:52:57] <mvolz>	 Hey, I notice that citoid is listed in the services deployment windows here: https://wikitech.wikimedia.org/wiki/Deployments - but I mostly deploy citoid and I've never actually used any of those windows. 😅 Is this schedule supposed to be prescriptive or descriptive?
[09:53:25] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[09:53:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:09] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm
[09:57:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:32] <logmsgbot>	 !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[09:57:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:19] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[09:58:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:44] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[09:59:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:11] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs
[10:00:29] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 1461 days) https://wikitech.wikimedia.org/wiki/Logs
[10:00:34] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime
[10:00:35] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:00:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:38] <wikibugs>	 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10ops-monitoring-bot) Icinga downtime for 12:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: Memory upgrade ` kubestagetcd1004.eqiad.wmnet `
[10:00:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:43] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime
[10:00:44] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:00:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:48] <wikibugs>	 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10ops-monitoring-bot) Icinga downtime for 12:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: Memory upgrade ` ganeti1005.eqiad.wmnet `
[10:02:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 2:" [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn)
[10:04:37] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[10:04:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:41] <akosiaris>	 !log poweroff kubestagetcd1004 and ganeti1005 for T244530
[10:04:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:44] <stashbot>	 T244530: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530
[10:04:54] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Indeed re: ipv6 (see comment on I33596b)" [dns] - 10https://gerrit.wikimedia.org/r/607637 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn)
[10:04:59] <wikibugs>	 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10akosiaris)
[10:05:35] <wikibugs>	 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10akosiaris) @Jclark-ctr: ganeti1005 is ready. Fully depooled, downtimed and powered off.
[10:07:21] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm
[10:07:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:32] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm
[10:07:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:41] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm
[10:07:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:49] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm
[10:07:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:27] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[10:12:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:37] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[10:12:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:26] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[10:13:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:25] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[10:17:27] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[10:17:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:17] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[10:18:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:07] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:22:03] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[10:22:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:47] <icinga-wm>	 PROBLEM - Check systemd state on ncredir2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:25:39] <wikibugs>	 (03CR) 10Ayounsi: "> It could be evaluated if this record should be in mgmt.frack or just frack as it is right now." [dns] - 10https://gerrit.wikimedia.org/r/607741 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans)
[10:25:54] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[10:25:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:18] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[10:27:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:15] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "> Thinking about it, .mgmt.frack sounds better, as it's an IP in that vlan. But no strong opinion, so whatever is easier to manage." [dns] - 10https://gerrit.wikimedia.org/r/607741 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans)
[10:32:11] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[10:32:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:58] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[10:34:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:21] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Introduce kubernetes[12]01[56] [puppet] - 10https://gerrit.wikimedia.org/r/607752 (https://phabricator.wikimedia.org/T256236)
[10:34:46] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "Surely a pebcak on my side, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/607729 (owner: 10Muehlenhoff)
[10:35:39] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Introduce kubernetes[12]01[56] [puppet] - 10https://gerrit.wikimedia.org/r/607752 (https://phabricator.wikimedia.org/T256236) (owner: 10Alexandros Kosiaris)
[10:36:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Fix status check for Kerberos principal deletion [puppet] - 10https://gerrit.wikimedia.org/r/607729 (owner: 10Muehlenhoff)
[10:38:51] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[10:38:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [mostread] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:38:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:39:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:40:37] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:40:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:40:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:41:11] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[10:41:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: use system openjdk 11 for logging ES7 instances [puppet] - 10https://gerrit.wikimedia.org/r/607542 (https://phabricator.wikimedia.org/T252913) (owner: 10Herron)
[10:41:47] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[10:45:19] <moritzm>	 !log rolling reboot of ms-be[2044-2056]
[10:45:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:38] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Add kubernetes[12]01[56] [homer/public] - 10https://gerrit.wikimedia.org/r/607754 (https://phabricator.wikimedia.org/T256236)
[10:45:52] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[10:45:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:10] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[10:50:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:32] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[10:53:30] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[10:53:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:04] <wikibugs>	 (03CR) 10Jforrester: "We use the PHP version for docroot hosting for a few things still, don't we? doc1001 is for most things, but there's still… coverage repor" [puppet] - 10https://gerrit.wikimedia.org/r/607641 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn)
[10:56:45] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[10:56:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Remove cas-logstash from caches [puppet] - 10https://gerrit.wikimedia.org/r/607508 (https://phabricator.wikimedia.org/T246998) (owner: 10Muehlenhoff)
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European mid-day backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200625T1100).
[11:00:26] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[11:00:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:30] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[11:01:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:01:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:05:45] <wikibugs>	 (03PS4) 10Jcrespo: mariadb-backups: Move x1 backup source from db1095 to db1102 [puppet] - 10https://gerrit.wikimedia.org/r/607510 (https://phabricator.wikimedia.org/T254871)
[11:06:16] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:06:22] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[11:06:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:10] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/607754 (https://phabricator.wikimedia.org/T256236) (owner: 10Alexandros Kosiaris)
[11:12:25] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the patch!" [puppet] - 10https://gerrit.wikimedia.org/r/607258 (owner: 10Jcrespo)
[11:14:05] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Add kubernetes[12]01[56] (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/607754 (https://phabricator.wikimedia.org/T256236) (owner: 10Alexandros Kosiaris)
[11:16:11] <awight>	 I'd like to add something to the BACON window, I can deploy it myself.
[11:17:32] <Urbanecm>	 awight: go ahead :)
[11:18:44] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "> Patch Set 15:" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat)
[11:21:06] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[11:21:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:57] <awight>	 ty!
[11:24:09] <wikibugs>	 (03PS1) 10Awight: [beta] Enable mobile view for dewiki survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607762 (https://phabricator.wikimedia.org/T253112)
[11:24:28] <wikibugs>	 (03CR) 10Awight: [C: 03+2] "beta-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607762 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight)
[11:25:06] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[11:25:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:18] <wikibugs>	 (03Merged) 10jenkins-bot: [beta] Enable mobile view for dewiki survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607762 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight)
[11:27:54] <moritzm>	 !log rolling reboot of  ms-be[1044-1059].eqiad.wmnet
[11:28:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:02] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:34:52] <wikibugs>	 (03PS1) 10Awight: Enable WMDE Tech Wishes survey configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607763 (https://phabricator.wikimedia.org/T253112)
[11:35:14] <wikibugs>	 (03CR) 10Awight: [C: 03+2] "BACON" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607763 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight)
[11:36:03] <wikibugs>	 (03Merged) 10jenkins-bot: Enable WMDE Tech Wishes survey configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607763 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight)
[11:36:37] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[11:36:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:34] <wikibugs>	 (03PS1) 10Elukey: Set notebook100[3,4] with role::insetup [puppet] - 10https://gerrit.wikimedia.org/r/607764 (https://phabricator.wikimedia.org/T256363)
[11:38:54] <logmsgbot>	 !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: BACON: [[gerrit:607763|Enable WMDE Tech Wishes survey configuration (T253112)]] (duration: 01m 09s)
[11:38:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:59] <stashbot>	 T253112: Create survey for TechWish prototype announcements on dewiki and metawiki - https://phabricator.wikimedia.org/T253112
[11:39:39] <wikibugs>	 (03PS2) 10Elukey: Set notebook100[3,4] with role::insetup [puppet] - 10https://gerrit.wikimedia.org/r/607764 (https://phabricator.wikimedia.org/T256363)
[11:41:29] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: toolforge: mailrelay: introduce SRS to correctly envelope forwarded emails [puppet] - 10https://gerrit.wikimedia.org/r/607279 (https://phabricator.wikimedia.org/T120225)
[11:41:31] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[11:41:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:09] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[11:45:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:30] <wikibugs>	 (03PS1) 10Awight: Enable QuickSurveys on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607767 (https://phabricator.wikimedia.org/T253112)
[11:46:45] <wikibugs>	 (03CR) 10Awight: [C: 03+2] "BACON" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607767 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight)
[11:47:36] <wikibugs>	 (03Merged) 10jenkins-bot: Enable QuickSurveys on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607767 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight)
[11:48:30] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[11:48:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:42] <wikibugs>	 10Operations: Create ssh keypair for integration/docroot deployment with scap - https://phabricator.wikimedia.org/T256138 (10ema) Key generated and added to the private puppet repo under `modules/secret/secrets/keyholder`.
[11:49:59] <logmsgbot>	 !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: BACON: [[gerrit:607767|Enable QuickSurveys on metawiki (T253112)]] (duration: 01m 05s)
[11:50:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:34] <stashbot>	 T253112: Create survey for TechWish prototype announcements on dewiki and metawiki - https://phabricator.wikimedia.org/T253112
[11:50:48] <wikibugs>	 10Operations: Create ssh keypair for integration/docroot deployment with scap - https://phabricator.wikimedia.org/T256138 (10ema) a:05ema→03None
[11:51:04] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[11:51:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:34] <wikibugs>	 (03PS1) 10Ssingh: wikidough: organize shared fake passwords [labs/private] - 10https://gerrit.wikimedia.org/r/607769
[11:55:01] <awight>	 !log EU BACON is cooked
[11:55:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:03] <wikibugs>	 (03CR) 10Ssingh: [V: 03+2 C: 03+2] wikidough: organize shared fake passwords [labs/private] - 10https://gerrit.wikimedia.org/r/607769 (owner: 10Ssingh)
[11:55:32] <moritzm>	 !log installing python3.4 security updates
[11:55:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:58] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[11:56:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:02] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/23471/" [puppet] - 10https://gerrit.wikimedia.org/r/607764 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey)
[12:03:58] <icinga-wm>	 PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[12:05:59] <wikibugs>	 (03PS1) 10Elukey: Clean up old reference to notebook100[3,4] and set PXE to Buster [puppet] - 10https://gerrit.wikimedia.org/r/607771 (https://phabricator.wikimedia.org/T256363)
[12:07:17] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Clean up old reference to notebook100[3,4] and set PXE to Buster [puppet] - 10https://gerrit.wikimedia.org/r/607771 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey)
[12:08:08] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:09:38] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[12:09:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:01] <wikibugs>	 (03PS1) 10Ssingh: prometheus: use the correct password for the wikidough job [puppet] - 10https://gerrit.wikimedia.org/r/607772 (https://phabricator.wikimedia.org/T252132)
[12:13:18] <icinga-wm>	 RECOVERY - Check systemd state on ncredir2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:13:30] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[12:13:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:34] <icinga-wm>	 RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5010 is OK: HTTP OK: HTTP/1.0 200 OK - 23528 bytes in 0.750 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[12:14:36] <legoktm>	 "wikidough" is one of the greatest names I've seen in a long time
[12:16:19] <wikibugs>	 (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/23472/" [puppet] - 10https://gerrit.wikimedia.org/r/607772 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh)
[12:16:28] <wikibugs>	 (03PS1) 10Awight: Enable TechWishes survey for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607773 (https://phabricator.wikimedia.org/T253112)
[12:16:40] <sukhe>	 legoktm: haha thank you!
[12:17:03] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[12:17:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:52] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: use the correct password for the wikidough job [puppet] - 10https://gerrit.wikimedia.org/r/607772 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh)
[12:19:21] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] prometheus: use the correct password for the wikidough job [puppet] - 10https://gerrit.wikimedia.org/r/607772 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh)
[12:21:12] <wikibugs>	 (03PS1) 10Vgutierrez: Release 8.0.8-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/607774
[12:21:57] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[12:21:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:34] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:25:20] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime
[12:25:22] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime
[12:25:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:42] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=wikidough site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:26:16] <moritzm>	 !log installing libjpeg-turbo security updates
[12:26:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:50] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:27:45] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi TTN-0004198221 / TTN-0004197860 - The acknowledgement expires at: 2020-06-25 18:00:00. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:27:45] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi TTN-0004198221 / TTN-0004197860 - The acknowledgement expires at: 2020-06-25 18:00:00. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:27:56] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:27:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:01] <XioNoX>	 "Our technicians ETA to the Ft. Worth site has been updated to approximately 4 hours." so ACKing it for 6h
[12:28:43] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] fix multiple invocations of systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/607609 (owner: 10CDanis)
[12:30:22] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[12:30:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:44] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[12:30:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:28] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] Add native mysql spicerack module. [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat)
[12:32:33] <moritzm>	 !log installing libssh2 security updates
[12:32:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:38] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[12:34:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:27] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[12:39:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:50] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=wikidough site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:40:53] <wikibugs>	 (03PS1) 10Elukey: Remove notebook1003 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/607779 (https://phabricator.wikimedia.org/T256363)
[12:41:17] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:41:50] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Remove notebook1003 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/607779 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey)
[12:42:18] <moritzm>	 !log installing libmspack security updates
[12:42:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:04] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[12:44:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:21] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[12:44:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:37] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[12:45:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:35] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[12:47:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:48:44] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: toolforge: mailrelay: introduce SRS to correctly envelope forwarded emails [puppet] - 10https://gerrit.wikimedia.org/r/607279 (https://phabricator.wikimedia.org/T120225)
[12:51:13] <wikibugs>	 (03PS1) 10Elukey: Rename notebook1003 records to an-launcher1002 records [dns] - 10https://gerrit.wikimedia.org/r/607780 (https://phabricator.wikimedia.org/T256363)
[12:51:19] <elukey>	 volans: if you have a sec --^
[12:51:22] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: mailrelay: introduce SRS to correctly envelope forwarded emails [puppet] - 10https://gerrit.wikimedia.org/r/607279 (https://phabricator.wikimedia.org/T120225) (owner: 10Arturo Borrero Gonzalez)
[12:51:27] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[12:51:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:42] <volans>	 sure
[12:53:00] <elukey>	 ipv6 records are missing, will add them later on
[12:54:21] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/607780 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey)
[12:54:46] <elukey>	 \o/
[12:54:51] <icinga-wm>	 PROBLEM - Host ganeti1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[12:54:54] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Rename notebook1003 records to an-launcher1002 records [dns] - 10https://gerrit.wikimedia.org/r/607780 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey)
[12:55:26] <elukey>	 !log rename notebook1003 to an-launcher1002 - T256363
[12:55:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:30] <stashbot>	 T256363: Repurpose notebook100[3,4]  - https://phabricator.wikimedia.org/T256363
[12:55:39] <icinga-wm>	 RECOVERY - Maps - OSM synchronization lag - codfw on icinga1001 is OK: (C)2.592e+05 ge (W)1.764e+05 ge 7.174e+04 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1
[12:55:56] <volans>	 elukey: ack, it's consistent with what's defined already
[12:57:17] <wikibugs>	 (03PS5) 10Arturo Borrero Gonzalez: toolforge: mailrelay: introduce SRS to correctly envelope forwarded emails [puppet] - 10https://gerrit.wikimedia.org/r/607279 (https://phabricator.wikimedia.org/T120225)
[12:59:18] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: mailrelay: introduce SRS to correctly envelope forwarded emails [puppet] - 10https://gerrit.wikimedia.org/r/607279 (https://phabricator.wikimedia.org/T120225) (owner: 10Arturo Borrero Gonzalez)
[12:59:52] <wikibugs>	 (03PS1) 10Elukey: Add an-launcher1002 to puppet config [puppet] - 10https://gerrit.wikimedia.org/r/607781 (https://phabricator.wikimedia.org/T256363)
[13:00:04] <jouncebot>	 brennen and hashar: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - American+European Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200625T1300).
[13:00:19] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add an-launcher1002 to puppet config [puppet] - 10https://gerrit.wikimedia.org/r/607781 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey)
[13:00:57] <wikibugs>	 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10Jclark-ctr) @akosiaris   ganeti1005 is finished and booting up now  Thanks!
[13:01:23] <icinga-wm>	 RECOVERY - Host ganeti1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[13:01:59] <wikibugs>	 (03PS2) 10Elukey: Add an-launcher1002 to puppet config [puppet] - 10https://gerrit.wikimedia.org/r/607781 (https://phabricator.wikimedia.org/T256363)
[13:02:38] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add an-launcher1002 to puppet config [puppet] - 10https://gerrit.wikimedia.org/r/607781 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey)
[13:02:47] <moritzm>	 !log installing 4.9.210-1+deb9u1~deb8u1 on jessie hosts (fixed kernel for recent cacheoutattack CPU leaks)
[13:02:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:12] <wikibugs>	 (03PS1) 10Jbond: jpa: add workaround for HikariCP dependency clash [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607782
[13:11:36] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thanos: set consistency-delay on store [puppet] - 10https://gerrit.wikimedia.org/r/607783 (https://phabricator.wikimedia.org/T233956)
[13:13:46] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[13:13:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:30] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install <hadoop testing nodes> - https://phabricator.wikimedia.org/T255520 (10Jclark-ctr) @elukey any rows that these need to be in?
[13:17:52] <wikibugs>	 (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/23474/" [puppet] - 10https://gerrit.wikimedia.org/r/607783 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi)
[13:19:50] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[13:19:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:09] <wikibugs>	 (03PS1) 10Kormat: mysql: Spruce up documentation formatting [software/spicerack] - 10https://gerrit.wikimedia.org/r/607786
[13:25:47] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001 job=burrow partition={0,1,2,3,4,5} site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[13:26:38] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[13:26:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:03] <godog>	 mmhh logstash1007's unhappy, I'll bounce logstash there
[13:28:31] <godog>	 !log bounce logstash on logstash1007
[13:28:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607782 (owner: 10Jbond)
[13:30:31] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[13:30:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:53] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] jpa: add workaround for HikariCP dependency clash [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607782 (owner: 10Jbond)
[13:33:09] <icinga-wm>	 RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[13:34:58] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] Release 8.0.8-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/607774 (owner: 10Vgutierrez)
[13:36:00] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[13:36:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:47] <wikibugs>	 10Operations, 10DBA, 10DC-Ops, 10Sustainability (Incident Prevention): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10Marostegui) Can this task be closed? By default hosts reimage now but they do kee...
[13:40:54] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[13:40:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:01] <wikibugs>	 (03PS3) 10Reedy: Replace PasswordNotInLargeBlacklist with PasswordNotInCommonList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605904
[13:47:05] <Reedy>	 jouncebot: now
[13:47:06] <jouncebot>	 For the next 1 hour(s) and 12 minute(s): Mediawiki train - American+European Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200625T1300)
[13:47:26] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] Replace PasswordNotInLargeBlacklist with PasswordNotInCommonList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605904 (owner: 10Reedy)
[13:48:29] <wikibugs>	 (03Merged) 10jenkins-bot: Replace PasswordNotInLargeBlacklist with PasswordNotInCommonList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605904 (owner: 10Reedy)
[13:49:54] <logmsgbot>	 !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Replace PasswordNotInLargeBlacklist with PasswordNotInCommonList (duration: 01m 06s)
[13:49:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:23] <logmsgbot>	 !log reedy@deploy1001 Synchronized wmf-config/CommonSettings-labs.php: Replace PasswordNotInLargeBlacklist with PasswordNotInCommonList (duration: 01m 05s)
[13:51:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:32] <vgutierrez>	 !log upload trafficserver 8.0.8 to apt.wm.o (buster)
[13:51:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:31] <James_F>	 Reedy: ❤️
[13:52:52] <Reedy>	 Just noticed it was still sitting there, so might as well ship it
[13:54:01] <wikibugs>	 (03PS3) 10Reedy: Remove OAuthReplaceMessage hook subscriber [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601910 (https://phabricator.wikimedia.org/T254301)
[13:54:09] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] Remove OAuthReplaceMessage hook subscriber [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601910 (https://phabricator.wikimedia.org/T254301) (owner: 10Reedy)
[13:55:01] <wikibugs>	 (03Merged) 10jenkins-bot: Remove OAuthReplaceMessage hook subscriber [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601910 (https://phabricator.wikimedia.org/T254301) (owner: 10Reedy)
[13:56:19] <vgutierrez>	 !log upgrade ATS in ulsfo to version 8.0.8
[13:56:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:26] <logmsgbot>	 !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: T254301 Remove OAuthReplaceMessage hook subscriber (duration: 01m 05s)
[13:57:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:14] <stashbot>	 T254301: Replace OAuthReplaceMessage subscriber in CommonSettings.php - https://phabricator.wikimedia.org/T254301
[13:58:56] <wikibugs>	 (03PS1) 10Jbond: apereo_cas: Enable SSL for DB connections [puppet] - 10https://gerrit.wikimedia.org/r/607793 (https://phabricator.wikimedia.org/T256113)
[13:59:26] <wikibugs>	 (03PS1) 10Muehlenhoff: Handle CAS war updates [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607794 (https://phabricator.wikimedia.org/T233950)
[13:59:38] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[13:59:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:59] <wikibugs>	 (03PS3) 10Reedy: Use structured logging fields for xff logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607389
[14:01:14] <wikibugs>	 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul)
[14:01:51] <wikibugs>	 (03PS2) 10Krinkle: logging: Fold Parsoid into type:mediawiki, add 'servergroup:' instead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606038 (https://phabricator.wikimedia.org/T255627)
[14:02:07] <wikibugs>	 (03PS2) 10Krinkle: mediawiki,logstash: Update type:parsoid-php -> type:mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/606049 (https://phabricator.wikimedia.org/T255627)
[14:02:47] <wikibugs>	 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul)
[14:03:26] <wikibugs>	 (03CR) 10Krinkle: Use structured logging fields for xff logs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607389 (owner: 10Reedy)
[14:03:29] <wikibugs>	 10Operations, 10DBA, 10DC-Ops, 10Sustainability (Incident Prevention): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10jcrespo) a:03Kormat
[14:04:01] <wikibugs>	 (03CR) 10Reedy: Use structured logging fields for xff logs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607389 (owner: 10Reedy)
[14:04:22] <logmsgbot>	 !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db2088:3312', diff saved to https://phabricator.wikimedia.org/P11663 and previous config saved to /var/cache/conftool/dbconfig/20200625-140421-marostegui.json
[14:04:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:33] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[14:04:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2104', diff saved to https://phabricator.wikimedia.org/P11664 and previous config saved to /var/cache/conftool/dbconfig/20200625-140519-marostegui.json
[14:05:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:28] <marostegui>	 !log Stop MySQL on db2104 and db2088:3312
[14:05:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:46] <wikibugs>	 (03PS4) 10Reedy: Use structured logging fields for xff logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607389
[14:12:08] <wikibugs>	 (03Abandoned) 10Filippo Giunchedi: DNM: adjust logstash index template for ES 7 [puppet] - 10https://gerrit.wikimedia.org/r/545566 (https://phabricator.wikimedia.org/T235891) (owner: 10Filippo Giunchedi)
[14:12:53] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[14:13:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:21] <wikibugs>	 (03Abandoned) 10Filippo Giunchedi: swift: add role::swift::swiftrepl to ms-fe1001 [puppet] - 10https://gerrit.wikimedia.org/r/254412 (owner: 10Filippo Giunchedi)
[14:14:49] <wikibugs>	 (03Abandoned) 10Filippo Giunchedi: swift: add swift replication support via swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/254411 (owner: 10Filippo Giunchedi)
[14:16:33] <wikibugs>	 (03PS1) 10Faidon Liambotis: Allow SELECTED_PATH selection for IXP routes as well [homer/public] - 10https://gerrit.wikimedia.org/r/607800
[14:17:40] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[14:17:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:52] <wikibugs>	 10Operations, 10ORES, 10Scoring-platform-team: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331 (10akosiaris) Couple of more benefits of k8s I forgot to mention yesterday  * Ability for >1 deployments. This might be beneficial from a product perspective, e.g. create an ORE...
[14:18:26] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Allow SELECTED_PATH selection for IXP routes as well [homer/public] - 10https://gerrit.wikimedia.org/r/607800 (owner: 10Faidon Liambotis)
[14:19:17] <wikibugs>	 10Operations, 10DBA, 10DC-Ops, 10Sustainability (Incident Prevention): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10Kormat) From the perspective of #dba, this issue is mostly resolved. Most DB mach...
[14:19:24] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[14:19:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:35] <wikibugs>	 10Operations, 10DC-Ops, 10SRE-tools, 10Sustainability (Incident Prevention): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10Kormat)
[14:19:40] <vgutierrez>	 !log upgrade ATS in eqsin to version 8.0.8
[14:19:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:47] <wikibugs>	 10Operations, 10DC-Ops, 10SRE-tools, 10Sustainability (Incident Prevention): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10Kormat) a:05Kormat→03None
[14:19:55] <wikibugs>	 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10akosiaris)
[14:20:23] <icinga-wm>	 PROBLEM - Host scs-a1-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[14:20:51] <wikibugs>	 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10akosiaris) @Jclark-ctr Excellent. I started the process of emptying ganeti1006 (and filling ganeti1005), that should take quite a while, but we should be on time for next Thursday. Many thanks!
[14:21:06] <wikibugs>	 (03CR) 10Jbond: Handle CAS war updates (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607794 (https://phabricator.wikimedia.org/T233950) (owner: 10Muehlenhoff)
[14:24:19] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[14:24:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:02] <wikibugs>	 (03PS2) 10Muehlenhoff: Handle CAS war updates [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607794 (https://phabricator.wikimedia.org/T233950)
[14:26:06] <wikibugs>	 (03CR) 10Muehlenhoff: Handle CAS war updates (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607794 (https://phabricator.wikimedia.org/T233950) (owner: 10Muehlenhoff)
[14:26:13] <icinga-wm>	 RECOVERY - Host scs-a1-codfw is UP: PING WARNING - Packet loss = 50%, RTA = 36.74 ms
[14:29:41] <papaul>	 !log replacing mr1-codfw
[14:29:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:30] <XioNoX>	 papaul: let me know if you need any help or when you're done
[14:31:43] <papaul>	 XioNoX: sure thanks will let you know
[14:32:08] <XioNoX>	 and parent/child is working fine, all mgmt show up as UNREACH in icinga, and don't alert here
[14:32:19] <papaul>	 XioNoX: xool
[14:32:21] <papaul>	 cool
[14:32:29] <icinga-wm>	 PROBLEM - Host asw-b-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[14:32:30] <wikibugs>	 (03CR) 10Alexandros Kosiaris: Add kubernetes[12]01[56] (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/607754 (https://phabricator.wikimedia.org/T256236) (owner: 10Alexandros Kosiaris)
[14:32:45] <icinga-wm>	 PROBLEM - Host asw-a-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[14:32:59] <XioNoX>	 eh, maybe not all, but it's fine it's only a few
[14:33:02] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Add kubernetes[12]01[56] [homer/public] - 10https://gerrit.wikimedia.org/r/607754 (https://phabricator.wikimedia.org/T256236)
[14:33:05] <icinga-wm>	 PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[14:33:33] <icinga-wm>	 PROBLEM - Host fasw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[14:33:39] <wikibugs>	 (03CR) 10Ema: [C: 03+1] Disable HTCP purging everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607593 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko)
[14:34:13] <icinga-wm>	 PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[14:34:45] <icinga-wm>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:34:55] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:36:07] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:38:25] <wikibugs>	 (03PS1) 10Reedy: Fix name of PasswordNotInCommonList in CommonSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607805 (https://phabricator.wikimedia.org/T256374)
[14:38:51] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] Fix name of PasswordNotInCommonList in CommonSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607805 (https://phabricator.wikimedia.org/T256374) (owner: 10Reedy)
[14:39:41] <wikibugs>	 (03Merged) 10jenkins-bot: Fix name of PasswordNotInCommonList in CommonSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607805 (https://phabricator.wikimedia.org/T256374) (owner: 10Reedy)
[14:41:40] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (10Dzahn) >>! In T247652#6255779, @jcrespo wrote: > Thanks, it ran successfully now: >  > ` > 239105  Full...
[14:43:32] <vgutierrez>	 !log upgrade ATS in esams to version 8.0.8
[14:43:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:57] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime
[14:48:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:21] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install <hadoop testing nodes> - https://phabricator.wikimedia.org/T255520 (10Jclark-ctr) @Cmjohnson    will be bulk uploading to netbox after leaving data center HOST , SWITCHPORT , RACK , UNIT, ASSET TAG an-test-worker1001 25 A3 25 WMF4833 an-t...
[14:50:12] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install 3 lightweight hadoop nodes - https://phabricator.wikimedia.org/T255518 (10Jclark-ctr) @Cmjohnson will be bulk uploading to netbox after leaving data center an-test-master1001 30 A5 30 WMF4836 an-test-master1002 36 C5 34 WMF4837 an-test-co...
[14:50:29] <icinga-wm>	 RECOVERY - Host asw-a-codfw is UP: PING OK - Packet loss = 0%, RTA = 41.83 ms
[14:50:30] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install <hadoop testing nodes> - https://phabricator.wikimedia.org/T255520 (10elukey) No preference, if possible one host per row, otherwise any arrangement that fits best for you!
[14:50:34] <icinga-wm>	 RECOVERY - Host asw-b-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.80 ms
[14:50:57] <icinga-wm>	 RECOVERY - Host asw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.86 ms
[14:51:11] <icinga-wm>	 RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:51:31] <XioNoX>	 yaaa
[14:51:31] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:51:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:51] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/607786 (owner: 10Kormat)
[14:51:53] <icinga-wm>	 RECOVERY - Host fasw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.68 ms
[14:52:00] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:52:02] <icinga-wm>	 RECOVERY - Host asw-d-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.76 ms
[14:52:20] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] mysql: Spruce up documentation formatting [software/spicerack] - 10https://gerrit.wikimedia.org/r/607786 (owner: 10Kormat)
[14:52:31] <XioNoX>	 papaul: SRX220H2? the old one again?
[14:52:38] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:54:15] <wikibugs>	 10Operations, 10DBA, 10SRE-tools, 10Patch-For-Review: Audit all cumin queries in switchdc scripts - https://phabricator.wikimedia.org/T243935 (10Kormat)
[14:54:35] <wikibugs>	 (03Merged) 10jenkins-bot: mysql: Spruce up documentation formatting [software/spicerack] - 10https://gerrit.wikimedia.org/r/607786 (owner: 10Kormat)
[14:55:13] <wikibugs>	 (03PS1) 10Elukey: Add ipv6 AAAA/PTR records for an-launcher1002 [dns] - 10https://gerrit.wikimedia.org/r/607808 (https://phabricator.wikimedia.org/T256363)
[14:55:14] <papaul>	 XioNoX: yes the new one got stuck at Octeon srx_300_ram# so trying to fix that
[14:55:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add ipv6 AAAA/PTR records for an-launcher1002 [dns] - 10https://gerrit.wikimedia.org/r/607808 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey)
[14:56:18] <papaul>	 XioNoX: turned it off, unplugged it to move it, plugged it back in, and after boot I get stuck at that
[14:56:54] <XioNoX>	 papaul: tried a reboot I guess?
[14:57:02] <papaul>	 XioNoX: yes doing that
[14:57:02] <XioNoX>	 reading https://forums.juniper.net/t5/SRX-Services-Gateway/After-abrupt-power-loss-SRX300-stack-in-Octeon-srx-300-ram/td-p/306366
[14:57:55] <wikibugs>	 (03PS2) 10Elukey: Add ipv6 AAAA/PTR records for an-launcher1002 [dns] - 10https://gerrit.wikimedia.org/r/607808 (https://phabricator.wikimedia.org/T256363)
[14:58:24] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] add logstash1030 and logstash1031 [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn)
[14:58:49] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add ipv6 AAAA/PTR records for an-launcher1002 [dns] - 10https://gerrit.wikimedia.org/r/607808 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey)
[14:59:12] <elukey>	 mutante: ah snap, do we want to coordinate?
[14:59:28] <XioNoX>	 a clean power-off/power on might solve it
[14:59:39] <mutante>	 elukey: i already started the authdns-update but my patch is not actually merged.. you can try again now for yours
[15:00:11] <mutante>	 elukey: you should be free to merge now
[15:00:31] <elukey>	 ack !
[15:02:51] <mutante>	 elukey: oh.. we even took the same IP.. i see :p
[15:03:08] <elukey>	 whattt
[15:03:14] <papaul>	 XioNoX: ok it is back up
[15:03:34] <mutante>	 elukey: we both saw the same "21" IP being free to take.. you can have it :)
[15:03:55] <mutante>	 i noticed because it needed manual rebase 
[15:04:20] <XioNoX>	 papaul: nice
[15:04:30] <elukey>	 mutante: I am confused, I just added AAAA/PTR records 
[15:05:48] <mutante>	 elukey: oh.. then it was somebody else who took it meanwhile. don't worry about it. i just need to fix it
[15:06:10] <elukey>	 mutante: ahhh fiiuuu, I thought something horrible happened :D good I can breathe again
[15:06:35] <papaul>	 XioNoX: it did boot up from backup so i am in the process of reinstalling Ju
[15:06:40] <mutante>	 elukey: no no.. it's ok :)
[15:07:17] <papaul>	 XioNoX: it did boot up in backup mode, so I'm in the process of reinstalling Junos on it; it will take a minute
[15:08:17] <XioNoX>	 papaul: if it's on the backup, a clean reboot might bring it back to primary. Ideally try to backup the config too. But a junos upgrade shouldn't impact it
[15:08:38] <papaul>	 XioNoX: ok doing a clean reboot 
[15:08:49] <wikibugs>	 (03CR) 10Jbond: "still not sure this is right?" (032 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607794 (https://phabricator.wikimedia.org/T233950) (owner: 10Muehlenhoff)
[15:08:59] <wikibugs>	 (03PS3) 10Dzahn: add logstash1030 and logstash1031 [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139)
[15:12:36] <wikibugs>	 (03PS4) 10Dzahn: add logstash1030 and logstash1031 [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139)
[15:13:09] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] add logstash1030 and logstash1031 [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn)
[15:13:16] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:15:06] <wikibugs>	 (03PS5) 10Dzahn: add logstash1030 and logstash1031 [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139)
[15:15:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] add logstash1030 and logstash1031 [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn)
[15:15:44] <papaul>	 XioNoX: clean boot now 
[15:15:59] <XioNoX>	 papaul: nice!
[15:15:59] <papaul>	 everything back to normal
[15:16:09] <mutante>	 incoming icinga alert flood for the mgmt's
[15:16:12] <mutante>	 but known then
[15:16:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] apereo_cas: Enable SSL for DB connections [puppet] - 10https://gerrit.wikimedia.org/r/607793 (https://phabricator.wikimedia.org/T256113) (owner: 10Jbond)
[15:16:57] <XioNoX>	 papaul: is it racked at its final location or do you have to power it down again?
[15:17:58] <papaul>	 XioNoX: it is at final location no more powering down again
[15:18:40] <wikibugs>	 (03PS6) 10Dzahn: add logstash1030 and logstash1031 [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139)
[15:18:57] <XioNoX>	 cool
[15:19:08] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358
[15:19:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:13] <stashbot>	 T256358: wikifeeds usage increased by 3x on 2020-06-24 - https://phabricator.wikimedia.org/T256358
[15:20:07] <XioNoX>	 papaul: I'm connected to it via oob, let me know if it's fully cabled
[15:20:22] <papaul>	 XioNoX: give me a minute
[15:20:30] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:20:45] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358 (duration: 01m 37s)
[15:20:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:44] <wikibugs>	 (03PS1) 10Jbond: apereo_cas: set db dialect to MariaDBDialect [puppet] - 10https://gerrit.wikimedia.org/r/607813
[15:22:25] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] add logstash1030 and logstash1031 [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn)
[15:22:27] <wikibugs>	 (03PS5) 10Reedy: Use structured logging fields for xff logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607389
[15:22:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] apereo_cas: set db dialect to MariaDBDialect [puppet] - 10https://gerrit.wikimedia.org/r/607813 (owner: 10Jbond)
[15:22:52] <wikibugs>	 (03PS6) 10Reedy: Use structured logging fields for xff logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607389
[15:23:00] <Reedy>	 jouncebot: now
[15:23:00] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 36 minute(s)
[15:23:02] <icinga-wm>	 PROBLEM - Host ms-be2051.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:23:06] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, take 2
[15:23:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:12] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] Use structured logging fields for xff logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607389 (owner: 10Reedy)
[15:23:21] <papaul>	 XioNoX: ok all is back up, new mr in place and all interfaces are up
[15:23:31] <XioNoX>	 yep, and I'm able to reach it
[15:23:37] <XioNoX>	 checking everything
[15:23:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:23:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:23:42] <icinga-wm>	 PROBLEM - Host ms-be2053.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:23:42] <icinga-wm>	 PROBLEM - Host ms-be2055.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:24:01] <wikibugs>	 (03Merged) 10jenkins-bot: Use structured logging fields for xff logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607389 (owner: 10Reedy)
[15:25:05] <wikibugs>	 (03CR) 10Dzahn: "> Yes eventually all should have v6, I think we're ok to add v6 for new instances and retroactively add v6 to existing hosts later" [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn)
[15:25:40] <logmsgbot>	 !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: structured logging for xff log, stop logging jobrunner requests (duration: 01m 05s)
[15:25:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:56] <akosiaris>	 Pchelolo: there was an interesting increase of 504s while you were deploying https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?panelId=15&fullscreen&orgId=1&from=now-30m&to=now
[15:26:45] <Pchelolo>	 hm... 
[15:26:49] <akosiaris>	 but indeed I see request rates to the various endpoints dropping now
[15:27:02] <akosiaris>	 and errors are no longer around, so \o/
[15:27:20] <mutante>	 i was about to ACK that with the "wikifeeds 3x" ticket.. then it was already recovered 
[15:28:04] <Pchelolo>	 akosiaris: there was an interesting side effect to this that we probably need to mitigate
[15:28:14] <Pchelolo>	 we set cache-control to feeds for 5 mins
[15:28:29] <Pchelolo>	 and it seems like ALL caches are expiring simultaneously
[15:28:33] <Pchelolo>	 creating a spike
[15:28:49] <Pchelolo>	 now we had a lot of vary: headers, so a spike was huge
[15:28:57] <Pchelolo>	 creating 429 on metrics endpoints
[15:29:05] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm
[15:29:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:15] <Pchelolo>	 but even without it, the spike's probably there, just less visible
[15:29:17] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10netops, 10cloud-services-team (Hardware): (Need By: 2020-06-12) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (10Jclark-ctr) @Cmjohnson host are cabled need to be configured.
[15:29:33] <Pchelolo>	 I think I need to add some randomization to cache-control: max-age
[15:29:44] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, take 2 (duration: 06m 38s)
[15:29:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:48] <stashbot>	 T256358: wikifeeds usage increased by 3x on 2020-06-24 - https://phabricator.wikimedia.org/T256358
[15:30:17] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups
[15:30:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:30] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm
[15:30:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:51] <vgutierrez>	 !log upgrade ATS in codfw to version 8.0.8
[15:30:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:05] <wikibugs>	 (03PS1) 10Elukey: Move all analytics timers but RU ones from an-launcher1001 to 1002 [puppet] - 10https://gerrit.wikimedia.org/r/607819 (https://phabricator.wikimedia.org/T256363)
[15:31:07] <Pchelolo>	 it's very hard to deploy restbase during wikifeeds issue, the restbase checks include check to wikifeeds that keeps failing
[15:31:12] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:32:05] <akosiaris>	 ouch.
[15:32:12] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Move all analytics timers but RU ones from an-launcher1001 to 1002 [puppet] - 10https://gerrit.wikimedia.org/r/607819 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey)
[15:32:20] <XioNoX>	 papaul: it all looks good to me, I ran homer which normalized the root password, ssh keys, etc...
[15:32:37] <papaul>	 XioNoX: cool thanks
[15:32:47] <XioNoX>	 papaul: thank you! great work!
[15:32:49] <papaul>	 XioNoX: will start the clean up
[15:32:58] <papaul>	 XioNoX: np
[15:33:08] <akosiaris>	 Pchelolo: it does seem like we are back to normal traffic levels btw. I 'll let it be for today and close the task tomorrow EU morning if everything checks out
[15:33:35] <Pchelolo>	 ok akosiaris. I'll make a little followup with randomization of cache-control for feeds
[15:33:41] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups (duration: 03m 24s)
[15:33:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:46] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups
[15:33:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:08] <wikibugs>	 (03PS1) 10Dzahn: site: add logstash1030/31 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/607821 (https://phabricator.wikimedia.org/T256139)
[15:35:16] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] site: add logstash1030/31 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/607821 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn)
[15:35:23] <wikibugs>	 (03PS2) 10Dzahn: site: add logstash1030/31 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/607821 (https://phabricator.wikimedia.org/T256139)
[15:35:53] <wikibugs>	 (03PS1) 10Ayounsi: New mr1-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/607823
[15:37:13] <wikibugs>	 10Operations, 10observability: Icinga refresh hardware selection (2020) - https://phabricator.wikimedia.org/T251644 (10Jclark-ctr)
[15:37:24] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups (duration: 03m 38s)
[15:37:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:28] <stashbot>	 T256358: wikifeeds usage increased by 3x on 2020-06-24 - https://phabricator.wikimedia.org/T256358
[15:37:47] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups
[15:37:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:21] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] New mr1-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/607823 (owner: 10Ayounsi)
[15:40:38] <wikibugs>	 (03PS2) 10Elukey: Move all analytics timers but RU ones from an-launcher1001 to 1002 [puppet] - 10https://gerrit.wikimedia.org/r/607819 (https://phabricator.wikimedia.org/T256363)
[15:42:56] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups (duration: 05m 09s)
[15:42:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:00] <stashbot>	 T256358: wikifeeds usage increased by 3x on 2020-06-24 - https://phabricator.wikimedia.org/T256358
[15:45:34] <wikibugs>	 (03PS4) 10Dzahn: add logstash2030 and logstash2031 [dns] - 10https://gerrit.wikimedia.org/r/607637 (https://phabricator.wikimedia.org/T256139)
[15:45:36] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] Add recommendation-api helmfile stanzas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov)
[15:46:21] <wikibugs>	 (03CR) 10Dzahn: "> I'm ok either with going ahead with this now and followup with a v6 patch later, or add v6 to this." [dns] - 10https://gerrit.wikimedia.org/r/607637 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn)
[15:46:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:46:34] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[15:47:53] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] logging: Fold Parsoid into type:mediawiki, add 'servergroup:' instead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606038 (https://phabricator.wikimedia.org/T255627) (owner: 10Krinkle)
[15:48:42] <wikibugs>	 (03Merged) 10jenkins-bot: logging: Fold Parsoid into type:mediawiki, add 'servergroup:' instead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606038 (https://phabricator.wikimedia.org/T255627) (owner: 10Krinkle)
[15:49:00] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] add logstash2030 and logstash2031 [dns] - 10https://gerrit.wikimedia.org/r/607637 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn)
[15:49:06] * Krinkle testing on mwdebug1002
[15:49:22] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-fgiunchedi: VM requests for additional Logstash capacity - https://phabricator.wikimedia.org/T256139 (10Dzahn) a:03Dzahn
[15:51:08] <vgutierrez>	 !log upgrade ATS in eqiad to version 8.0.8
[15:51:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:04] <wikibugs>	 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul) Offline script in Netbox will do this  {F31905493}
[15:53:28] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] "Confirmed this yields servergroup:api_appserver on mw1276 and servergroup:appserver on mw1274 and mwdebug1002 as example." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606038 (https://phabricator.wikimedia.org/T255627) (owner: 10Krinkle)
[15:53:56] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm
[15:53:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:53] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm
[15:54:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:15] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/logging.php: I4c519f88c613fc (duration: 01m 05s)
[15:55:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:21] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-fgiunchedi: VM requests for additional Logstash capacity - https://phabricator.wikimedia.org/T256139 (10Dzahn) Creating VM logstash1030.eqiad.wmnet in cluster ganeti01.svc.eqiad.wmnet with row=D vcpus=4 memory=8GB disk=50GB l...
[15:59:22] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[15:59:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:04] <jouncebot>	 godog and _joe_: (Dis)respected human, time to deploy Puppet request window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200625T1600). Please do the needful.
[16:00:53] <mutante>	 nothing in the puppet window
[16:01:31] <mutante>	 stuff got merged anyways without the need for a time slot
[16:02:06] <icinga-wm>	 PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 39, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:02:33] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "Duplicate declaration: Group[jenkins] is already declared at (file: /srv/jenkins-workspace/puppet-compiler/23476/change/src/modules/jenkin" [puppet] - 10https://gerrit.wikimedia.org/r/607645 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn)
[16:03:16] <wikibugs>	 (03PS3) 10Elukey: Move all analytics timers but RU ones from an-launcher1001 to 1002 [puppet] - 10https://gerrit.wikimedia.org/r/607819 (https://phabricator.wikimedia.org/T256363)
[16:03:24] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[16:03:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:27] <Krinkle>	 mutante: could use help with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/606049/ and/or a pointer for who can help instead :)
[16:04:43] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[16:04:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:50] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Move all analytics timers but RU ones from an-launcher1001 to 1002 [puppet] - 10https://gerrit.wikimedia.org/r/607819 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey)
[16:05:46] <icinga-wm>	 RECOVERY - Router interfaces on mr1-codfw is OK: OK: host 208.80.153.196, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:06:10] <moritzm>	 !log installing 4.9.210-1+deb9u1~deb8u1 on jessie hosts (fixed kernel for recent cacheoutattack CPU leaks)
[16:06:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:33] <mutante>	 Krinkle: please add me on Gerrit and i will get to it
[16:07:46] <Krinkle>	 ok :)
[16:09:37] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[16:09:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:32] <wikibugs>	 (03PS1) 10Andrew Bogott: Openstack Nova: enable soft affinity (and soft anti-affinity) server groups [puppet] - 10https://gerrit.wikimedia.org/r/607825 (https://phabricator.wikimedia.org/T253267)
[16:11:07] <wikibugs>	 (03PS8) 10Dzahn: jenkins: replace system user/group with systemd-sysuser [puppet] - 10https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591)
[16:11:25] <wikibugs>	 (03Abandoned) 10Dzahn: admins: add system user for jenkins, reserve UID 903 [puppet] - 10https://gerrit.wikimedia.org/r/607645 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn)
[16:12:05] <wikibugs>	 (03CR) 10Dzahn: "this is now a single change at https://gerrit.wikimedia.org/r/c/operations/puppet/+/606286 to avoid the duplicate declaration issue" [puppet] - 10https://gerrit.wikimedia.org/r/607645 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn)
[16:12:27] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[16:12:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:02] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Openstack Nova: enable soft affinity (and soft anti-affinity) server groups [puppet] - 10https://gerrit.wikimedia.org/r/607825 (https://phabricator.wikimedia.org/T253267) (owner: 10Andrew Bogott)
[16:15:01] <moritzm>	 !log installing libxml2 security updates
[16:15:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:19] <Krinkle>	 !log I've deleted a "saved object" visualisation in logstash called "Production Errors & Deployments" which seemed to be corrupt and redirect random logstash dashboards to a management page. Backed up at https://phabricator.wikimedia.org/P11666 (NDA) 
[16:15:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:27] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[16:16:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:50] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[16:16:53] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[16:16:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:06] <wikibugs>	 (03PS1) 10Andrew Bogott: Openstack Nova: enable soft affinity (and soft anti-affinity) server groups [puppet] - 10https://gerrit.wikimedia.org/r/607827 (https://phabricator.wikimedia.org/T253267)
[16:18:37] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Openstack Nova: enable soft affinity (and soft anti-affinity) server groups [puppet] - 10https://gerrit.wikimedia.org/r/607827 (https://phabricator.wikimedia.org/T253267) (owner: 10Andrew Bogott)
[16:20:28] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/23478/" [puppet] - 10https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn)
[16:20:30] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[16:20:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:06] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Add kubernetes[12]01[56] [homer/public] - 10https://gerrit.wikimedia.org/r/607754 (https://phabricator.wikimedia.org/T256236) (owner: 10Alexandros Kosiaris)
[16:21:33] <wikibugs>	 (03Merged) 10jenkins-bot: Add kubernetes[12]01[56] [homer/public] - 10https://gerrit.wikimedia.org/r/607754 (https://phabricator.wikimedia.org/T256236) (owner: 10Alexandros Kosiaris)
[16:23:42] <wikibugs>	 10Operations, 10LDAP-Access-Requests: NDA for superset access request from WMDE employee danshick - https://phabricator.wikimedia.org/T254442 (10KFrancis) @ema The NDA/MOU has been added to the spreadsheet.  Thanks!
[16:25:20] <wikibugs>	 (03PS1) 10Krinkle: logging: Use 'other' instead of '' as default servergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607829
[16:25:29] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] logging: Use 'other' instead of '' as default servergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607829 (owner: 10Krinkle)
[16:26:21] <wikibugs>	 (03Merged) 10jenkins-bot: logging: Use 'other' instead of '' as default servergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607829 (owner: 10Krinkle)
[16:28:03] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/logging.php: Ia6ef7617d378 (duration: 01m 02s)
[16:28:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:52] <wikibugs>	 (03CR) 10Dzahn: "This is what this does:" [puppet] - 10https://gerrit.wikimedia.org/r/606218 (https://phabricator.wikimedia.org/T255629) (owner: 10Dzahn)
[16:30:04] <wikibugs>	 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul)
[16:30:22] <wikibugs>	 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul) 05Open→03Resolved This is complete
[16:30:49] <wikibugs>	 10Operations, 10ops-codfw, 10netops: codfw: Decommission old mr1 - https://phabricator.wikimedia.org/T256143 (10Papaul)
[16:30:57] <wikibugs>	 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul)
[16:30:59] <wikibugs>	 10Operations, 10ops-codfw, 10netops: codfw: Decommission old mr1 - https://phabricator.wikimedia.org/T256143 (10Papaul) 05Open→03Resolved Complete
[16:31:45] <logmsgbot>	 !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime
[16:31:45] <logmsgbot>	 !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:31:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:42] <wikibugs>	 (03CR) 10Dzahn: "[mwdebug1001:~] $  curl -s --head localhost | grep Server:" [puppet] - 10https://gerrit.wikimedia.org/r/606218 (https://phabricator.wikimedia.org/T255629) (owner: 10Dzahn)
[16:37:03] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:40:18] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[16:40:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:29] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: releases1002.eqiad.wmnet, kubernetes2015.codfw.wmnet, malmok.wikimedia.org, kubernetes2016.codfw.wmnet, releases2002.codfw.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[16:46:33] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:47:26] <mutante>	 i will look at the releases* part of that alert above
[16:48:21] <mutante>	 sukhe: maybe you could check what it is on malmok?
[16:48:45] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:49:25] <sukhe>	 mutante: I scheduled a downtime for it but I should have picked a longer interval. could it be related to that? I don't see anything on the host 
[16:49:49] <volans>	 that's a global check
[16:49:54] <mutante>	 sukhe: the claim is that it changes something on every single puppet run
[16:49:55] <volans>	 doesn't depend on the host on icinga
[16:50:20] <mutante>	 malmok isn't the only host, it's just in the list
[16:50:28] <mutante>	 and at some point it gets globally over threshold
[16:50:45] <mutante>	 yea, unrelated to downtime
[16:51:26] <volans>	 to be clear, so it's not included in the downtime of a host and all their services
[16:51:49] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:52:23] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:52:29] <mutante>	 re: releases* hosts.. it is because the puppet role does not support buster package names yet.. but i see from the Gerrit comments that i can just drop all those package installs. will do that later today
[16:52:34] <sukhe>	 I am actually not sure what to look for and how to help debug, but I am happy to follow instructions 
[16:52:57] <sukhe>	 on malmok, it does say "The last Puppet run was at Wed Jun 24 11:47:43 UTC 2020 (1745 minutes ago)." which is not true of course
[16:53:30] <mutante>	 sukhe: do you want puppet to be disabled right now?
[16:53:56] <mutante>	 1745 min ago is long. yea
[16:54:21] <sukhe>	 mutante: I am not making any changes on the host nor plan to, for today, so you can disable it if required
[16:54:31] <volans>	 sukhe: the quickest thing to look at is puppetboard: https://puppetboard.wikimedia.org/node/malmok.wikimedia.org
[16:54:36] <mutante>	 sukhe: the opposite, i want to run it repeatedly
[16:54:56] <mutante>	 debugging would just mean running it multiple times and see if there is a thing it repeats each time or not
[16:55:04] <volans>	 and look at the last few puppet runs in the bottom-left column 
[16:55:07] <mutante>	 and it's weird that it has that number when it wasn't disabled
[16:55:41] <mutante>	 sukhe: ok, so it's not running because currently puppet code is broken ..not because it was disabled
[16:56:07] <sukhe>	 I see. I am looking to see if I can find out why
[16:56:17] <mutante>	 i guess the icinga check must interpret that in the same way as "changes stuff on each run" for some reason
[16:56:42] <volans>	 sukhe: same error as what we were chatting about yesterday
[16:56:43] <volans>	 https://puppetboard.wikimedia.org/report/malmok.wikimedia.org/8f6768ec616cbce36c8773d7e9c4d53f4918b8fc
[16:56:48] <mutante>	 it seems to be the thing that jbond was debugging yesterday?
[16:56:52] <mutante>	 ack
[16:57:05] <sukhe>	 ahhh thanks volans, I was trying to make the login work
[16:58:09] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:00:02] <volans>	 that sounds like a pre-requisite :-P
[17:00:04] <jouncebot>	 halfak and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200625T1700).
[17:00:32] <sukhe>	 but why is it failing when the change is not in production yet?
[17:01:27] <volans>	 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/profile/manifests/wikidough.pp#4 is there
[17:02:19] <mutante>	 sukhe: it probably failed after https://gerrit.wikimedia.org/r/c/labs/private/+/607769 
[17:04:39] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: Renamed notebook1003 to an-launcher1002 - https://phabricator.wikimedia.org/T256397 (10elukey)
[17:08:41] <mutante>	 nevermind, not labs/private of course. but a change in the actual private repo that may have gone with it
[17:09:12] <sukhe>	 fixing, and yeah, it was in private
[17:11:20] <mutante>	 cool!
[17:11:53] * sukhe waits for the recovery
[17:12:47] <mutante>	 the individual puppet run alert on malmok won't be shown because it happened to be in downtime for other reasons
[17:13:17] <mutante>	 the global alert ..not sure if that gets us under the threshold yet since others have issues too.. but i am looking to fix 2 more
[17:13:56] <sukhe>	 malmok has recovered now. thanks for the alert and help
[17:14:15] <mutante>	 yw, thanks
[17:14:26] <wikibugs>	 (03PS1) 10Dzahn: releases::mediawiki: only install PHP packages if pre-buster [puppet] - 10https://gerrit.wikimedia.org/r/607838 (https://phabricator.wikimedia.org/T247652)
[17:17:11] <wikibugs>	 (03PS1) 10Elukey: Remove hiera specific overrides for an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/607839 (https://phabricator.wikimedia.org/T256363)
[17:17:36] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/23480/" [puppet] - 10https://gerrit.wikimedia.org/r/607838 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn)
[17:18:07] <wikibugs>	 (03CR) 10Dzahn: "hot fix to avoid broken puppet runs that trigger icinga alerts" [puppet] - 10https://gerrit.wikimedia.org/r/607838 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn)
[17:18:37] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001 job=burrow partition={0,1} site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=log
[17:18:37] <icinga-wm>	 pic=All&var-consumer_group=All
[17:20:03] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Remove hiera specific overrides for an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/607839 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey)
[17:21:06] <wikibugs>	 (03CR) 10Dzahn: "quick fix for now: https://gerrit.wikimedia.org/r/c/operations/puppet/+/607838" [puppet] - 10https://gerrit.wikimedia.org/r/607641 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn)
[17:24:00] <wikibugs>	 (03CR) 10Hashar: "> We use the PHP version for docroot hosting for a few things still, don't we? doc1001 is for most things, but there's still… coverage rep" [puppet] - 10https://gerrit.wikimedia.org/r/607641 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn)
[17:25:44] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://phabricator.wikimedia.org/T255629#6257233" [puppet] - 10https://gerrit.wikimedia.org/r/606218 (https://phabricator.wikimedia.org/T255629) (owner: 10Dzahn)
[17:28:03] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:28:23] <wikibugs>	 (03PS2) 10Dzahn: mediawiki::maintenance: add server-header config [puppet] - 10https://gerrit.wikimedia.org/r/606218 (https://phabricator.wikimedia.org/T255629)
[17:30:33] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:32:59] <wikibugs>	 (03CR) 10Dzahn: "affects only mwmaint*, not other mw*" [puppet] - 10https://gerrit.wikimedia.org/r/606218 (https://phabricator.wikimedia.org/T255629) (owner: 10Dzahn)
[17:36:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me, but one comment should be clarified before merging to avoid breaking anything." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn)
[17:37:43] <mutante>	 !log mwmaint1002 - restarted apache2 to add server_headers snippet for T255629 - but not working as expected yet
[17:37:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:20] <stashbot>	 T255629: The "Server: mw•"  response header is missing on mwmaint/noc.wm.o - https://phabricator.wikimedia.org/T255629
[17:40:28] <wikibugs>	 (03CR) 10Dzahn: "applied this and restarted apache2 on mwmaint1002 - but it does not show the difference yet because the security2 modules is not loaded he" [puppet] - 10https://gerrit.wikimedia.org/r/606218 (https://phabricator.wikimedia.org/T255629) (owner: 10Dzahn)
[17:48:22] <wikibugs>	 10Operations, 10Gerrit, 10SRE-Access-Requests: Add qchris to `archiva-deployers` so he can upload artifacts for gerrit deployments - https://phabricator.wikimedia.org/T256404 (10QChris)
[17:48:44] <wikibugs>	 10Operations, 10Gerrit, 10SRE-Access-Requests: Add qchris to `archiva-deployers` so he can upload artifacts for gerrit deployments - https://phabricator.wikimedia.org/T256404 (10QChris)
[17:50:53] <icinga-wm>	 RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[17:53:06] <wikibugs>	 (03PS1) 10Dzahn: mediawiki::maintenance: load mod_security2 also on mwmaint*, not just mw* [puppet] - 10https://gerrit.wikimedia.org/r/607848 (https://phabricator.wikimedia.org/T255629)
[18:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200625T1800).
[18:00:38] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] jenkins: replace system user/group with systemd-sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn)
[18:01:21] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:02:43] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:08:06] <wikibugs>	 (03PS1) 10Urbanecm: Change bnwiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607852 (https://phabricator.wikimedia.org/T255328)
[18:08:16] <wikibugs>	 (03PS1) 10Dzahn: zuul: replace user/group with systemd-sysuser and reserved UID [puppet] - 10https://gerrit.wikimedia.org/r/607853 (https://phabricator.wikimedia.org/T224591)
[18:08:50] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "also added to https://wikitech.wikimedia.org/wiki/UID and a message that one should use the admin module to reserve UIDs now" [puppet] - 10https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn)
[18:09:08] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] zuul: replace user/group with systemd-sysuser and reserved UID [puppet] - 10https://gerrit.wikimedia.org/r/607853 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn)
[18:10:08] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:11:49] <wikibugs>	 (03PS1) 10Dzahn: zuul: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/607854
[18:12:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] zuul: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/607854 (owner: 10Dzahn)
[18:13:23] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:13:57] <wikibugs>	 (03PS2) 10Dzahn: zuul: replace user/group with systemd-sysuser and reserved UID [puppet] - 10https://gerrit.wikimedia.org/r/607853 (https://phabricator.wikimedia.org/T224591)
[18:15:02] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] zuul: replace user/group with systemd-sysuser and reserved UID [puppet] - 10https://gerrit.wikimedia.org/r/607853 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn)
[18:18:05] <wikibugs>	 (03PS1) 10Dzahn: releases::mediawiki: remove PHP packages [puppet] - 10https://gerrit.wikimedia.org/r/607858
[18:23:57] <wikibugs>	 10Operations, 10Gerrit, 10SRE-Access-Requests: Add qchris to `archiva-deployers` so he can upload artifacts for gerrit deployments - https://phabricator.wikimedia.org/T256404 (10Dzahn) p:05Triage→03High
[18:24:20] <wikibugs>	 (03PS2) 10Dzahn: zuul: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/607854
[18:24:55] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] zuul: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/607854 (owner: 10Dzahn)
[18:25:31] <wikibugs>	 10Operations, 10Gerrit, 10SRE-Access-Requests: Add qchris to `archiva-deployers` so he can upload artifacts for gerrit deployments - https://phabricator.wikimedia.org/T256404 (10Dzahn) a:03Dzahn
[18:25:50] <wikibugs>	 10Operations, 10Gerrit, 10LDAP-Access-Requests: Add qchris to `archiva-deployers` so he can upload artifacts for gerrit deployments - https://phabricator.wikimedia.org/T256404 (10Dzahn)
[18:32:50] <wikibugs>	 10Operations, 10ops-eqiad, 10netops: (Need by: 2019-09-30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10Cmjohnson) This is being worked on, I had to put the OS image back on the usb stick.  When I reset the switch to factory default the usb was wiped as well.
[18:34:42] <wikibugs>	 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Cmjohnson) a:05Cmjohnson→03Dzahn @Dzahn  Could you try to image one of these and let me know if you see a setting missed.   I am able to lo...
[18:36:35] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:38:23] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:45:38] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10Halfak)
[18:49:24] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] "matches profile::mediawiki::httpd" [puppet] - 10https://gerrit.wikimedia.org/r/607848 (https://phabricator.wikimedia.org/T255629) (owner: 10Dzahn)
[18:52:40] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10calbon) Public key for prod (different from all other keys):  {F31905662}  Preferred username: calbon
[18:55:15] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=cloud_dev_pdns_rec site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:57:14] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10elukey)
[18:57:43] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10elukey)
[18:58:02] <mutante>	 !log LDAP - added qchris to archiva-deployers (T256404)
[18:58:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:58:07] <stashbot>	 T256404: Add qchris to `archiva-deployers` so he can upload artifacts for gerrit deployments - https://phabricator.wikimedia.org/T256404
[18:58:37] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10elukey) No need for `statistics-privatedata-users`, the group has been decommed :)
[18:58:54] <wikibugs>	 10Operations, 10Gerrit, 10LDAP-Access-Requests: Add qchris to `archiva-deployers` so he can upload artifacts for gerrit deployments - https://phabricator.wikimedia.org/T256404 (10Dzahn) 05Open→03Resolved done. a puppet change was not needed because qchris is existing shell user and gerrit-root
[19:00:04] <jouncebot>	 brennen and hashar: My dear minions, it's time we take the moon! Just kidding. Time for Mediawiki train - American+European Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200625T1900).
[19:01:42] <brennen>	 things seem reasonably calm, proceeding with deploy.
[19:03:20] <wikibugs>	 (03PS1) 10Brennen Bearnes: all wikis to 1.35.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607864
[19:03:22] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.35.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607864 (owner: 10Brennen Bearnes)
[19:03:53] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10Halfak) Aha!  Thanks for the cleanup.
[19:04:11] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607864 (owner: 10Brennen Bearnes)
[19:05:46] <logmsgbot>	 !log brennen@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.38
[19:05:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:40] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10Dzahn) nda and wmf are LDAP groups (that would be a separate phabricator tag, ldap-access-requests) while the other are shell groups.  It's also eithe...
[19:18:07] <wikibugs>	 (03PS4) 10Dzahn: planet: replace system/user group with systemd-sysuser [puppet] - 10https://gerrit.wikimedia.org/r/606287
[19:24:36] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10calbon)
[19:25:43] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10calbon) I don't understand everything Dzahn said but I removed the nda tag, I'm staff.
[19:26:38] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10Dzahn)
[19:27:41] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10Dzahn) @calbon Thanks, it's all good and that was right to do. I added another tag for being added to the LDAP group.  And w...
[19:29:45] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10Dzahn)
[19:30:04] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10Dzahn)
[19:30:43] <dcausse>	 !log repooling wdqs1007.eqiad.wmnet
[19:30:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:03] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10Dzahn) I made some edits to the task description to clarify what is what kind of thing (LDAP / production shell / cloud (aka...
[19:32:04] <wikibugs>	 10Operations, 10observability: improve cron spam visibility - https://phabricator.wikimedia.org/T84845 (10colewhite)
[19:32:46] <wikibugs>	 10Operations, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Dzahn)
[19:32:51] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Dzahn) 05Open→03Resolved Decided with Hash...
[19:32:53] <wikibugs>	 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Dzahn)
[19:34:00] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar) Eventually I wanted to switch back to...
[19:39:06] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "Could not find resource 'User[planet]' in parameter 'require'" [puppet] - 10https://gerrit.wikimedia.org/r/606287 (owner: 10Dzahn)
[19:39:39] <wikibugs>	 (03CR) 10Hashar: "The testsuite is broken, the spec run with the facts from the container OS instead of whatever distribution(s) we target :]" [puppet] - 10https://gerrit.wikimedia.org/r/607854 (owner: 10Dzahn)
[19:40:46] <wikibugs>	 (03PS5) 10Dzahn: planet: replace system/user group with systemd-sysuser [puppet] - 10https://gerrit.wikimedia.org/r/606287
[19:41:06] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] planet: replace system/user group with systemd-sysuser [puppet] - 10https://gerrit.wikimedia.org/r/606287 (owner: 10Dzahn)
[19:41:22] <wikibugs>	 10Operations, 10observability: improve cron spam visibility - https://phabricator.wikimedia.org/T84845 (10colewhite) An option we discussed recently was to ingest mail generated by the servers into Logstash by either pulling events from a mailbox or piping off events at the mail servers.  Once in ES, queries c...
[19:43:51] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM!  Let us know when this is ready for deployment and we'll see it through." [puppet] - 10https://gerrit.wikimedia.org/r/606049 (https://phabricator.wikimedia.org/T255627) (owner: 10Krinkle)
[19:47:51] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "achievement unlocked: "Illegal class reference"" [puppet] - 10https://gerrit.wikimedia.org/r/606287 (owner: 10Dzahn)
[19:49:07] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:51:16] <wikibugs>	 (03PS1) 10Hashar: zuul: set site/initsystem in rspec configuration [puppet] - 10https://gerrit.wikimedia.org/r/607867
[19:52:25] <wikibugs>	 (03CR) 10Hashar: "The spec issue due to a missing initsystem should be fixed by https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/607867/ ." [puppet] - 10https://gerrit.wikimedia.org/r/607853 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn)
[19:52:47] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:53:11] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] zuul: set site/initsystem in rspec configuration [puppet] - 10https://gerrit.wikimedia.org/r/607867 (owner: 10Hashar)
[19:53:45] <wikibugs>	 (03CR) 10Krinkle: "It's ready :) Does today work?" [puppet] - 10https://gerrit.wikimedia.org/r/606049 (https://phabricator.wikimedia.org/T255627) (owner: 10Krinkle)
[19:53:57] <Krinkle>	 shdubsh: ^ :)
[19:54:12] <shdubsh>	 ack!
[19:54:37] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] zuul: set site/initsystem in rspec configuration [puppet] - 10https://gerrit.wikimedia.org/r/607867 (owner: 10Hashar)
[19:55:15] <wikibugs>	 (03CR) 10Hashar: "we can dish out contint::composer since it needs php anyway ;)  That is also one step toward stopping using that way of installing compose" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607858 (owner: 10Dzahn)
[19:55:36] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] mediawiki,logstash: Update type:parsoid-php -> type:mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/606049 (https://phabricator.wikimedia.org/T255627) (owner: 10Krinkle)
[19:57:43] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] mediawiki,logstash: Update type:parsoid-php -> type:mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/606049 (https://phabricator.wikimedia.org/T255627) (owner: 10Krinkle)
[20:00:34] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] planet: replace system/user group with systemd-sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606287 (owner: 10Dzahn)
[20:02:04] <wikibugs>	 (03PS6) 10Dzahn: planet: replace system/user group with systemd-sysuser [puppet] - 10https://gerrit.wikimedia.org/r/606287
[20:02:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] planet: replace system/user group with systemd-sysuser [puppet] - 10https://gerrit.wikimedia.org/r/606287 (owner: 10Dzahn)
[20:13:32] <wikibugs>	 10Operations: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (10Dzahn)
[20:14:48] <wikibugs>	 (03PS1) 10Dzahn: DHCP: add logstash2030, logstash2031 [puppet] - 10https://gerrit.wikimedia.org/r/607872 (https://phabricator.wikimedia.org/T256139)
[20:15:56] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] DHCP: add logstash2030, logstash2031 [puppet] - 10https://gerrit.wikimedia.org/r/607872 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn)
[20:23:13] <wikibugs>	 10Operations: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (10Dzahn) recently fixed:  icinga: https://gerrit.wikimedia.org/r/c/operations/puppet/+/606730 codesearch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/606735 dumps: https://gerrit.wikimedia.org/r...
[20:25:30] <wikibugs>	 10Operations: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (10Dzahn)
[20:26:01] <wikibugs>	 10Operations: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (10Dzahn) modules that still have a ferm::service as of today:  acme_chief aptly base phabricator prometheus rsync scap service udp2log   added check boxes
[20:31:15] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[20:31:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:48] <wikibugs>	 (03PS1) 10Dzahn: partman: add logstash103[0-1] and logstash2003[0-1] [puppet] - 10https://gerrit.wikimedia.org/r/607875 (https://phabricator.wikimedia.org/T256139)
[20:35:22] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] partman: add logstash103[0-1] and logstash2003[0-1] [puppet] - 10https://gerrit.wikimedia.org/r/607875 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn)
[20:40:11] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install <hadoop testing nodes> - https://phabricator.wikimedia.org/T255520 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson
[20:40:51] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install 3 lightweight hadoop nodes - https://phabricator.wikimedia.org/T255518 (10Jclark-ctr) a:05elukey→03Cmjohnson
[20:41:25] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install 3 lightweight hadoop nodes - https://phabricator.wikimedia.org/T255518 (10Jclark-ctr) added to netbox
[20:42:00] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install <hadoop testing nodes> - https://phabricator.wikimedia.org/T255520 (10Jclark-ctr) added to netbox
[20:42:17] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 38.82 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:42:31] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install <hadoop testing nodes> - https://phabricator.wikimedia.org/T255520 (10Jclark-ctr)
[20:43:19] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install 3 lightweight hadoop nodes - https://phabricator.wikimedia.org/T255518 (10Jclark-ctr)
[20:46:47] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10netops, 10cloud-services-team (Hardware): (Need By: 2020-06-12) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (10Jclark-ctr) @ayounsi switches are cabled and powered waiting on configuration
[20:50:53] <icinga-wm>	 PROBLEM - HTTPS-dbtree on dbmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[20:52:33] <icinga-wm>	 RECOVERY - HTTPS-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 92408 bytes in 1.756 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[20:54:19] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[20:54:57] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:57:41] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[20:57:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:19] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Wolfgang Kandek - https://phabricator.wikimedia.org/T249352 (10Dzahn)
[21:31:38] <mutante>	 @seen andre_
[21:31:38] <wm-bot>	 mutante: Last time I saw andre_ they were leaving the channel #wikibooks-es at 3/20/2020 1:19:55 PM (97d8h11m42s ago)
[21:31:46] <mutante>	 @seen andre__
[21:31:46] <wm-bot>	 mutante: Last time I saw andre__ they were quitting the network with reason: Quit: Out. N/A at 6/19/2020 5:55:09 PM (6d3h36m37s ago)
[21:42:14] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Wolfgang Kandek - https://phabricator.wikimedia.org/T249352 (10CDanis)
[21:57:51] <wikibugs>	 (03PS2) 10Dave Pifke: [WIP] arclamp: Deploy from scap [puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109)
[22:00:37] <icinga-wm>	 PROBLEM - Host kubestagetcd1004 is DOWN: PING CRITICAL - Packet loss = 100%
[22:12:36] <wikibugs>	 (03PS3) 10Dave Pifke: [WIP] arclamp: Deploy from scap [puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109)
[22:25:50] <mutante>	 !log puppetmaster - signing certs and initial run for logstash2030/2031 - no prod role yet
[22:25:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:27:23] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[22:29:03] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[22:33:45] <wikibugs>	 (03PS1) 10Dzahn: site/DHCP: add logstash[1]203[12] [puppet] - 10https://gerrit.wikimedia.org/r/607895 (https://phabricator.wikimedia.org/T256139)
[22:37:02] <wikibugs>	 (03PS2) 10Dzahn: site/DHCP: add logstash[1]203[12] [puppet] - 10https://gerrit.wikimedia.org/r/607895 (https://phabricator.wikimedia.org/T256139)
[22:41:14] <wikibugs>	 (03PS3) 10Dzahn: site/DHCP: add logstash[12]03[01] [puppet] - 10https://gerrit.wikimedia.org/r/607895 (https://phabricator.wikimedia.org/T256139)
[22:46:41] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (10Edtadros)
[22:49:25] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:50:24] <wikibugs>	 Operations, ops-codfw, SRE-swift-storage: 3 ms-be mgmt interfaces not back after mgmt switch maintenance - https://phabricator.wikimedia.org/T256436 (Dzahn)
[22:51:22] <wikibugs>	 Operations, ops-codfw, SRE-swift-storage: 3 ms-be mgmt interfaces not back after mgmt switch maintenance - https://phabricator.wikimedia.org/T256436 (Dzahn) {F31905905}
[22:51:48] <wikibugs>	 Operations, ops-codfw, SRE-swift-storage: 3 ms-be mgmt interfaces not back after mgmt switch maintenance - https://phabricator.wikimedia.org/T256436 (Dzahn) p:Triage→Medium
[22:52:12] <icinga-wm>	 ACKNOWLEDGEMENT - Host ms-be2051.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T256436
[22:52:12] <icinga-wm>	 ACKNOWLEDGEMENT - Host ms-be2053.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T256436
[22:52:12] <icinga-wm>	 ACKNOWLEDGEMENT - Host ms-be2055.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T256436
[22:53:43] <wikibugs>	 Operations, ops-codfw, SRE-swift-storage: 3 ms-be mgmt interfaces not back after mgmt switch maintenance - https://phabricator.wikimedia.org/T256436 (Papaul) a:Papaul
[22:54:11] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[22:54:57] <icinga-wm>	 ACKNOWLEDGEMENT - Host kubestagetcd1004 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T244530#6256018
[22:56:08] <wikibugs>	 Operations, ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (Dzahn) kubestagetcd1004 is still down. not sure if desired. ACKed in Icinga.
[22:57:11] <wikibugs>	 (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/23485/" [puppet] - https://gerrit.wikimedia.org/r/607895 (https://phabricator.wikimedia.org/T256139) (owner: Dzahn)
[22:59:35] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200625T2300).
[23:04:59] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:26:31] <wikibugs>	 (PS4) Dave Pifke: [WIP] arclamp: Deploy from scap [puppet] - https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109)
[23:31:17] <wikibugs>	 (PS5) Dave Pifke: [WIP] arclamp: Deploy from scap [puppet] - https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109)
[23:34:15] <wikibugs>	 (PS6) Dave Pifke: [WIP] arclamp: Deploy from scap [puppet] - https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109)
[23:37:24] <mutante>	 !log puppetmaster - signing certs and initial puppet run for logstash1030/logstash1031 - no prod role yet
[23:37:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:43:48] <wikibugs>	 Operations, Wikimedia-Logstash, observability, User-fgiunchedi: VM requests for additional Logstash capacity - https://phabricator.wikimedia.org/T256139 (Dzahn) @fgiunchedi   4 VMs have been created. OS has been installed.   They have been added to puppet with the "insetup" role.  IPv6 records ha...
[23:46:55] <wikibugs>	 (CR) Ryan Kemper: [C: +2] Consolidate query_service profile duplication [puppet] - https://gerrit.wikimedia.org/r/599146 (owner: EBernhardson)
[23:47:28] <wikibugs>	 (CR) Ryan Kemper: [C: +2] "Sorry for the delay in getting to this. PCC looks good: https://puppet-compiler.wmflabs.org/compiler1003/23486/wdqs1006.eqiad.wmnet/index." [puppet] - https://gerrit.wikimedia.org/r/599146 (owner: EBernhardson)
[23:52:43] <wikibugs>	 Operations, Wikimedia-Logstash, observability, serviceops, User-fgiunchedi: VM requests for additional Logstash capacity - https://phabricator.wikimedia.org/T256139 (Dzahn)
[23:53:00] <wikibugs>	 Operations, Wikimedia-Logstash, observability, User-fgiunchedi: move 4 new logstash VMs into production - https://phabricator.wikimedia.org/T256443 (Dzahn)
[23:54:46] <wikibugs>	 Operations, Wikimedia-Logstash, observability, User-fgiunchedi: move 4 new logstash VMs into production - https://phabricator.wikimedia.org/T256443 (Dzahn)
[23:55:22] <wikibugs>	 Operations, Wikimedia-Logstash, observability, User-fgiunchedi: move 4 new logstash VMs into production - https://phabricator.wikimedia.org/T256443 (Dzahn)
[23:55:36] <wikibugs>	 Operations, Wikimedia-Logstash, observability, serviceops, User-fgiunchedi: VM requests for additional Logstash capacity - https://phabricator.wikimedia.org/T256139 (Dzahn)
[23:55:41] <wikibugs>	 Operations, Wikimedia-Logstash, observability, User-fgiunchedi: move 4 new logstash VMs into production - https://phabricator.wikimedia.org/T256443 (Dzahn)
[23:55:43] <wikibugs>	 Operations, Wikimedia-Logstash, observability, serviceops, User-fgiunchedi: VM requests for additional Logstash capacity - https://phabricator.wikimedia.org/T256139 (Dzahn) Open→Resolved
[23:55:47] <wikibugs>	 Operations, Wikimedia-Logstash, observability, Patch-For-Review, User-fgiunchedi: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 (Dzahn)
[23:58:05] <wikibugs>	 (CR) Krinkle: [WIP] arclamp: Deploy from scap (1 comment) [puppet] - https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) (owner: Dave Pifke)
[23:58:12] <wikibugs>	 Operations, Wikimedia-Logstash, observability, Patch-For-Review, User-fgiunchedi: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 (Dzahn)
[23:58:14] <wikibugs>	 Operations, Wikimedia-Logstash, observability, User-fgiunchedi: move 4 new logstash VMs into production - https://phabricator.wikimedia.org/T256443 (Dzahn)
[23:58:56] <wikibugs>	 (CR) Ryan Kemper: [C: +2] "PCC looks good." [puppet] - https://gerrit.wikimedia.org/r/607542 (https://phabricator.wikimedia.org/T252913) (owner: Herron)
[23:59:21] <wikibugs>	 Operations, Wikimedia-Logstash, observability, Patch-For-Review, User-fgiunchedi: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 (Dzahn) There are 4 new ganeti VMs now, 2 in eqiad and 2 in codfw, in row D each.  They are ready to be taken into product...