[00:00:04] twentyafterfour: It is that lovely time of the day again! You are hereby commanded to deploy Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200625T0000).
[00:00:41] (CR) Volans: [C: +2] scripts: unset the face too in the offline script [software/netbox-extras] - https://gerrit.wikimedia.org/r/607644 (owner: Volans)
[00:02:13] (PS6) Dzahn: jenkins: replace system user/group with systemd-sysuser [puppet] - https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591)
[00:02:38] (CR) Dzahn: jenkins: replace system user/group with systemd-sysuser (2 comments) [puppet] - https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn)
[00:03:44] (CR) DannyS712: "Can the changes (or at least of highlight) be noted in the commit message?" [mediawiki-config] - https://gerrit.wikimedia.org/r/607605 (owner: Reedy)
[00:03:46] (PS7) Dzahn: jenkins: replace system user/group with systemd-sysuser [puppet] - https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591)
[00:05:25] (CR) Dzahn: [C: -1] "just like https://gerrit.wikimedia.org/r/c/operations/puppet/+/606287 this still fails in general and needs Chris Danis' fix at https://g" [puppet] - https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn)
[00:05:32] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[00:07:12] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[00:07:39] these sometimes happen for a very short time, i just rescheduled the icinga check
[00:09:59] Operations, Analytics, netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (faidon) To add to the above, I'm also wondering how difficult it would be to also include AS *names*, e.g. coming from the MaxMind GeoIP ASN database. I think we've use...
[00:10:47] (CR) Dzahn: Add initial puppetization for libraryupgrader (1 comment) [puppet] - https://gerrit.wikimedia.org/r/607452 (https://phabricator.wikimedia.org/T173478) (owner: Legoktm)
[00:11:20] (CR) Dzahn: Add initial puppetization for libraryupgrader (1 comment) [puppet] - https://gerrit.wikimedia.org/r/607452 (https://phabricator.wikimedia.org/T173478) (owner: Legoktm)
[00:11:27] !log updating phabricator to release/2020-06-25/1, momentary (<1 minute) downtime expected.
[00:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:18] !log phabricator updated, all seems normal
[00:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:42] thanks twentyafterfour, looks normal indeed
[00:12:49] mutante: if you think the systemd-sysuser thing will be fixed soon I don't mind waiting a bit, otherwise I can switch it back
[00:14:44] legoktm: uhm.. switch it back and i will replace it again later
[00:14:55] let's merge that
[00:15:13] ok
[00:18:32] legoktm: is it possible to define a source for connections to port 3002 or does it need to be from anywhere?
[00:18:33] (PS4) Legoktm: Add initial puppetization for libraryupgrader [puppet] - https://gerrit.wikimedia.org/r/607452 (https://phabricator.wikimedia.org/T173478)
[00:18:35] (PS1) Legoktm: libraryupgrader: Switch to systemd-sysusers [puppet] - https://gerrit.wikimedia.org/r/607646
[00:19:06] mutante: the source would be the cloud dynamic-proxy
[00:20:29] legoktm: ah, we just did the same for phab in cloud. we can use $CACHES
[00:20:43] i can add that in another change
[00:22:23] ok
[00:22:33] we can do the same for codesearch too then
[00:25:30] legoktm: first i meant this kind of thing: https://gerrit.wikimedia.org/r/c/operations/puppet/+/606255 but it only becomes interesting once we are actually in both prod and cloud.. so it's easier..amending to yours
[00:27:09] (PS5) Dzahn: Add initial puppetization for libraryupgrader [puppet] - https://gerrit.wikimedia.org/r/607452 (https://phabricator.wikimedia.org/T173478) (owner: Legoktm)
[00:27:17] mutante: ah, gotcha. fwiw port 3002 is arbitrary, we can switch to something more standard if that's easier. I just picked that because it's unprivileged and I already kept it open on my laptop
[00:29:04] (PS6) Dzahn: Add initial puppetization for libraryupgrader [puppet] - https://gerrit.wikimedia.org/r/607452 (https://phabricator.wikimedia.org/T173478) (owner: Legoktm)
[00:29:58] legoktm: 3002 is fine. it's not anything well-known and doesn't really matter for the code
[00:31:07] so i am doing simply srange => '$CACHES' and it should work
[00:31:33] (CR) Dzahn: [C: +2] Add initial puppetization for libraryupgrader [puppet] - https://gerrit.wikimedia.org/r/607452 (https://phabricator.wikimedia.org/T173478) (owner: Legoktm)
[00:32:35] legoktm: do you want to add it on codesearch6, should I? later?
[00:33:00] you mean upgrader07? I can do that in a few minutes
[00:33:43] heh, yea:)
[00:33:55] ok
[00:35:19] (CR) Dzahn: "PS6: limited source range for connections to port 3002 to $CACHES." [puppet] - https://gerrit.wikimedia.org/r/607452 (https://phabricator.wikimedia.org/T173478) (owner: Legoktm)
[00:35:44] ACCEPT tcp -- proxy-01.project-proxy.eqiad1.wikimedia.cloud anywhere tcp dpt:http
[00:35:47] ACCEPT tcp -- proxy-02.project-proxy.eqiad1.wikimedia.cloud anywhere tcp dpt:http
[00:36:09] legoktm: ^ this is what i am expecting you should see in the end in iptables -L
[00:36:29] that is looking at a phab-in-cloud instance where we used $CACHES as well
[00:37:50] (PS1) Dzahn: codesearch: limit connections to port3002 to $CACHES [puppet] - https://gerrit.wikimedia.org/r/607647
[00:38:28] (PS2) Dzahn: codesearch: limit connections to port 3002 to $CACHES [puppet] - https://gerrit.wikimedia.org/r/607647
[00:39:21] (PS2) Ssingh: prometheus: update scheme for wikidough (improves ab8a948a) [puppet] - https://gerrit.wikimedia.org/r/607570
[00:41:36] (CR) Ssingh: "No code change for patch set 2; I have just updated the commit message." [puppet] - https://gerrit.wikimedia.org/r/607570 (owner: Ssingh)
[00:43:12] mutante: Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'profile::libraryupgrader::base_dir' (file: /etc/puppet/modules/profile/manifests/libraryupgrader.pp, line: 1) on node upgrader-07.library-upgrader.eqiad.wmflabs
[00:43:42] twentyafterfour: https://phabricator.wikimedia.org/T256343 is that after the update?
[00:43:57] Source file "/srv/deployment/phabricator/deployment-cache/revs/4547f31de8f69854e0cd9d3e0a802ce517360ee0/phabricator/src/applications/legalpad/conduit/LegalpadSignatureSearchConduitAPIMethod.php" failed to load.
[00:44:31] Reedy: looks like it
[00:44:34] fixing...
[00:45:10] legoktm: uhm.. is the name of the project in horizon actually "libraryupgrader" ?
[00:45:21] oh uh, no
[00:45:24] it's library-upgrader
[00:45:44] we gotta move the yaml file around then
[00:46:10] hieradata/cloud/eqiad1/$project_name/common.yaml
[00:46:25] ohhh, my bad. I'll submit a patch
[00:48:23] (PS1) Legoktm: Fix location of library-upgrader hieradata [puppet] - https://gerrit.wikimedia.org/r/607648
[00:48:29] mutante: ^
[00:48:40] !log restart php-fpm on phab1001 to fix T256343
[00:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:45] T256343: Unable to Access the Repository - https://phabricator.wikimedia.org/source/tool-commons-android-app/ - https://phabricator.wikimedia.org/T256343
[00:48:56] (CR) Dzahn: [C: +2] Fix location of library-upgrader hieradata [puppet] - https://gerrit.wikimedia.org/r/607648 (owner: Legoktm)
[00:49:45] legoktm: synced on prod puppetmasters
[00:52:30] puppet is running!
[00:54:53] great
[00:55:22] legoktm: when it's done, do an 'iptables -L | grep http' or so
[00:56:19] legoktm@upgrader-07:~$ iptables -L | grep http
[00:56:19] -bash: iptables: command not found
[00:56:32] there's an iptables-xml ?
[00:57:45] oh
[00:57:47] it's in sbin
[00:57:48] legoktm: as root
[00:57:58] well, or that
[00:58:29] uh, no output under http but
[00:58:36] root@upgrader-07:~# iptables -L | grep 3002
[00:58:36] ACCEPT tcp -- proxy-01.project-proxy.eqiad1.wikimedia.cloud anywhere tcp dpt:3002
[00:58:36] ACCEPT tcp -- proxy-02.project-proxy.eqiad1.wikimedia.cloud anywhere tcp dpt:3002
[00:58:36] ACCEPT tcp -- deployment-cache-text06.deployment-prep.eqiad1.wikimedia.cloud anywhere tcp dpt:3002
[00:58:39] sorry, 3002
[00:59:08] there are the 2 proxies and that one deployment-cache server for some reason
[00:59:15] but this is what i expected, yep
[00:59:19] it should work
[00:59:56] that is what $CACHES means in cloud, while it means "all the cp* servers" in prod
[01:00:21] there doesn't have to be a $realm check or anything this way
[01:00:23] awesome :D next step for libraryupgrader is to migrate the systemd units over to puppet, I'll probably spend some time tonight on that
[01:00:33] (CR) Legoktm: [C: +1] codesearch: limit connections to port 3002 to $CACHES [puppet] - https://gerrit.wikimedia.org/r/607647 (owner: Dzahn)
[01:00:49] (CR) Dzahn: [C: +2] codesearch: limit connections to port 3002 to $CACHES [puppet] - https://gerrit.wikimedia.org/r/607647 (owner: Dzahn)
[01:02:50] mutante: thank you for all the help so far :)
[01:03:43] legoktm: you're welcome, talk to you later then
[01:26:30] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 24038976 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:28:16] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 783376 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:33:01] Operations, Commons, MediaWiki-File-management, Traffic: Cached thumbnails and originals are sometimes not being purged correctly/quickly - https://phabricator.wikimedia.org/T256313 (King_of_Hearts) Yet another one, currently still broken as of time of writing: https://upload.wikimedia.org/wikipe...
[01:41:26] Operations, Commons, MediaWiki-File-management, SRE-swift-storage, Traffic: Cached thumbnails and originals are sometimes not being purged correctly/quickly - https://phabricator.wikimedia.org/T256313 (Reedy)
[03:08:33] (CR) Bmansurov: "> May I ask what the configuration changes were? We probably need to amend the chart to account for those." [deployment-charts] - https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: Bmansurov)
[04:17:29] (PS1) Marostegui: db2120: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/607660
[04:22:13] (CR) Marostegui: [C: +2] db2120: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/607660 (owner: Marostegui)
[04:25:23] !log Deploy schema change on s2 codfw - T238966
[04:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:25:28] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966
[04:26:04] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 2 (releases1002, ...), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:26:29] !log Remove triggers from db2095:3312 - T238966
[04:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:37:12] (PS4) Marostegui: mariadb: Promote db1097 to m1 master [puppet] - https://gerrit.wikimedia.org/r/606953 (https://phabricator.wikimedia.org/T254556)
[04:57:59] Operations, DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Marostegui) Anything else left here after the 100% repool or we can close this? Thank you!
[05:25:32] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 52 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:29:52] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL:
[05:29:52] most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[05:30:52] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:35:24] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[05:45:56] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:51:14] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:56:01] (CR) Elukey: hadoop - Add change-distro.py and stop-cluster.py (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) (owner: Elukey)
[05:59:05] (PS11) Elukey: hadoop - Add change-distro.py and stop-cluster.py [cookbooks] - https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499)
[05:59:14] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL:
[05:59:15] most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:03:47] !log reboot an-airflow1001 for kernel upgrades
[06:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:42] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:12:26] PROBLEM - Check systemd state on an-airflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:14:52] ifup for ens5 fails - RTNETLINK answers: File exists
[06:15:28] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL:
[06:15:28] most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:17:16] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:19:34] RECOVERY - Check systemd state on an-airflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:19:55] !log execute ip addr flush ens5 on an-airflow1001 to clear RTNETLINK answers: File exists (error from ifup@ens5.service)
[06:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:28] !log reboot analytics-tool1001 for kernel upgrades
[06:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:54] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Scheduled maintenance TTN-0004068701 - The acknowledgement expires at: 2020-06-25 10:22:27. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:22:54] ACKNOWLEDGEMENT - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Scheduled maintenance TTN-0004068701 - The acknowledgement expires at: 2020-06-25 10:22:27. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:23:16] !log reboot analytics-tool1004 for kernel upgrades (Superset host)
[06:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:40] !log reboot an-tool* vms for kernel upgrades
[06:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:08] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL:
[06:28:08] most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:28:14] PROBLEM - Check systemd state on an-tool1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:29:14] PROBLEM - Check size of conntrack table on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:29:26] PROBLEM - Check size of conntrack table on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:29:40] PROBLEM - ores uWSGI web app on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores
[06:30:08] PROBLEM - Check whether ferm is active by checking the default input chain on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[06:30:08] PROBLEM - ores uWSGI web app on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores
[06:30:14] PROBLEM - Check systemd state on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:30:31] ah this is ores dying for the logrotate
[06:30:44] PROBLEM - Check systemd state on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:31:21] !log force puppet run on ores1003/1005 to restore celery (killed by the oom)
[06:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:40] PROBLEM - puppet last run on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:32:00] RECOVERY - Check systemd state on ores1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:32:32] RECOVERY - Check systemd state on ores1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:32:50] RECOVERY - Check size of conntrack table on ores1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:33:00] RECOVERY - Check size of conntrack table on ores1005 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:34:32] !log reboot archiva for kernel upgrades
[06:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:04] !log reboot archiva1002 (new vm, not yet in service) for kernel upgrades
[06:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:08] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:37:28] RECOVERY - puppet last run on ores1003 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:40:44] !log reboot matomo1002 for kernel upgrades
[06:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:10] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL:
[06:46:10] most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:49:44] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:49:50] (CR) Elukey: [C: -1] hadoop - Add change-distro.py and stop-cluster.py (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) (owner: Elukey)
[06:55:24] (PS12) Elukey: hadoop - Add change-distro.py and stop-cluster.py [cookbooks] - https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499)
[06:56:18] (CR) jerkins-bot: [V: -1] hadoop - Add change-distro.py and stop-cluster.py [cookbooks] - https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) (owner: Elukey)
[06:57:02] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL:
[06:57:02] most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[07:00:54] RECOVERY - Check whether ferm is active by checking the default input chain on ores1005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:02:51] (PS13) Elukey: hadoop - Add change-distro.py and stop-cluster.py [cookbooks] - https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499)
[07:03:16] Operations, Commons, MediaWiki-File-management, SRE-swift-storage, Traffic: Cached thumbnails and originals are sometimes not being purged correctly/quickly - https://phabricator.wikimedia.org/T256313 (Aklapper)
[07:07:58] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[07:08:50] !log Start pre switchover steps on m1 T254556
[07:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:55] T254556: Upgrade m1 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254556
[07:14:25] (CR) Elukey: [C: +2] "Time to test!" [cookbooks] - https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) (owner: Elukey)
[07:15:12] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL:
[07:15:12] most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[07:17:16] Operations, Continuous-Integration-Infrastructure, serviceops, Patch-For-Review: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (hashar) The server with `/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java` has been started by our systemd unit a...
[07:18:10] !log reboot kafkamon* vms for kernel upgrades
[07:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:21] this may generate some kafka lag alerts, hopefully not --^
[07:30:25] (CR) Marostegui: mariadb: Promote db1097 to m1 master [puppet] - https://gerrit.wikimedia.org/r/606953 (https://phabricator.wikimedia.org/T254556) (owner: Marostegui)
[07:30:40] (CR) Marostegui: [C: +2] mariadb: Promote db1097 to m1 master [puppet] - https://gerrit.wikimedia.org/r/606953 (https://phabricator.wikimedia.org/T254556) (owner: Marostegui)
[07:32:37] (PS1) Elukey: camus: use refinery-camus-0.128 [puppet] - https://gerrit.wikimedia.org/r/607717
[07:32:49] (CR) Alexandros Kosiaris: [C: +2] Introduce kubernetes[12]01[56] [dns] - https://gerrit.wikimedia.org/r/607495 (https://phabricator.wikimedia.org/T256236) (owner: Alexandros Kosiaris)
[07:32:54] (PS4) Alexandros Kosiaris: Introduce kubernetes[12]01[56] [dns] - https://gerrit.wikimedia.org/r/607495 (https://phabricator.wikimedia.org/T256236)
[07:33:36] (CR) Joal: [C: +1] "LGTM! Thanks elukey" [puppet] - https://gerrit.wikimedia.org/r/607717 (owner: Elukey)
[07:36:25] !log reboot an-launcher1001 for kernel upgrades
[07:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:50] (CR) Elukey: [C: +2] camus: use refinery-camus-0.128 [puppet] - https://gerrit.wikimedia.org/r/607717 (owner: Elukey)
[07:37:06] RECOVERY - WDQS high update lag on wdqs1007 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.158e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:38:48] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001 job=burrow partition={0,1,2,4,5} site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+
[07:38:48] r-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[07:39:53] Operations, DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat) Open→Resolved Nope, all done!
[07:41:22] so the lag seemed starting before I rebooted kafkamon
[07:41:42] Cc godog
[07:41:54] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:42:56] (CR) Ayounsi: cumin: backup all of /srv where a lot of deployment state may live (1 comment) [puppet] - https://gerrit.wikimedia.org/r/607258 (owner: Jcrespo)
[07:47:26] (CR) Hashar: "We should remove the PHP packages from the releases hosts." [puppet] - https://gerrit.wikimedia.org/r/607641 (https://phabricator.wikimedia.org/T247652) (owner: Dzahn)
[07:47:44] (CR) Jcrespo: "So this is my plan based on the feedback:" [puppet] - https://gerrit.wikimedia.org/r/607258 (owner: Jcrespo)
[07:47:58] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm
[07:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:27] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[07:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:36] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm
[07:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:01] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[07:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:09] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm
[07:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:24] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[07:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:32] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm
[07:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:40] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[07:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:59] how do i page traffic to say that akosiaris is DoSing this channel? ;)
[07:51:04] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[07:51:09] kormat: lol
[07:51:20] akosiaris: anything spicerack-related?
[07:52:08] !log stop bacula-director on backup1001 for db maintenance T254556
[07:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:12] T254556: Upgrade m1 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254556
[07:54:44] PROBLEM - bacula director process on backup1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (bacula), command name bacula-dir https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:56:28] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL:
[07:56:28] most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[07:56:44] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[07:57:23] Operations, Continuous-Integration-Infrastructure, serviceops, Patch-For-Review: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (jcrespo) This is bacula when trying to backup releases2002: `lines=10 25-Jun 04:05 backup1001.eqiad.wmnet JobId...
[07:58:34] the bacula alerts is me, see log
[07:58:37] will ack it
[08:00:04] marostegui, jynus, and akosiaris: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for m1 database master failover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200625T0800).
[08:00:04] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[08:00:05] (PS1) Muehlenhoff: Extend access for PHPCC people [puppet] - https://gerrit.wikimedia.org/r/607720
[08:00:27] jynus akosiaris let's go?
[08:00:34] I am here
[08:00:49] jynus: is the "Prometheus jobs reduced availability" alert for job={bacula,wikidough} site={codfw,eqiad} also related?
[08:00:55] yeah
[08:01:00] ack, thanks
[08:01:03] will check it later
[08:01:29] akosiaris ok from your side to go ahead?
[08:02:15] ema: although that has been happening for 15 hours so maybe not
[08:02:57] marostegui: yes
[08:03:03] ok, let's start then
[08:03:05] !log Failover m1 from db1135 to db1097 - T254556
[08:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:09] T254556: Upgrade m1 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254556
[08:03:16] (CR) Muehlenhoff: [C: +2] Extend access for PHPCC people [puppet] - https://gerrit.wikimedia.org/r/607720 (owner: Muehlenhoff)
[08:03:52] etherpad: upstream connect error or disconnect/reset before headers. reset reason: connection failure
[08:03:59] all done
[08:04:17] etherpad logs show that things proceed as normal...
[08:04:23] ah no
[08:04:27] it just started logging exceptions
[08:04:28] mmm, one sec
[08:04:31] An error occurred Please press and hold Ctrl and press F5 to reload this page
[08:04:38] yes, something failed
[08:04:39] checking
[08:04:54] I'll wait it out 30s before restarting it
[08:05:01] should be good now
[08:05:18] etherpad still says upstream connect error or disconnect/reset before headers. reset reason: connection failure
[08:05:22] (CR) Filippo Giunchedi: [C: +1] prometheus: update scheme for wikidough (improves ab8a948a) [puppet] - https://gerrit.wikimedia.org/r/607570 (owner: Ssingh)
[08:05:25] [2020-06-25 08:04:23.327] [ERROR] console - ERROR: Problem while initalizing the database
[08:05:26] jynus: I think it says it's been happening for 15 hours because that's when the wikidough issues started, while bacula started having issues only at 7:53ish
[08:05:39] ema: makes sense
[08:05:51] no activity on the logs
[08:05:55] marostegui: what is the status mysql-wise
[08:06:00] it is all done
[08:06:02] everything looking good on db and proxy?
[08:06:06] yeo
[08:06:08] yep
[08:06:14] ok, so it is app side now, akosiaris
[08:06:36] most likely the persistent connections
[08:06:38] ok that means a restart is needed. It did restart on its own, but that for some reason did not help
[08:06:46] 2m17s ago
[08:06:49] jynus: I have killed connections on db1135
[08:07:05] yeah, but app logic is sometimes strange
[08:07:11] yep
[08:07:14] oh it's definitely in the app
[08:07:20] let's confirm it works after app restart
[08:07:23] I would be surprised if it wasn't
[08:07:26] it could be something unexpected
[08:07:30] PROBLEM - etherpad_up reduced availability on icinga1001 is CRITICAL: 0 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:07:30] * akosiaris investigating a bit before restarting
[08:07:31] akosiaris: XD
[08:07:45] marostegui: get screen captures of everything done
[08:08:11] what other things are on m1?
[08:08:12] hmm, it's not the app
[08:08:17] it's dead alright
[08:08:20] (PS1) Muehlenhoff: Remove access for nathante [puppet] - https://gerrit.wikimedia.org/r/607722
[08:08:21] systemd isn't restarting it
[08:08:24] akosiaris: what does the error say?
[08:08:26] PROBLEM - Check systemd state on etherpad1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:08:27] ok we found something [08:08:35] Active: failed (Result: exit-code) since Thu 2020-06-25 08:04:23 UTC; 3min 17s ago [08:08:40] librenms does work [08:08:54] PROBLEM - etherpad_lite_process_running on etherpad1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [08:09:12] (03CR) 10jerkins-bot: [V: 04-1] Remove access for nathante [puppet] - 10https://gerrit.wikimedia.org/r/607722 (owner: 10Muehlenhoff) [08:09:15] for some reason systemd restart policy failed [08:09:18] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1002 is CRITICAL: connect to address 10.64.32.178 and port 9001: Connection refused https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [08:09:22] anyway I 'll restart and read up on it [08:09:24] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on pc2007 - https://phabricator.wikimedia.org/T255904 (10Kormat) 05Open→03Resolved Array rebuild has completed, and is back in "optimal" state. 
[08:09:31] rt also works [08:09:31] that is ok, that is why we gather info [08:09:52] RECOVERY - MegaRAID on pc2007 is OK: OK: optimal, 1 logical, 4 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:09:54] !log restart etherpad-lite on etherpad1002 [08:09:56] so it looks only etherpad specific [08:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:12] RECOVERY - Check systemd state on etherpad1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:10:12] etherpad is back for me now [08:10:15] lets see what happens after restart [08:10:26] yep [08:10:26] I can write fine on etherpad [08:10:27] yeah, something with the systemd unit. It should have tried to restart it again, but it didn't [08:10:40] RECOVERY - etherpad_lite_process_running on etherpad1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [08:10:41] I think I need to tweak a bit the restart policy [08:10:44] to be fair, we had documented to restart etherpad, we agree for further testing [08:10:50] it does was Restart=always [08:10:52] *ed [08:10:56] s/was/have/ [08:11:00] because it is nice to understand why [08:11:06] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1002 is OK: HTTP OK: HTTP/1.1 200 OK - 9000 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [08:11:09] (03CR) 10Volans: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/607258 (owner: 10Jcrespo) [08:11:11] but I think it died too much and systemd stopped trying [08:11:23] it could be [08:11:39] the problem with app logic is that how it reacts is complicated [08:11:44] Jun 25 08:04:23 etherpad1002 systemd[1]: etherpad-lite.service: Service RestartSec=100ms expired, scheduling restart. 
[08:11:44] Jun 25 08:04:23 etherpad1002 systemd[1]: etherpad-lite.service: Scheduled restart job, restart counter is at 7. [08:11:44] Jun 25 08:04:23 etherpad1002 systemd[1]: Stopped Etherpad-lite daemon. [08:11:44] Jun 25 08:04:23 etherpad1002 systemd[1]: etherpad-lite.service: Start request repeated too quickly. [08:11:48] especially with multiple threads [08:11:49] yup, that's it [08:11:50] and read only [08:11:57] threads? [08:12:05] this is a nodejs app we are talking about [08:12:06] e.g. db connections [08:12:09] no threads really here [08:12:13] at some poing [08:12:16] hahaha [08:12:16] no multiple connections either [08:12:18] *point [08:12:26] one connection could see everything is all right [08:12:33] and another see it is in read only or down [08:12:40] and not react accordingly [08:12:42] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: [08:12:42] most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [08:12:48] we can tweak RestartSec, the default is very low, 0.1 seconds or so [08:12:51] if things are stable [08:12:52] RECOVERY - etherpad_up reduced availability on icinga1001 is OK: (C)0.8 le (W)0.9 le 1 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:12:59] moritzm: yup, that's what I'll do [08:12:59] I will start bacula [08:13:07] jynus: go ahead [08:13:09] wikifeeds being rate limited? [08:13:09] ok [08:13:15] 429? what's up with that?
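[editor's note] The "Start request repeated too quickly" message above is systemd's start rate limit kicking in. The following is a rough Python sketch of why a very short RestartSec can exhaust that limit while a longer one does not. It assumes the stock systemd defaults (5 starts per 10 second window) and a simplified fixed-window counter; the values actually configured on etherpad1002 are not shown in the log, and the 1 second crash time is made up for illustration.

```python
def starts_before_giving_up(restart_sec, run_time, burst=5, interval=10.0, horizon=600.0):
    """Simulate a crash-looping service under a start rate limit.

    restart_sec: delay between a crash and the next start attempt (RestartSec).
    run_time:    hypothetical time the service survives before crashing again.
    burst/interval: assumed to mirror StartLimitBurst / StartLimitIntervalSec defaults.
    Returns the number of starts allowed before one is refused, or None if no
    start is refused within `horizon` simulated seconds.
    """
    window_start = None   # time of the first start in the current window
    in_window = 0         # starts counted in the current window
    allowed = 0           # total starts allowed so far
    now = 0.0
    while now <= horizon:
        # Open a fresh window once the previous one has fully elapsed.
        if window_start is None or now - window_start > interval:
            window_start, in_window = now, 0
        # Too many starts inside the window: the unit is left in "failed".
        if in_window >= burst:
            return allowed
        in_window += 1
        allowed += 1
        # The service crashes after run_time, then waits restart_sec to retry.
        now += run_time + restart_sec
    return None


if __name__ == "__main__":
    print("RestartSec=0.1:", starts_before_giving_up(0.1, 1.0))
    print("RestartSec=5.0:", starts_before_giving_up(5.0, 1.0))
```

Under these assumptions, a 100ms restart delay burns through the allowed starts within one window, whereas a delay of a few seconds spreads the attempts out so the limit is never reached. That matches the idea of raising RestartSec mentioned in the channel, although the right value depends on how long the database stays unavailable during a failover.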
[08:14:01] !log restarting bacula-dir on backup1001 [08:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:23] Info: /Stage[main]/Bacula::Director/Service[bacula-director]: Unscheduling refresh on Service[bacula-director] [08:14:28] so, librenms, rt and racktables are working fine too [08:14:32] RECOVERY - bacula director process on backup1001 is OK: PROCS OK: 1 process with UID = 112 (bacula), command name bacula-dir https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [08:14:42] https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?orgId=1&refresh=1m&from=now-2d&to=now [08:14:48] interesting. requests doubled on the 24th [08:14:51] (03PS2) 10Muehlenhoff: Remove access for nathante [puppet] - 10https://gerrit.wikimedia.org/r/607722 [08:15:09] lets run a backup just to be sure [08:15:39] gerrit1001.wikimedia.org-Hourly-Sun-production-gerrit-repo-data is running [08:15:41] (03CR) 10jerkins-bot: [V: 04-1] Remove access for nathante [puppet] - 10https://gerrit.wikimedia.org/r/607722 (owner: 10Muehlenhoff) [08:16:01] 239104 Incr 2,153 98.53 M OK 25-Jun-20 08:15 gerrit1001.wikimedia.org-Hourly-Sun-production-gerrit-repo-data [08:16:04] ran ok [08:16:32] https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?panelId=28&fullscreen&orgId=1&refresh=1m&from=now-2d&to=now [08:16:34] marostegui: anythign weird on db connections / traffic to old servers? 
[08:16:36] wow, that's not good [08:16:45] * akosiaris opening task [08:17:14] jynus: nope it is empty and actually replication 10.4 -> 10.1 is not broken yet :) [08:17:30] I was about to ask [08:17:55] Query Throughput seems lower [08:17:57] on [08:18:08] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=misc&var-shard=m1&var-role=All&from=1593051483625&to=1593073083625 [08:18:14] (03PS3) 10Muehlenhoff: Remove access for nathante [puppet] - 10https://gerrit.wikimedia.org/r/607722 [08:18:57] should we double check zarcillo or refresh prometheus? [08:19:26] https://grafana.wikimedia.org/d/000000273/mysql?panelId=37&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1135&var-port=9104 [08:19:39] https://grafana.wikimedia.org/d/000000273/mysql?panelId=37&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1097&var-port=9104 [08:19:45] open connections went from 30-41 to 21 [08:19:46] that looks good [08:20:04] so it could be prometheus? [08:20:12] (03PS15) 10Kormat: Add native mysql spicerack module. [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) [08:20:23] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for nathante [puppet] - 10https://gerrit.wikimedia.org/r/607722 (owner: 10Muehlenhoff) [08:20:37] let me refresh prometheus config [08:20:43] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [08:20:52] so basically the connections have shifted from db1135 to db1097 with more or less the same number [08:21:25] master: db1135 [08:21:35] replica only: db1117 [08:21:45] how come? [08:21:50] so there may be something missing on zarcillo [08:21:55] (03CR) 10Kormat: "Now that the code itself is pretty settled, i'll start working on tests."
(031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat) [08:22:05] root@cumin1001:/home/marostegui# ./section m1 | grep db1097 [08:22:05] db1097.eqiad.wmnet 3306 [08:22:08] what should be on replica? [08:22:22] replica should be db1117 and master db1097 [08:22:25] checking [08:22:38] marostegui: do you need an extra pair of eyes for anything? [08:22:39] Updating zarcillo... [08:22:39] [WARNING] Old master not found on zarcillo master list [08:23:08] kormat: no, not needed at this point, thank you [08:23:25] there you have it [08:23:35] let me see what was missing [08:23:52] db1097 : core [08:24:03] I think it should be misc [08:24:03] ha [08:24:06] yep [08:24:11] I update [08:24:27] so metrics where happening [08:24:30] | m1 | eqiad | db1135 | [08:24:34] but were being sent to m1 on core [08:24:37] Do you update that or I do? [08:24:39] I do [08:24:42] ok [08:25:34] I will refresh prometheus now [08:25:51] now both are there [08:26:16] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/607645 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [08:26:19] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: [08:26:19] most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [08:26:22] goood [08:26:34] I wonder how ever why the script failed [08:26:42] because that should not be a cause [08:26:47] 10Puppet, 10Toolforge, 10Documentation, 10User-srodlund: Document our GridEngine 
set up - https://phabricator.wikimedia.org/T88733 (10Aklapper) https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#GridEngine_Master seems to be closest here; may want to link from https://wikitech.wikimedia.org/wiki/He... [08:27:26] jynus: I know why [08:27:28] ZARCILLO_INSTANCE = 'db1115' # instance_name:port format [08:27:45] that needs to be db2093 [08:27:56] ah [08:28:08] master right now is: m1 | eqiad | db1135 [08:28:17] what should it say? [08:28:21] db1097 [08:29:06] (03PS1) 10Marostegui: switchover.py: Change zarcillo instance [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/607725 (https://phabricator.wikimedia.org/T254556) [08:29:18] jynus: ^ [08:29:47] (03CR) 10Jcrespo: [C: 03+1] switchover.py: Change zarcillo instance [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/607725 (https://phabricator.wikimedia.org/T254556) (owner: 10Marostegui) [08:29:57] (03CR) 10Marostegui: [C: 03+2] switchover.py: Change zarcillo instance [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/607725 (https://phabricator.wikimedia.org/T254556) (owner: 10Marostegui) [08:30:02] we should have a better discovery method than a global constant :-D [08:30:12] or a cname :) [08:30:44] master: db1097 [08:30:58] coool [08:30:58] replicas: db1117:13321, db1135:9104 [08:31:07] and codfw? [08:32:03] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [08:32:30] it only shows replica: db2078:13321 [08:32:32] no master [08:32:39] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [08:32:52] that's strange [08:32:57] db2132 is codfw's master for m1 [08:33:18] on zarcillo it says: m1 | codfw | db2132 [08:33:30] yeah, that is the one [08:33:57] it is on core too [08:34:13] maybe you can take care of reviewing the hosts that are on core that should be on misc? [08:34:21] yeah [08:34:23] I can do that [08:34:28] this will fail until we have it automated [08:34:36] so it is a best effort [08:34:52] Can you fix db2132 and I will take care of the rest of hosts? [08:34:55] metrics are not lost, but they are added to the wrong group [08:34:59] RECOVERY - Check systemd state on an-tool1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:35:06] I can update db2132 yes [08:35:13] cool, I will check the rest of misc sections [08:35:43] [zarcillo]> update instances set `group` = 'misc' where name ='db2132'; [08:36:06] ha, good timing, now I don't have to do a describe instances; [08:36:19] RECOVERY - Check systemd state on an-launcher1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:37:20] let's move convo to databases [08:37:25] as maintenance seems done [08:37:41] +1 [08:40:03] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: [08:40:03] most-read 
articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [08:40:14] 10Operations, 10DBA, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10jbond) >>! In T256120#6252975, @Marostegui wrote: > Should be fixed now. Thanks although I'm now getting "Error message: CREATE command denied... [08:41:42] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (10hashar) There are ferm rules: ` iptables --list -v|grep bacula 36 2160 ACCEPT tcp -- any any b... [08:41:57] 10Operations, 10DBA, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10Marostegui) Fixed [08:42:34] !log releases2002: restarted bacula-fd to take in account the puppet provided configuration # T247652 [08:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:39] T247652: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 [08:43:12] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (10hashar) @jcrespo should be good now: ` # netstat -tlnp|grep bacula tcp 0 0 0.0.0.0:9102... [08:45:28] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (10jcrespo) Thanks, it ran successfully now: ` 239105 Full 21 22.89 K OK 25-Jun-20 08:44 rele... 
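[editor's note] The zarcillo fix discussed around 08:35 (hosts filed under group 'core' that belong in 'misc') was done with a single UPDATE on the instances table. Below is a small Python sketch of the same correction applied to several hosts at once. It uses an in-memory SQLite database as a stand-in for zarcillo, which is an assumption made only so the example runs anywhere. Only the `instances` table with its `name` and `group` columns comes from the log; the helper names, the sample rows, and the group shown for db1117 are hypothetical.

```python
import sqlite3


def regroup(conn, names, new_group="misc"):
    """Move the named instances to new_group; return the names actually changed."""
    changed = []
    for name in names:
        # "group" is a reserved word, so it is quoted (MariaDB would use backticks).
        cur = conn.execute(
            'UPDATE instances SET "group" = ? WHERE name = ? AND "group" <> ?',
            (new_group, name, new_group),
        )
        if cur.rowcount:
            changed.append(name)
    conn.commit()
    return changed


def demo():
    """Build a toy instances table (illustrative data) and fix its groups."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE instances (name TEXT PRIMARY KEY, "group" TEXT)')
    conn.executemany(
        "INSERT INTO instances VALUES (?, ?)",
        [("db1097", "core"), ("db2132", "core"), ("db1117", "misc")],
    )
    changed = regroup(conn, ["db1097", "db2132", "db1117"])
    groups = dict(conn.execute('SELECT name, "group" FROM instances'))
    return changed, groups


if __name__ == "__main__":
    print(demo())
```

Returning the list of changed names makes it easy to log exactly which hosts were regrouped, which seems useful given that the channel notes this step stays manual until it is automated.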
[08:46:20] 10Operations, 10SRE-Access-Requests: Requesting access to PROD for lmata (SRE) - https://phabricator.wikimedia.org/T254818 (10jbond) 05Open→03Resolved >>! In T254818#6253018, @Dzahn wrote: > "Membership of ops group in LDAP and YAML are not identical: ['lmata']" This is fixed now thanks @Dzahn [08:46:59] ACKNOWLEDGEMENT - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) alexandros kosiaris https://phabricator.wikimedia.org/T256358 - The acknowledgement ex [08:46:59] -26 18:45:49. https://wikitech.wikimedia.org/wiki/Wikifeeds [08:50:48] (03PS1) 10Marostegui: db1135: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/607728 (https://phabricator.wikimedia.org/T253217) [08:51:30] (03CR) 10Marostegui: [C: 03+2] db1135: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/607728 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [08:51:52] moritzm: is your change ok to merge? [08:54:38] ema: indeed it went away, but job=wikidough site=codfw is still ongoing [08:54:38] (03CR) 10Hashar: ci: switch integration.wikimedia.org to scap DocumentRoot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607076 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [08:55:51] jynus: right, probably due to ab8a948a. I'll investigate further, thanks! 
[08:56:41] !log joal@deploy1001 Started deploy [analytics/refinery@4aba370]: Analytics fix over weekly train [analytics/refinery@4aba370] [08:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:06] (03PS1) 10Muehlenhoff: Fix status check for Kerberos principal deletion [puppet] - 10https://gerrit.wikimedia.org/r/607729 [08:58:05] elukey: with your permission I will ack analytics1030 alerts (scheduled for decom) to also remove them from the unack list [08:58:07] [08:58:29] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [08:59:03] I will also ack services for netbox-dev2001 with is being built and it is WIP [09:01:25] !log restarting acme-chief instances to catch up on kernel updates [09:01:27] I think that will help with ongoing issues discoverability, I will revert if that impacts any or your work [09:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:06] idem for analytics1039, scheduled for decom [09:02:17] jynus: sure no problem, where did you find the alarms? 
I thought I had everything acked in icinga, the host is part of the test cluster (we'll replace it with proper nodes not OOW soon) [09:02:29] it is disabled [09:02:33] so no issue [09:02:43] but if you search for ongoing alerts, it lists it [09:02:50] ahh all of them [09:02:51] so it doesn't hurt to ack it [09:02:53] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [09:03:01] if you don't mind, not really needed [09:03:12] +1 thanks a lot for the cleanup [09:03:23] so you did nothing wrong [09:03:39] but it helps me tracking other ongoing issues [09:04:37] for example, https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=16&hoststatustypes=3&serviceprops=2097162 only shows 2 alerts [09:04:57] and someone was working on one [09:05:51] or the wikifeeds thing that alex mentioned stands out more [09:06:16] just a preference of mine, but hopefully it is helpful for others too [09:08:20] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: [09:08:20] most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [09:13:08] !log joal@deploy1001 Finished deploy [analytics/refinery@4aba370]: Analytics fix over weekly train [analytics/refinery@4aba370] (duration: 16m 27s) [09:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:30] !log joal@deploy1001 Started deploy [analytics/refinery@4aba370] (thin): Analytics fix over weekly train THIN 
[analytics/refinery@4aba370] [09:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:34] I wish I could ack a flapping service somehow in icinga [09:13:40] !log joal@deploy1001 Finished deploy [analytics/refinery@4aba370] (thin): Analytics fix over weekly train THIN [analytics/refinery@4aba370] (duration: 00m 10s) [09:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:21] "catch the service while it is flapping (away)" [09:15:39] PROBLEM - Thanos compact is halted on icinga1001 is CRITICAL: cluster=thanos instance=thanos-fe2001 job=thanos-compact prometheus=ops site=codfw https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:16:48] wikifeeds seems weird up to now [09:17:10] it's like suddenly users from 2 countries in the world decided to use the app more [09:17:52] the 2 countries part is not exactly right of course, it's just that those are really large countries [09:18:41] (03CR) 10Jcrespo: "> Patch Set 1:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607258 (owner: 10Jcrespo) [09:19:37] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: update scheme for wikidough (improves ab8a948a) [puppet] - 10https://gerrit.wikimedia.org/r/607570 (owner: 10Ssingh) [09:21:19] !log rolling restart of ncredir instances to catch up on kernel updates [09:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:38] I'm looking at the thanos compact alert [09:25:15] and it ran out of space on the local host while compacting -.- [09:25:17] (03PS2) 10Muehlenhoff: Remove cas-logstash from caches [puppet] - 10https://gerrit.wikimedia.org/r/607508 (https://phabricator.wikimedia.org/T246998) [09:26:46] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/607645 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [09:28:05] !log extend lv on 
thanos-fe2001 and restart thanos-compact [09:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:44] !log schedule downtime for eqiad wikifeeds as it's flapping too much without yet knowing why. T256358 [09:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:48] T256358: wikifeeds usage increased by 3x on 2020-06-24 - https://phabricator.wikimedia.org/T256358 [09:28:51] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:29:03] (03CR) 10Muehlenhoff: releases::mediawiki:: support buster / PHP 7.3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607641 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn) [09:29:59] (03PS2) 10Jcrespo: cumin: backup all of /srv where a lot of deployment state may live [puppet] - 10https://gerrit.wikimedia.org/r/607258 [09:30:05] RECOVERY - Thanos compact is halted on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:33:29] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/607609 (owner: 10CDanis) [09:34:15] !log volans@cumin1001 START - Cookbook sre.dns.netbox [09:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:22] (03CR) 10Jcrespo: "Let me know if this version is ok:" [puppet] - 10https://gerrit.wikimedia.org/r/607258 (owner: 10Jcrespo) [09:36:55] PROBLEM - Thanos compact has not run on icinga1001 is CRITICAL: 4.425e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:37:19] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:23] (03PS1) 10Jforrester: ExtensionDistribution: Drop REL1_33, EOL'ed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607740 (https://phabricator.wikimedia.org/T256087) [09:38:27] found it I think. 
There seems to be a restbase deploy right before the wikifeeds issues start [09:38:40] I wonder whether I should rollback or leave it to the devs [09:41:28] (03PS1) 10Volans: mgmt: netbox-generated data for frack mgmt codfw [dns] - 10https://gerrit.wikimedia.org/r/607741 (https://phabricator.wikimedia.org/T233183) [09:44:21] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:49:29] RECOVERY - Thanos compact has not run on icinga1001 is OK: (C)24 ge (W)12 ge 0.003265 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:51:02] (03PS1) 10Mvolz: Update citoid to dcc45a42 [deployment-charts] - 10https://gerrit.wikimedia.org/r/607745 [09:52:57] Hey, I notice that citoid is listed in the services deployment windows here: https://wikitech.wikimedia.org/wiki/Deployments - but I mostly deploy citoid I've never actually used any of those windows. 😅 Is this schedule supposed to be prescriptive or descriptive? 
[09:53:25] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [09:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:09] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm [09:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:32] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [09:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:19] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [09:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:44] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [09:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:11] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs [10:00:29] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 1461 days) https://wikitech.wikimedia.org/wiki/Logs [10:00:34] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [10:00:35] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:38] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10ops-monitoring-bot) Icinga downtime for 12:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: Memory upgrade ` kubestagetcd1004.eqiad.wmnet ` [10:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:43] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [10:00:44] !log akosiaris@cumin1001 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:48] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10ops-monitoring-bot) Icinga downtime for 12:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: Memory upgrade ` ganeti1005.eqiad.wmnet ` [10:02:44] (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 2:" [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn) [10:04:37] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:41] !log poweroff kubestagetcd1004 and ganeti1005 for T244530 [10:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:44] T244530: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 [10:04:54] (03CR) 10Filippo Giunchedi: [C: 03+1] "Indeed re: ipv6 (see comment on I33596b)" [dns] - 10https://gerrit.wikimedia.org/r/607637 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn) [10:04:59] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10akosiaris) [10:05:35] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10akosiaris) @Jclark-ctr: ganeti1005 is ready. Fully depooled, downtimed and powered off. 
[10:07:21] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm [10:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:32] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm [10:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:41] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm [10:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:49] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.makevm [10:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:27] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [10:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:37] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [10:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:26] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [10:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:25] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [10:17:27] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [10:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:17] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:07] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:22:03] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [10:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:47] 
PROBLEM - Check systemd state on ncredir2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:25:39] (03CR) 10Ayounsi: "> It could be evaluated if this record should be in mgmt.frack or just frack as it is right now." [dns] - 10https://gerrit.wikimedia.org/r/607741 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [10:25:54] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:18] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [10:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:15] (03CR) 10Ayounsi: [C: 03+1] "> Thinking about it, .mgmt.frack sounds better, as it's an IP in that vlan. But no strong opinion, so whatever is easier to manage." [dns] - 10https://gerrit.wikimedia.org/r/607741 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [10:32:11] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:58] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [10:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:21] (03PS1) 10Alexandros Kosiaris: Introduce kubernetes[12]01[56] [puppet] - 10https://gerrit.wikimedia.org/r/607752 (https://phabricator.wikimedia.org/T256236) [10:34:46] (03CR) 10Elukey: [C: 03+1] "Surely a pebcak on my side, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/607729 (owner: 10Muehlenhoff) [10:35:39] (03CR) 10Alexandros Kosiaris: [C: 03+2] Introduce kubernetes[12]01[56] [puppet] - 10https://gerrit.wikimedia.org/r/607752 (https://phabricator.wikimedia.org/T256236) (owner: 10Alexandros Kosiaris) [10:36:12] (03CR) 10Muehlenhoff: [C: 03+2] Fix status check for Kerberos principal deletion [puppet] - 10https://gerrit.wikimedia.org/r/607729 (owner: 10Muehlenhoff) [10:38:51] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:55] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [mostread] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:38:55] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:39:11] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:40:37] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:40:41] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:40:53] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints 
are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:41:11] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [10:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:42] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: use system openjdk 11 for logging ES7 instances [puppet] - 10https://gerrit.wikimedia.org/r/607542 (https://phabricator.wikimedia.org/T252913) (owner: 10Herron) [10:41:47] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [10:45:19] !log rolling reboot of ms-be[2044-2056] [10:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:38] (03PS1) 10Alexandros Kosiaris: Add kubernetes[12]01[56] [homer/public] - 10https://gerrit.wikimedia.org/r/607754 (https://phabricator.wikimedia.org/T256236) [10:45:52] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:10] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [10:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:32] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [10:53:30] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:04] (03CR) 10Jforrester: "We use the PHP version for docroot hosting for a few things still, don't we? 
doc1001 is for most things, but there's still… coverage repor" [puppet] - 10https://gerrit.wikimedia.org/r/607641 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn) [10:56:45] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [10:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:18] (03CR) 10Jbond: [C: 03+1] Remove cas-logstash from caches [puppet] - 10https://gerrit.wikimedia.org/r/607508 (https://phabricator.wikimedia.org/T246998) (owner: 10Muehlenhoff) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European mid-day backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200625T1100). [11:00:26] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:30] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [11:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:52] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:01:54] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:05:45] (03PS4) 10Jcrespo: mariadb-backups: Move x1 backup source from db1095 to db1102 [puppet] - 10https://gerrit.wikimedia.org/r/607510 (https://phabricator.wikimedia.org/T254871) [11:06:16] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: 
/en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:06:22] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:10] (03CR) 10Volans: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/607754 (https://phabricator.wikimedia.org/T256236) (owner: 10Alexandros Kosiaris) [11:12:25] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the patch!" [puppet] - 10https://gerrit.wikimedia.org/r/607258 (owner: 10Jcrespo) [11:14:05] (03CR) 10Ayounsi: [C: 03+1] Add kubernetes[12]01[56] (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/607754 (https://phabricator.wikimedia.org/T256236) (owner: 10Alexandros Kosiaris) [11:16:11] I'd like to add something to the BACON window, I can deploy it myself. [11:17:32] awight: go ahead :) [11:18:44] (03CR) 10Volans: [C: 03+1] "> Patch Set 15:" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat) [11:21:06] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [11:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:57] ty! 
[11:24:09] (03PS1) 10Awight: [beta] Enable mobile view for dewiki survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607762 (https://phabricator.wikimedia.org/T253112) [11:24:28] (03CR) 10Awight: [C: 03+2] "beta-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607762 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) [11:25:06] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:18] (03Merged) 10jenkins-bot: [beta] Enable mobile view for dewiki survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607762 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) [11:27:54] !log rolling reboot of ms-be[1044-1059].eqiad.wmnet [11:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:02] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:34:52] (03PS1) 10Awight: Enable WMDE Tech Wishes survey configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607763 (https://phabricator.wikimedia.org/T253112) [11:35:14] (03CR) 10Awight: [C: 03+2] "BACON" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607763 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) [11:36:03] (03Merged) 10jenkins-bot: Enable WMDE Tech Wishes survey configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607763 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) [11:36:37] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [11:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:34] (03PS1) 10Elukey: Set notebook100[3,4] with role::insetup [puppet] - 10https://gerrit.wikimedia.org/r/607764 
(https://phabricator.wikimedia.org/T256363) [11:38:54] !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: BACON: [[gerrit:607763|Enable WMDE Tech Wishes survey configuration (T253112)]] (duration: 01m 09s) [11:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:59] T253112: Create survey for TechWish prototype announcements on dewiki and metawiki - https://phabricator.wikimedia.org/T253112 [11:39:39] (03PS2) 10Elukey: Set notebook100[3,4] with role::insetup [puppet] - 10https://gerrit.wikimedia.org/r/607764 (https://phabricator.wikimedia.org/T256363) [11:41:29] (03PS3) 10Arturo Borrero Gonzalez: toolforge: mailrelay: introduce SRS to correctly envelope forwarded emails [puppet] - 10https://gerrit.wikimedia.org/r/607279 (https://phabricator.wikimedia.org/T120225) [11:41:31] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:09] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [11:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:30] (03PS1) 10Awight: Enable QuickSurveys on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607767 (https://phabricator.wikimedia.org/T253112) [11:46:45] (03CR) 10Awight: [C: 03+2] "BACON" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607767 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) [11:47:36] (03Merged) 10jenkins-bot: Enable QuickSurveys on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607767 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) [11:48:30] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:42] 10Operations: Create ssh keypair for integration/docroot deployment with scap - https://phabricator.wikimedia.org/T256138 
(10ema) Key generated and added to the private puppet repo under `modules/secret/secrets/keyholder`. [11:49:59] !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: BACON: [[gerrit:607767|Enable QuickSurveys on metawiki (T253112)]] (duration: 01m 05s) [11:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:34] T253112: Create survey for TechWish prototype announcements on dewiki and metawiki - https://phabricator.wikimedia.org/T253112 [11:50:48] 10Operations: Create ssh keypair for integration/docroot deployment with scap - https://phabricator.wikimedia.org/T256138 (10ema) a:05ema→03None [11:51:04] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [11:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:34] (03PS1) 10Ssingh: wikidough: organize shared fake passwords [labs/private] - 10https://gerrit.wikimedia.org/r/607769 [11:55:01] !log EU BACON is cooked [11:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:03] (03CR) 10Ssingh: [V: 03+2 C: 03+2] wikidough: organize shared fake passwords [labs/private] - 10https://gerrit.wikimedia.org/r/607769 (owner: 10Ssingh) [11:55:32] !log installing python3.4 security updates [11:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:58] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:02] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/23471/" [puppet] - 10https://gerrit.wikimedia.org/r/607764 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey) [12:03:58] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:05:59] (03PS1) 10Elukey: Clean up 
old reference to notebook100[3,4] and set PXE to Buster [puppet] - 10https://gerrit.wikimedia.org/r/607771 (https://phabricator.wikimedia.org/T256363) [12:07:17] (03CR) 10Elukey: [C: 03+2] Clean up old reference to notebook100[3,4] and set PXE to Buster [puppet] - 10https://gerrit.wikimedia.org/r/607771 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey) [12:08:08] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:09:38] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [12:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:01] (03PS1) 10Ssingh: prometheus: use the correct password for the wikidough job [puppet] - 10https://gerrit.wikimedia.org/r/607772 (https://phabricator.wikimedia.org/T252132) [12:13:18] RECOVERY - Check systemd state on ncredir2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:13:30] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:34] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5010 is OK: HTTP OK: HTTP/1.0 200 OK - 23528 bytes in 0.750 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:14:36] "wikidough" is one of the greatest names I've seen in a long time [12:16:19] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/23472/" [puppet] - 10https://gerrit.wikimedia.org/r/607772 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [12:16:28] (03PS1) 10Awight: Enable TechWishes survey for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607773 (https://phabricator.wikimedia.org/T253112) [12:16:40] legoktm: haha thank you! 
[12:17:03] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [12:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:52] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: use the correct password for the wikidough job [puppet] - 10https://gerrit.wikimedia.org/r/607772 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [12:19:21] (03CR) 10Ssingh: [C: 03+2] prometheus: use the correct password for the wikidough job [puppet] - 10https://gerrit.wikimedia.org/r/607772 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [12:21:12] (03PS1) 10Vgutierrez: Release 8.0.8-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/607774 [12:21:57] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:34] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:25:20] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [12:25:22] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [12:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:42] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=wikidough site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:26:16] !log installing libjpeg-turbo security updates [12:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:50] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:27:45] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi TTN-0004198221 / TTN-0004197860 - The acknowledgement expires at: 2020-06-25 18:00:00. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:27:45] ACKNOWLEDGEMENT - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi TTN-0004198221 / TTN-0004197860 - The acknowledgement expires at: 2020-06-25 18:00:00. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:27:56] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:01] "Our technicians ETA to the Ft. Worth site has been updated to approximately 4 hours." so ACKing it for 6h [12:28:43] (03CR) 10CDanis: [C: 03+2] fix multiple invocations of systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/607609 (owner: 10CDanis) [12:30:22] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [12:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:44] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [12:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:28] (03CR) 10Kormat: [C: 03+2] Add native mysql spicerack module. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat) [12:32:33] !log installing libssh2 security updates [12:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:38] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:27] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [12:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:50] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=wikidough site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:40:53] (03PS1) 10Elukey: Remove notebook1003 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/607779 (https://phabricator.wikimedia.org/T256363) [12:41:17] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:41:50] (03CR) 10Elukey: [C: 03+2] Remove notebook1003 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/607779 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey) [12:42:18] !log installing libmspack security updates [12:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:04] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [12:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:21] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:37] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [12:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:35] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [12:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:44] (03PS4) 10Arturo Borrero Gonzalez: toolforge: mailrelay: introduce SRS to correctly envelope forwarded emails [puppet] - 10https://gerrit.wikimedia.org/r/607279 (https://phabricator.wikimedia.org/T120225) [12:51:13] (03PS1) 10Elukey: Rename notebook1003 records to an-launcher1002 records [dns] - 10https://gerrit.wikimedia.org/r/607780 (https://phabricator.wikimedia.org/T256363) [12:51:19] volans: if you have a sec --^ [12:51:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: mailrelay: introduce SRS to correctly envelope forwarded emails [puppet] - 10https://gerrit.wikimedia.org/r/607279 (https://phabricator.wikimedia.org/T120225) (owner: 10Arturo Borrero Gonzalez) [12:51:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:51:42] sure [12:53:00] ipv6 records are missing, will add them later on [12:54:21] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/607780 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey) [12:54:46] \o/ [12:54:51] PROBLEM - Host ganeti1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:54:54] (03CR) 10Elukey: [C: 03+2] Rename notebook1003 records to an-launcher1002 records [dns] - 10https://gerrit.wikimedia.org/r/607780 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey) [12:55:26] !log rename notebook1003 to an-launcher1002 - T256363 [12:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:30] T256363: Repurpose notebook100[3,4] - https://phabricator.wikimedia.org/T256363 [12:55:39] RECOVERY - Maps - OSM synchronization lag - codfw on icinga1001 is OK: (C)2.592e+05 ge (W)1.764e+05 ge 7.174e+04 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1 [12:55:56] elukey: ack, it's consistent with what's defined already [12:57:17] (03PS5) 10Arturo Borrero Gonzalez: toolforge: mailrelay: introduce SRS to correctly envelope forwarded emails [puppet] - 10https://gerrit.wikimedia.org/r/607279 (https://phabricator.wikimedia.org/T120225) [12:59:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: mailrelay: introduce SRS to correctly envelope forwarded emails [puppet] - 10https://gerrit.wikimedia.org/r/607279 (https://phabricator.wikimedia.org/T120225) (owner: 10Arturo Borrero Gonzalez) [12:59:52] (03PS1) 10Elukey: Add an-launcher1002 to puppet config [puppet] - 10https://gerrit.wikimedia.org/r/607781 (https://phabricator.wikimedia.org/T256363) [13:00:04] brennen and hashar: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - American+European Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200625T1300). 
[13:00:19] (03CR) 10jerkins-bot: [V: 04-1] Add an-launcher1002 to puppet config [puppet] - 10https://gerrit.wikimedia.org/r/607781 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey) [13:00:57] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10Jclark-ctr) @akosiaris ganeti1005 is finished and booting up now Thanks! [13:01:23] RECOVERY - Host ganeti1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [13:01:59] (03PS2) 10Elukey: Add an-launcher1002 to puppet config [puppet] - 10https://gerrit.wikimedia.org/r/607781 (https://phabricator.wikimedia.org/T256363) [13:02:38] (03CR) 10Elukey: [C: 03+2] Add an-launcher1002 to puppet config [puppet] - 10https://gerrit.wikimedia.org/r/607781 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey) [13:02:47] !log installing 4.9.210-1+deb9u1~deb8u1 on jessie hosts (fixed kernel for recent cacheoutattack CPU leaks) [13:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:12] (03PS1) 10Jbond: jpa: add workaround for HikariCP dependency clash [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607782 [13:11:36] (03PS1) 10Filippo Giunchedi: thanos: set consistency-delay on store [puppet] - 10https://gerrit.wikimedia.org/r/607783 (https://phabricator.wikimedia.org/T233956) [13:13:46] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:30] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T255520 (10Jclark-ctr) @elukey any rows that these need to be in? 
[13:17:52] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/23474/" [puppet] - 10https://gerrit.wikimedia.org/r/607783 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [13:19:50] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:09] (03PS1) 10Kormat: mysql: Spruce up documentation formatting [software/spicerack] - 10https://gerrit.wikimedia.org/r/607786 [13:25:47] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001 job=burrow partition={0,1,2,3,4,5} site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [13:26:38] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:03] mmhh logstash1007's unhappy, I'll bounce logstash there [13:28:31] !log bounce logstash on logstash1007 [13:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:12] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607782 (owner: 10Jbond) [13:30:31] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:53] (03CR) 10Jbond: [V: 03+2 C: 03+2] jpa: add workaround for HikariCP dependency clash [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607782 (owner: 10Jbond) [13:33:09] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK:
All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [13:34:58] (03CR) 10Vgutierrez: [C: 03+2] Release 8.0.8-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/607774 (owner: 10Vgutierrez) [13:36:00] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:47] 10Operations, 10DBA, 10DC-Ops, 10Sustainability (Incident Prevention): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10Marostegui) Can this task be closed? By default hosts reimage now but they do kee... [13:40:54] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:01] (03PS3) 10Reedy: Replace PasswordNotInLargeBlacklist with PasswordNotInCommonList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605904 [13:47:05] jouncebot: now [13:47:06] For the next 1 hour(s) and 12 minute(s): Mediawiki train - American+European Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200625T1300) [13:47:26] (03CR) 10Reedy: [C: 03+2] Replace PasswordNotInLargeBlacklist with PasswordNotInCommonList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605904 (owner: 10Reedy) [13:48:29] (03Merged) 10jenkins-bot: Replace PasswordNotInLargeBlacklist with PasswordNotInCommonList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605904 (owner: 10Reedy) [13:49:54] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Replace PasswordNotInLargeBlacklist with PasswordNotInCommonList (duration: 01m 06s) [13:49:56] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:23] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings-labs.php: Replace PasswordNotInLargeBlacklist with PasswordNotInCommonList (duration: 01m 05s) [13:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:32] !log upload trafficserver 8.0.8 to apt.wm.o (buster) [13:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:31] Reedy: ❤️ [13:52:52] Just noticed it was still sitting there, so might aswell ship it [13:54:01] (03PS3) 10Reedy: Remove OAuthReplaceMessage hook subscriber [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601910 (https://phabricator.wikimedia.org/T254301) [13:54:09] (03CR) 10Reedy: [C: 03+2] Remove OAuthReplaceMessage hook subscriber [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601910 (https://phabricator.wikimedia.org/T254301) (owner: 10Reedy) [13:55:01] (03Merged) 10jenkins-bot: Remove OAuthReplaceMessage hook subscriber [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601910 (https://phabricator.wikimedia.org/T254301) (owner: 10Reedy) [13:56:19] !log upgrade ATS in ulsfo to version 8.0.8 [13:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:26] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: T254301 Remove OAuthReplaceMessage hook subscriber (duration: 01m 05s) [13:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:14] T254301: Replace OAuthReplaceMessage subscriber in CommonSettings.php - https://phabricator.wikimedia.org/T254301 [13:58:56] (03PS1) 10Jbond: apereo_cas: Enable SSL for DB connections [puppet] - 10https://gerrit.wikimedia.org/r/607793 (https://phabricator.wikimedia.org/T256113) [13:59:26] (03PS1) 10Muehlenhoff: Handle CAS war updates [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607794 (https://phabricator.wikimedia.org/T233950) 
[13:59:38] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:59] (03PS3) 10Reedy: Use structured logging fields for xff logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607389 [14:01:14] 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul) [14:01:51] (03PS2) 10Krinkle: logging: Fold Parsoid into type:mediawiki, add 'servergroup:' instead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606038 (https://phabricator.wikimedia.org/T255627) [14:02:07] (03PS2) 10Krinkle: mediawiki,logstash: Update type:parsoid-php -> type:mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/606049 (https://phabricator.wikimedia.org/T255627) [14:02:47] 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul) [14:03:26] (03CR) 10Krinkle: Use structured logging fields for xff logs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607389 (owner: 10Reedy) [14:03:29] 10Operations, 10DBA, 10DC-Ops, 10Sustainability (Incident Prevention): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10jcrespo) a:03Kormat [14:04:01] (03CR) 10Reedy: Use structured logging fields for xff logs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607389 (owner: 10Reedy) [14:04:22] !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db2088:3312', diff saved to https://phabricator.wikimedia.org/P11663 and previous config saved to /var/cache/conftool/dbconfig/20200625-140421-marostegui.json [14:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:33] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:04:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2104', diff saved to https://phabricator.wikimedia.org/P11664 and previous config saved to /var/cache/conftool/dbconfig/20200625-140519-marostegui.json [14:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:28] !log Stop MySQL on db2104 and db2088:3312 [14:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:46] (03PS4) 10Reedy: Use structured logging fields for xff logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607389 [14:12:08] (03Abandoned) 10Filippo Giunchedi: DNM: adjust logstash index template for ES 7 [puppet] - 10https://gerrit.wikimedia.org/r/545566 (https://phabricator.wikimedia.org/T235891) (owner: 10Filippo Giunchedi) [14:12:53] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:21] (03Abandoned) 10Filippo Giunchedi: swift: add role::swift::swiftrepl to ms-fe1001 [puppet] - 10https://gerrit.wikimedia.org/r/254412 (owner: 10Filippo Giunchedi) [14:14:49] (03Abandoned) 10Filippo Giunchedi: swift: add swift replication support via swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/254411 (owner: 10Filippo Giunchedi) [14:16:33] (03PS1) 10Faidon Liambotis: Allow SELECTED_PATH selection for IXP routes as well [homer/public] - 10https://gerrit.wikimedia.org/r/607800 [14:17:40] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:52] 10Operations, 10ORES, 10Scoring-platform-team: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331 (10akosiaris) Couple of more benefits of k8s I forgot to mention yesterday * Ability for >1 deployments. This might be beneficial from a product perspective, e.g. 
create an ORE... [14:18:26] (03CR) 10Ayounsi: [C: 03+1] Allow SELECTED_PATH selection for IXP routes as well [homer/public] - 10https://gerrit.wikimedia.org/r/607800 (owner: 10Faidon Liambotis) [14:19:17] 10Operations, 10DBA, 10DC-Ops, 10Sustainability (Incident Prevention): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10Kormat) From the perspective of #dba, this issue is mostly resolved. Most DB mach... [14:19:24] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:35] 10Operations, 10DC-Ops, 10SRE-tools, 10Sustainability (Incident Prevention): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10Kormat) [14:19:40] !log upgrade ATS in eqsin to version 8.0.8 [14:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:47] 10Operations, 10DC-Ops, 10SRE-tools, 10Sustainability (Incident Prevention): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10Kormat) a:05Kormat→03None [14:19:55] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10akosiaris) [14:20:23] PROBLEM - Host scs-a1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:20:51] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10akosiaris) @Jclark-ctr Excellent. I started the process of emptying ganeti1006 (and filling ganeti1005), that should take quite a while, but we should be on time for next Thursday. Many thanks! 
[14:21:06] (03CR) 10Jbond: Handle CAS war updates (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607794 (https://phabricator.wikimedia.org/T233950) (owner: 10Muehlenhoff) [14:24:19] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:02] (03PS2) 10Muehlenhoff: Handle CAS war updates [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607794 (https://phabricator.wikimedia.org/T233950) [14:26:06] (03CR) 10Muehlenhoff: Handle CAS war updates (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607794 (https://phabricator.wikimedia.org/T233950) (owner: 10Muehlenhoff) [14:26:13] RECOVERY - Host scs-a1-codfw is UP: PING WARNING - Packet loss = 50%, RTA = 36.74 ms [14:29:41] !log replacing mr1-codfw [14:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:30] papaul: let me know if you need any help or when you're done [14:31:43] XioNoX: sure thanks will let you know [14:32:08] and parent/child is working fine, all mgmt show up as UNREACH in icinga, and don't alert here [14:32:19] XioNoX: xool [14:32:21] cool [14:32:29] PROBLEM - Host asw-b-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:32:30] (03CR) 10Alexandros Kosiaris: Add kubernetes[12]01[56] (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/607754 (https://phabricator.wikimedia.org/T256236) (owner: 10Alexandros Kosiaris) [14:32:45] PROBLEM - Host asw-a-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:32:59] eh, maybe not all, but it's fine it's only a few [14:33:02] (03PS2) 10Alexandros Kosiaris: Add kubernetes[12]01[56] [homer/public] - 10https://gerrit.wikimedia.org/r/607754 (https://phabricator.wikimedia.org/T256236) [14:33:05] PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:33:33] PROBLEM - Host fasw-c-codfw is DOWN: PING CRITICAL 
- Packet loss = 100% [14:33:39] (03CR) 10Ema: [C: 03+1] Disable HTCP purging everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607593 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [14:34:13] PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:34:45] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:34:55] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:36:07] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:38:25] (03PS1) 10Reedy: Fix name of PasswordNotInCommonList in CommonSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607805 (https://phabricator.wikimedia.org/T256374) [14:38:51] (03CR) 10Reedy: [C: 03+2] Fix name of PasswordNotInCommonList in CommonSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607805 (https://phabricator.wikimedia.org/T256374) (owner: 10Reedy) [14:39:41] (03Merged) 10jenkins-bot: Fix name of PasswordNotInCommonList in CommonSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607805 (https://phabricator.wikimedia.org/T256374) (owner: 10Reedy) [14:41:40] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (10Dzahn) >>! In T247652#6255779, @jcrespo wrote: > Thanks, it ran successfully now: > > ` > 239105 Full... 
[14:43:32] !log upgrade ATS in esams to version 8.0.8 [14:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:57] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [14:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:21] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T255520 (10Jclark-ctr) @Cmjohnson will be bulk uploading to netbox after leaving data center HOST , SWITCHPORT , RACK , UNIT, ASSET TAG an-test-worker1001 25 A3 25 WMF4833 an-t... [14:50:12] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install 3 lightweight hadoop nodes - https://phabricator.wikimedia.org/T255518 (10Jclark-ctr) @Cmjohnson will be bulk uploading to netbox after leaving data center an-test-master1001 30 A5 30 WMF4836 an-test-master1002 36 C5 34 WMF4837 an-test-co... [14:50:29] RECOVERY - Host asw-a-codfw is UP: PING OK - Packet loss = 0%, RTA = 41.83 ms [14:50:30] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T255520 (10elukey) No preference, if possible one host per row, otherwise any arrangement that fit bests for you! [14:50:34] RECOVERY - Host asw-b-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.80 ms [14:50:57] RECOVERY - Host asw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.86 ms [14:51:11] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:51:31] yaaa [14:51:31] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:51] (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/607786 (owner: 10Kormat) [14:51:53] RECOVERY - Host fasw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.68 ms [14:52:00] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:52:02] RECOVERY - Host asw-d-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.76 ms [14:52:20] (03CR) 10Kormat: [C: 03+2] mysql: Spruce up documentation formatting [software/spicerack] - 10https://gerrit.wikimedia.org/r/607786 (owner: 10Kormat) [14:52:31] papaul: SRX220H2? the old one again? [14:52:38] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:54:15] 10Operations, 10DBA, 10SRE-tools, 10Patch-For-Review: Audit all cumin queries in switchdc scripts - https://phabricator.wikimedia.org/T243935 (10Kormat) [14:54:35] (03Merged) 10jenkins-bot: mysql: Spruce up documentation formatting [software/spicerack] - 10https://gerrit.wikimedia.org/r/607786 (owner: 10Kormat) [14:55:13] (03PS1) 10Elukey: Add ipv6 AAAA/PTR records for an-launcher1002 [dns] - 10https://gerrit.wikimedia.org/r/607808 (https://phabricator.wikimedia.org/T256363) [14:55:14] XioNoX: yes the new one got stucked at Octeon srx_300_ram# so trying ti fix that [14:55:37] (03CR) 10jerkins-bot: [V: 04-1] Add ipv6 AAAA/PTR records for an-launcher1002 [dns] - 10https://gerrit.wikimedia.org/r/607808 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey) [14:56:18] XioNoX: turn it off unplug it to move it and plug it back and after boot i get stuck at that [14:56:54] papaul: tried a reboot I guess? 
[14:57:02] XioNoX: yes doing that [14:57:02] reading https://forums.juniper.net/t5/SRX-Services-Gateway/After-abrupt-power-loss-SRX300-stack-in-Octeon-srx-300-ram/td-p/306366 [14:57:55] (03PS2) 10Elukey: Add ipv6 AAAA/PTR records for an-launcher1002 [dns] - 10https://gerrit.wikimedia.org/r/607808 (https://phabricator.wikimedia.org/T256363) [14:58:24] (03CR) 10Dzahn: [C: 03+2] add logstash1030 and logstash1031 [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn) [14:58:49] (03CR) 10Elukey: [C: 03+2] Add ipv6 AAAA/PTR records for an-launcher1002 [dns] - 10https://gerrit.wikimedia.org/r/607808 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey) [14:59:12] mutante: ah snap, do we want to coordinate? [14:59:28] a clean power-off/power on might solve it [14:59:39] elukey: i already started the authdns-update but by patch is not actually merged.. you can try again now for yours [15:00:11] elukey: you should be free to merge now [15:00:31] ack ! [15:02:51] elukey: oh.. we even took the same IP.. i see :p [15:03:08] whattt [15:03:14] XioNoX: ok it is bsack up [15:03:34] elukey: we both saw the same "21" IP being free to take.. you can have it :) [15:03:55] i noticed because it needed manual rebase [15:04:20] papaul: nice [15:04:30] mutante: I am confused, I just added AAAA/PTR records [15:05:48] elukey: oh.. then it was somebody else who took it meanwhile. don't worry about it. i just need to fix it [15:06:10] mutante: ahhh fiiuuu, I thought something horrible happened :D good I can breathe again [15:06:35] XioNoX: it did boot up from backup so i am in the process of reinstalling Ju [15:06:40] elukey: no no.. it's ok :) [15:07:17] XioNoX: it did bootup in backupup mode so in the process of reinstalling Junos on it it will take a minute [15:08:17] papaul: if it's on the backup, a clean reboot might bring it back to primary. Ideally try to backup the config too. 
But a junos upgrade shouldn't impact it [15:08:38] XioNoX: ok doing a clean reboot [15:08:49] (03CR) 10Jbond: "still not sure this is right?" (032 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607794 (https://phabricator.wikimedia.org/T233950) (owner: 10Muehlenhoff) [15:08:59] (03PS3) 10Dzahn: add logstash1030 and logstash1031 [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139) [15:12:36] (03PS4) 10Dzahn: add logstash1030 and logstash1031 [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139) [15:13:09] (03CR) 10jerkins-bot: [V: 04-1] add logstash1030 and logstash1031 [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn) [15:13:16] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:15:06] (03PS5) 10Dzahn: add logstash1030 and logstash1031 [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139) [15:15:31] (03CR) 10jerkins-bot: [V: 04-1] add logstash1030 and logstash1031 [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn) [15:15:44] XioNoX: clean boot now [15:15:59] papaul: nice! [15:15:59] eveything back normal [15:16:09] incoming icinga alert flood for the mgmt's [15:16:12] but known then [15:16:50] (03CR) 10Jbond: [C: 03+2] apereo_cas: Enable SSL for DB connections [puppet] - 10https://gerrit.wikimedia.org/r/607793 (https://phabricator.wikimedia.org/T256113) (owner: 10Jbond) [15:16:57] papaul: is it racked at it's final location or you have to power it down again? 
[15:17:58] XioNoX: it is at final location no more powering down again [15:18:40] (03PS6) 10Dzahn: add logstash1030 and logstash1031 [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139) [15:18:57] cool [15:19:08] !log ppchelko@deploy1001 Started deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358 [15:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:13] T256358: wikifeeds usage increased by 3x on 2020-06-24 - https://phabricator.wikimedia.org/T256358 [15:20:07] papaul: I'm connected to it via oob, let me know if it's fully cabled [15:20:22] XioNoX: give me a minute [15:20:30] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:20:45] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358 (duration: 01m 37s) [15:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:44] (03PS1) 10Jbond: apereo_cas: set db dialect to MariaDBDialect [puppet] - 10https://gerrit.wikimedia.org/r/607813 [15:22:25] (03CR) 10Dzahn: [C: 03+2] add logstash1030 and logstash1031 [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn) [15:22:27] (03PS5) 10Reedy: Use structured logging fields for xff logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607389 [15:22:32] (03CR) 10Jbond: [C: 03+2] apereo_cas: set db dialect to MariaDBDialect [puppet] - 10https://gerrit.wikimedia.org/r/607813 (owner: 10Jbond) [15:22:52] (03PS6) 10Reedy: Use structured logging fields for xff logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607389 [15:23:00] jouncebot: now [15:23:00] No 
deployments scheduled for the next 0 hour(s) and 36 minute(s) [15:23:02] PROBLEM - Host ms-be2051.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:23:06] !log ppchelko@deploy1001 Started deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, take 2 [15:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:12] (03CR) 10Reedy: [C: 03+2] Use structured logging fields for xff logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607389 (owner: 10Reedy) [15:23:21] XioNoX: on all is back up new mr in place and all interfaces are up [15:23:31] yep, and I'm able to reach it [15:23:37] checking everything [15:23:38] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:23:38] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:23:42] PROBLEM - Host ms-be2053.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:23:42] PROBLEM - Host ms-be2055.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:24:01] (03Merged) 10jenkins-bot: Use structured logging fields for xff logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607389 (owner: 10Reedy) [15:25:05] (03CR) 10Dzahn: "> Yes eventually all should have v6, I think we're ok to add v6 for new instances and retroactively add v6 to existing hosts later" [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn) [15:25:40] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: structured logging for xff log, stop logging jobrunner requests (duration: 01m 05s) [15:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:56] Pchelolo: there was an interesting increase of 504s while you were deploying 
https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?panelId=15&fullscreen&orgId=1&from=now-30m&to=now [15:26:45] hm... [15:26:49] but indeed I see request rates to the various endpoints dropping now [15:27:02] and errors are no longer around, so \o/ [15:27:20] i was about to ACK that with the "wikifeeds 3x" ticket.. then it was already recovered [15:28:04] akosiaris: there was an interesting side effect to this that we probably need to mitigate [15:28:14] we set cache-control to feeds for 5 mins [15:28:29] and it seems like ALL caches are expiring simultaneously [15:28:33] creating a spike [15:28:49] now we had a lot of vary: headers, so a spike was huge [15:28:57] creating 429 on metrics endpoints [15:29:05] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [15:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:15] but even without it, the spike's probably there, just less visible [15:29:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops, 10cloud-services-team (Hardware): (Need By: 2020-06-12) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (10Jclark-ctr) @Cmjohnson host are cabled need to be configured. 
[15:29:33] I think I need to add some randomization to cache-control: max-age [15:29:44] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, take 2 (duration: 06m 38s) [15:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:48] T256358: wikifeeds usage increased by 3x on 2020-06-24 - https://phabricator.wikimedia.org/T256358 [15:30:17] !log ppchelko@deploy1001 Started deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups [15:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:30] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [15:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:51] !log upgrade ATS in codfw to version 8.0.8 [15:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:05] (03PS1) 10Elukey: Move all analytics timers but RU ones from an-launcher1001 to 1002 [puppet] - 10https://gerrit.wikimedia.org/r/607819 (https://phabricator.wikimedia.org/T256363) [15:31:07] it's very hard to deploy restbase during wikifeeds issue, the restbase checks include check to wikifeeds that keeps failing [15:31:12] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:32:05] ouch. [15:32:12] (03CR) 10jerkins-bot: [V: 04-1] Move all analytics timers but RU ones from an-launcher1001 to 1002 [puppet] - 10https://gerrit.wikimedia.org/r/607819 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey) [15:32:20] papaul: it all looks good to me, I ran homer which normalized the root password, ssh keys, etc... [15:32:37] XioNoX: cool thanks [15:32:47] papaul: thank you! great work! 
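Editor's sketch of the idea raised at 15:28-15:29: if every feed response carries the same 5-minute max-age, cached copies expire together and the refetches arrive as one spike. Adding a small random offset to max-age spreads the expiries out. This is not the restbase/wikifeeds change itself, only a minimal illustration; the function names and the 20% jitter fraction are assumptions made for the example.

```python
import random

BASE_MAX_AGE = 300  # seconds; the 5 minutes mentioned in the discussion


def jittered_max_age(base=BASE_MAX_AGE, jitter_fraction=0.2, rng=random):
    """Return a max-age drawn uniformly from base +/- (base * jitter_fraction).

    jitter_fraction is a hypothetical tuning knob, not a value from the log.
    """
    spread = round(base * jitter_fraction)
    return base + rng.randint(-spread, spread)


def cache_control_header(base=BASE_MAX_AGE, jitter_fraction=0.2, rng=random):
    """Build a Cache-Control header value with a randomized max-age."""
    return f"max-age={jittered_max_age(base, jitter_fraction, rng)}"


if __name__ == "__main__":
    # Each response gets a slightly different lifetime, so cached copies
    # stored at the same moment no longer expire at the same moment.
    for _ in range(5):
        print(cache_control_header())
```

A wider jitter fraction flattens the refetch spike further, at the cost of some responses being served stale for longer or refreshed sooner than the nominal 5 minutes.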
[15:32:49] XioNoX: will start the clen up [15:32:58] XioNoX: np [15:33:08] Pchelolo: it does seem like we are back to normal traffic levels btw. I 'll let it be for today and close the task tomorrow EU morning if everything checks out [15:33:35] ok akosiaris. I'll make a little followup with randomization of cache-control for feeds [15:33:41] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups (duration: 03m 24s) [15:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:46] !log ppchelko@deploy1001 Started deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups [15:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:08] (03PS1) 10Dzahn: site: add logstash1030/31 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/607821 (https://phabricator.wikimedia.org/T256139) [15:35:16] (03CR) 10Dzahn: [C: 03+2] site: add logstash1030/31 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/607821 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn) [15:35:23] (03PS2) 10Dzahn: site: add logstash1030/31 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/607821 (https://phabricator.wikimedia.org/T256139) [15:35:53] (03PS1) 10Ayounsi: New mr1-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/607823 [15:37:13] 10Operations, 10observability: Icinga refresh hardware selection (2020) - https://phabricator.wikimedia.org/T251644 (10Jclark-ctr) [15:37:24] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups (duration: 03m 38s) [15:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:28] T256358: wikifeeds usage increased by 3x on 2020-06-24 - https://phabricator.wikimedia.org/T256358 [15:37:47] 
!log ppchelko@deploy1001 Started deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups [15:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:21] (03CR) 10Ayounsi: [C: 03+2] New mr1-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/607823 (owner: 10Ayounsi) [15:40:38] (03PS2) 10Elukey: Move all analytics timers but RU ones from an-launcher1001 to 1002 [puppet] - 10https://gerrit.wikimedia.org/r/607819 (https://phabricator.wikimedia.org/T256363) [15:42:56] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups (duration: 05m 09s) [15:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:00] T256358: wikifeeds usage increased by 3x on 2020-06-24 - https://phabricator.wikimedia.org/T256358 [15:45:34] (03PS4) 10Dzahn: add logstash2030 and logstash2031 [dns] - 10https://gerrit.wikimedia.org/r/607637 (https://phabricator.wikimedia.org/T256139) [15:45:36] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add recommendation-api helmfile stanzas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [15:46:21] (03CR) 10Dzahn: "> I'm ok either with going ahead with this now and followup with a v6 patch later, or add v6 to this." 
[dns] - 10https://gerrit.wikimedia.org/r/607637 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn) [15:46:34] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:46:34] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [15:47:53] (03CR) 10Krinkle: [C: 03+2] logging: Fold Parsoid into type:mediawiki, add 'servergroup:' instead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606038 (https://phabricator.wikimedia.org/T255627) (owner: 10Krinkle) [15:48:42] (03Merged) 10jenkins-bot: logging: Fold Parsoid into type:mediawiki, add 'servergroup:' instead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606038 (https://phabricator.wikimedia.org/T255627) (owner: 10Krinkle) [15:49:00] (03CR) 10Dzahn: [C: 03+2] add logstash2030 and logstash2031 [dns] - 10https://gerrit.wikimedia.org/r/607637 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn) [15:49:06] * Krinkle testing on mwdebug1002 [15:49:22] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-fgiunchedi: VM requests for additional Logstash capacity - https://phabricator.wikimedia.org/T256139 (10Dzahn) a:03Dzahn [15:51:08] !log upgrade ATS in eqiad to version 8.0.8 [15:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:04] 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul) Offline script in Netbox will do this {F31905493} [15:53:28] (03CR) 10Krinkle: [C: 03+2] "Confimed this yields servergroup:api_appserver on mw1276 and servergroup:appserver on mw1274 and mwdebug1002 as example." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/606038 (https://phabricator.wikimedia.org/T255627) (owner: 10Krinkle) [15:53:56] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [15:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:53] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [15:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:15] !log krinkle@deploy1001 Synchronized wmf-config/logging.php: I4c519f88c613fc (duration: 01m 05s) [15:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:21] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-fgiunchedi: VM requests for additional Logstash capacity - https://phabricator.wikimedia.org/T256139 (10Dzahn) Creating VM logstash1030.eqiad.wmnet in cluster ganeti01.svc.eqiad.wmnet with row=D vcpus=4 memory=8GB disk=50GB l... [15:59:22] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] godog and _joe_: (Dis)respected human, time to deploy Puppet request window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200625T1600). Please do the needful. 
[16:00:53] nothing in the puppet window [16:01:31] stuff got merged anyways without the need for a time slot [16:02:06] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 39, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:02:33] (03CR) 10Dzahn: [C: 04-1] "Duplicate declaration: Group[jenkins] is already declared at (file: /srv/jenkins-workspace/puppet-compiler/23476/change/src/modules/jenkin" [puppet] - 10https://gerrit.wikimedia.org/r/607645 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [16:03:16] (03PS3) 10Elukey: Move all analytics timers but RU ones from an-launcher1001 to 1002 [puppet] - 10https://gerrit.wikimedia.org/r/607819 (https://phabricator.wikimedia.org/T256363) [16:03:24] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [16:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:27] mutante: could use help with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/606049/ and/or a pointer for who can help instead :) [16:04:43] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [16:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:50] (03CR) 10Elukey: [C: 03+2] Move all analytics timers but RU ones from an-launcher1001 to 1002 [puppet] - 10https://gerrit.wikimedia.org/r/607819 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey) [16:05:46] RECOVERY - Router interfaces on mr1-codfw is OK: OK: host 208.80.153.196, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:06:10] !log installing 4.9.210-1+deb9u1~deb8u1 on jessie hosts (fixed kernel for recent cacheoutattack CPU leaks) [16:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:33] Krinkle: please add me on 
Gerrit and i will get to it [16:07:46] ok :) [16:09:37] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [16:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:32] (03PS1) 10Andrew Bogott: Openstack Nova: enable soft affinity (and soft anti-affinity) server groups [puppet] - 10https://gerrit.wikimedia.org/r/607825 (https://phabricator.wikimedia.org/T253267) [16:11:07] (03PS8) 10Dzahn: jenkins: replace system user/group with systemd-sysuser [puppet] - 10https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) [16:11:25] (03Abandoned) 10Dzahn: admins: add system user for jenkins, reserve UID 903 [puppet] - 10https://gerrit.wikimedia.org/r/607645 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [16:12:05] (03CR) 10Dzahn: "this is now a single change at https://gerrit.wikimedia.org/r/c/operations/puppet/+/606286 to avoid the duplicate declaration issue" [puppet] - 10https://gerrit.wikimedia.org/r/607645 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [16:12:27] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [16:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:02] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Nova: enable soft affinity (and soft anti-affinity) server groups [puppet] - 10https://gerrit.wikimedia.org/r/607825 (https://phabricator.wikimedia.org/T253267) (owner: 10Andrew Bogott) [16:15:01] !log installing libxml2 security updates [16:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:19] !log I've deleted a "saved object" visualisation in logstash called "Production Errors & Deployments" which seemed to be corrupt and redirect random logstash dashboards to a management page. 
Backed up at https://phabricator.wikimedia.org/P11666 (NDA) [16:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [16:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [16:16:53] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [16:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:06] (03PS1) 10Andrew Bogott: Openstack Nova: enable soft affinity (and soft anti-affinity) server groups [puppet] - 10https://gerrit.wikimedia.org/r/607827 (https://phabricator.wikimedia.org/T253267) [16:18:37] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Nova: enable soft affinity (and soft anti-affinity) server groups [puppet] - 10https://gerrit.wikimedia.org/r/607827 (https://phabricator.wikimedia.org/T253267) (owner: 10Andrew Bogott) [16:20:28] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/23478/" [puppet] - 10https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [16:20:30] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [16:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:06] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add kubernetes[12]01[56] [homer/public] - 10https://gerrit.wikimedia.org/r/607754 (https://phabricator.wikimedia.org/T256236) (owner: 10Alexandros Kosiaris) [16:21:33] (03Merged) 10jenkins-bot: Add kubernetes[12]01[56] [homer/public] - 10https://gerrit.wikimedia.org/r/607754 (https://phabricator.wikimedia.org/T256236) (owner: 10Alexandros Kosiaris) [16:23:42] 10Operations, 10LDAP-Access-Requests: NDA for superset access request from WMDE employee danshick - 
https://phabricator.wikimedia.org/T254442 (10KFrancis) @ema The NDA/MOU has been added to the spreadsheet. Thanks! [16:25:20] (03PS1) 10Krinkle: logging: Use 'other' instead of '' as default servergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607829 [16:25:29] (03CR) 10Krinkle: [C: 03+2] logging: Use 'other' instead of '' as default servergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607829 (owner: 10Krinkle) [16:26:21] (03Merged) 10jenkins-bot: logging: Use 'other' instead of '' as default servergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607829 (owner: 10Krinkle) [16:28:03] !log krinkle@deploy1001 Synchronized wmf-config/logging.php: Ia6ef7617d378 (duration: 01m 02s) [16:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:52] (03CR) 10Dzahn: "This is what this does:" [puppet] - 10https://gerrit.wikimedia.org/r/606218 (https://phabricator.wikimedia.org/T255629) (owner: 10Dzahn) [16:30:04] 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul) [16:30:22] 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul) 05Open→03Resolved This is complete [16:30:49] 10Operations, 10ops-codfw, 10netops: codfw: Decommission old mr1 - https://phabricator.wikimedia.org/T256143 (10Papaul) [16:30:57] 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul) [16:30:59] 10Operations, 10ops-codfw, 10netops: codfw: Decommission old mr1 - https://phabricator.wikimedia.org/T256143 (10Papaul) 05Open→03Resolved Complete [16:31:45] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime [16:31:45] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:50] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:42] (03CR) 10Dzahn: "[mwdebug1001:~] $ curl -s --head localhost | grep Server:" [puppet] - 10https://gerrit.wikimedia.org/r/606218 (https://phabricator.wikimedia.org/T255629) (owner: 10Dzahn) [16:37:03] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:40:18] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [16:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:29] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: releases1002.eqiad.wmnet, kubernetes2015.codfw.wmnet, malmok.wikimedia.org, kubernetes2016.codfw.wmnet, releases2002.codfw.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [16:46:33] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:47:26] i will look at the releases* part of that alert above [16:48:21] sukhe: maybe you could check what it is on malmok? [16:48:45] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:49:25] mutante: I scheduled a downtime for it but I should have picked a longer interval. could it be related to that? 
I don't see anything on the host [16:49:49] that's a global check [16:49:54] sukhe: the claim is that it changes something on every single puppet run [16:49:55] doesn't depend on the host on icinga [16:50:20] malmok isn't the only host, it's just in the list [16:50:28] and at some point it gets globally over threshold [16:50:45] yea, unrelated to downtime [16:51:26] to be clear, so it's not included in the downtime of a host and all their services [16:51:49] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:52:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:52:29] re: releases* hosts.. it is because the puppet role does not support buster package names yet.. but i see from the Gerrit comments that i can just drop all those package installs. will do that later today [16:52:34] I am actually not sure what to look for and how to help debug, but I am happy to follow instructions [16:52:57] on malmok, it does say "The last Puppet run was at Wed Jun 24 11:47:43 UTC 2020 (1745 minutes ago)." which is not true of course [16:53:30] sukhe: do you want puppet to be disabled right now? [16:53:56] 1745 min ago is long.
yea [16:54:21] mutante: I am not making any changes on the host nor plan to, for today, so you can disable it if required [16:54:31] sukhe: the quickest thing to look at is puppetboard: https://puppetboard.wikimedia.org/node/malmok.wikimedia.org [16:54:36] sukhe: the opposite, i want to run it repeatedly [16:54:56] debugging would just mean running it multiple times and seeing if there is a thing it repeats each time or not [16:55:04] and look at the last few puppet runs in the bottom-left column [16:55:07] and it's weird that it has that number when it wasn't disabled [16:55:41] sukhe: ok, so it's not running because currently puppet code is broken ..not because it was disabled [16:56:07] I see. I am looking to see if I can find out why [16:56:17] i guess the icinga check must interpret that in the same way as "changes stuff on each run" for some reason [16:56:42] sukhe: same error as what we were chatting about yesterday [16:56:43] https://puppetboard.wikimedia.org/report/malmok.wikimedia.org/8f6768ec616cbce36c8773d7e9c4d53f4918b8fc [16:56:48] it seems to be the thing that jbond was debugging yesterday? [16:56:52] ack [16:57:05] ahhh thanks volans, I was trying to make the login work [16:58:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:02] that sounds like a pre-requisite :-P [17:00:04] halfak and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200625T1700). [17:00:32] but why is it failing when the change is not in production yet?
[17:01:27] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/profile/manifests/wikidough.pp#4 is there [17:02:19] sukhe: it probably failed after https://gerrit.wikimedia.org/r/c/labs/private/+/607769 [17:04:39] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: Renamed notebook1003 to an-launcher1002 - https://phabricator.wikimedia.org/T256397 (10elukey) [17:08:41] nevermind, not labs/private of course. but a change in the actual private repo that may have gone with it [17:09:12] fixing, and yeah, it was in private [17:11:20] cool! [17:11:53] * sukhe waits for the recovery [17:12:47] the individual puppet run alert on malmok won't be shown because it happened to be in downtime for other reasons [17:13:17] the global alert ..not sure if that gets us under the threshold yet since others have issues too.. but i am looking to fix 2 more [17:13:56] malmok has recovered now. thanks for the alert and help [17:14:15] yw, thanks [17:14:26] (03PS1) 10Dzahn: releases::mediawiki: only install PHP packages if pre-buster [puppet] - 10https://gerrit.wikimedia.org/r/607838 (https://phabricator.wikimedia.org/T247652) [17:17:11] (03PS1) 10Elukey: Remove hiera specific overrides for an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/607839 (https://phabricator.wikimedia.org/T256363) [17:17:36] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/23480/" [puppet] - 10https://gerrit.wikimedia.org/r/607838 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn) [17:18:07] (03CR) 10Dzahn: "hot fix to avoid broken puppet runs that trigger icinga alerts" [puppet] - 10https://gerrit.wikimedia.org/r/607838 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn) [17:18:37] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001 job=burrow partition={0,1} site=eqiad topic=udp_localhost-info 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=log [17:18:37] pic=All&var-consumer_group=All [17:20:03] (03CR) 10Elukey: [C: 03+2] Remove hiera specific overrides for an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/607839 (https://phabricator.wikimedia.org/T256363) (owner: 10Elukey) [17:21:06] (03CR) 10Dzahn: "quick fix for now: https://gerrit.wikimedia.org/r/c/operations/puppet/+/607838" [puppet] - 10https://gerrit.wikimedia.org/r/607641 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn) [17:24:00] (03CR) 10Hashar: "> We use the PHP version for docroot hosting for a few things still, don't we? doc1001 is for most things, but there's still… coverage rep" [puppet] - 10https://gerrit.wikimedia.org/r/607641 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn) [17:25:44] (03CR) 10Dzahn: [C: 03+2] "https://phabricator.wikimedia.org/T255629#6257233" [puppet] - 10https://gerrit.wikimedia.org/r/606218 (https://phabricator.wikimedia.org/T255629) (owner: 10Dzahn) [17:28:03] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:28:23] (03PS2) 10Dzahn: mediawiki::maintenance: add server-header config [puppet] - 10https://gerrit.wikimedia.org/r/606218 (https://phabricator.wikimedia.org/T255629) [17:30:33] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:32:59] (03CR) 10Dzahn: "affects only mwmaint*, not other mw*" [puppet] - 10https://gerrit.wikimedia.org/r/606218 (https://phabricator.wikimedia.org/T255629) (owner: 10Dzahn) [17:36:40] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me, but one comment should be clarified before merging to avoid breaking anything." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [17:37:43] !log mwmaint1002 - restarted apache2 to add server_headers snippet for T255629 - but not working as expected yet [17:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:20] T255629: The "Server: mw•" response header is missing on mwmaint/noc.wm.o - https://phabricator.wikimedia.org/T255629 [17:40:28] (03CR) 10Dzahn: "applied this and restarted apache2 on mwmaint1002 - but it does not show the difference yet because the security2 modules is not loaded he" [puppet] - 10https://gerrit.wikimedia.org/r/606218 (https://phabricator.wikimedia.org/T255629) (owner: 10Dzahn) [17:48:22] 10Operations, 10Gerrit, 10SRE-Access-Requests: Add qchris to `archiva-deployers` so he can upload artifacts for gerrit deployments - https://phabricator.wikimedia.org/T256404 (10QChris) [17:48:44] 10Operations, 10Gerrit, 10SRE-Access-Requests: Add qchris to `archiva-deployers` so he can upload artifacts for gerrit deployments - https://phabricator.wikimedia.org/T256404 (10QChris) [17:50:53] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [17:53:06] (03PS1) 10Dzahn: mediawiki::maintenance: load mod_security2 also on mwmaint*, not just mw* [puppet] - 10https://gerrit.wikimedia.org/r/607848 (https://phabricator.wikimedia.org/T255629) [18:00:04] RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200625T1800). [18:00:38] (03CR) 10Dzahn: [C: 03+1] jenkins: replace system user/group with systemd-sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [18:01:21] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:02:43] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:08:06] (03PS1) 10Urbanecm: Change bnwiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607852 (https://phabricator.wikimedia.org/T255328) [18:08:16] (03PS1) 10Dzahn: zuul: replace user/group with systemd-sysuser and reserved UID [puppet] - 10https://gerrit.wikimedia.org/r/607853 (https://phabricator.wikimedia.org/T224591) [18:08:50] (03CR) 10Dzahn: [C: 03+1] "also added to https://wikitech.wikimedia.org/wiki/UID and a message that one should use the admin module to reserve UIDs now" [puppet] - 10https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [18:09:08] (03CR) 10jerkins-bot: [V: 04-1] zuul: replace user/group with systemd-sysuser and reserved UID [puppet] - 10https://gerrit.wikimedia.org/r/607853 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [18:10:08] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:11:49] (03PS1) 10Dzahn: zuul: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/607854 [18:12:21] (03CR) 10jerkins-bot: [V: 04-1] zuul: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/607854 (owner: 10Dzahn) [18:13:23] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:13:57] (03PS2) 10Dzahn: zuul: replace user/group with systemd-sysuser and reserved UID [puppet] - 10https://gerrit.wikimedia.org/r/607853 (https://phabricator.wikimedia.org/T224591) [18:15:02] (03CR) 10jerkins-bot: [V: 04-1] zuul: replace user/group with systemd-sysuser and reserved UID [puppet] - 10https://gerrit.wikimedia.org/r/607853 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [18:18:05] (03PS1) 10Dzahn: releases::mediawiki: remove PHP packages [puppet] - 10https://gerrit.wikimedia.org/r/607858 [18:23:57] 10Operations, 10Gerrit, 10SRE-Access-Requests: Add qchris to `archiva-deployers` so he can upload artifacts for gerrit deployments - https://phabricator.wikimedia.org/T256404 (10Dzahn) p:05Triage→03High [18:24:20] (03PS2) 10Dzahn: zuul: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/607854 [18:24:55] (03CR) 10jerkins-bot: [V: 04-1] zuul: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/607854 (owner: 10Dzahn) [18:25:31] 10Operations, 10Gerrit, 10SRE-Access-Requests: Add qchris to `archiva-deployers` so he can upload artifacts for gerrit deployments - https://phabricator.wikimedia.org/T256404 (10Dzahn) a:03Dzahn [18:25:50] 10Operations, 10Gerrit, 10LDAP-Access-Requests: Add qchris to `archiva-deployers` so he can upload artifacts for gerrit deployments - https://phabricator.wikimedia.org/T256404 (10Dzahn) [18:32:50] 10Operations, 10ops-eqiad, 10netops: (Need by: 2019-09-30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10Cmjohnson) This is being worked on, I had to put the OS image back on the usb stick. When I reset the switch to factory default the usb was wiped as welll. 
[18:34:42] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Cmjohnson) a:05Cmjohnson→03Dzahn @Dzahn Could you try to image one of these and let me know if you see a setting missed. I am able to lo... [18:36:35] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:38:23] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:45:38] 10Operations, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10Halfak) [18:49:24] (03CR) 10Krinkle: [C: 03+1] "matches profile::mediawiki::httpd" [puppet] - 10https://gerrit.wikimedia.org/r/607848 (https://phabricator.wikimedia.org/T255629) (owner: 10Dzahn) [18:52:40] 10Operations, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10calbon) Public key for prod (different from all other keys): {F31905662} Preferred username: calbon [18:55:15] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=cloud_dev_pdns_rec site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:57:14] 10Operations, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - 
https://phabricator.wikimedia.org/T256412 (10elukey) [18:57:43] 10Operations, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10elukey) [18:58:02] !log LDAP - added qchris to archiva-deployers (T256404) [18:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:07] T256404: Add qchris to `archiva-deployers` so he can upload artifacts for gerrit deployments - https://phabricator.wikimedia.org/T256404 [18:58:37] 10Operations, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10elukey) No need for `statistics-privatedata-users`, the group has been decommed :) [18:58:54] 10Operations, 10Gerrit, 10LDAP-Access-Requests: Add qchris to `archiva-deployers` so he can upload artifacts for gerrit deployments - https://phabricator.wikimedia.org/T256404 (10Dzahn) 05Open→03Resolved done. a puppet change was not needed because qchris is existing shell user and gerrit-root [19:00:04] brennen and hashar: My dear minions, it's time we take the moon! Just kidding. Time for Mediawiki train - American+European Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200625T1900). [19:01:42] things seem reasonably calm, proceeding with deploy. [19:03:20] (03PS1) 10Brennen Bearnes: all wikis to 1.35.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607864 [19:03:22] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.35.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607864 (owner: 10Brennen Bearnes) [19:03:53] 10Operations, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10Halfak) Aha! Thanks for the cleanup. 
[19:04:11] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607864 (owner: 10Brennen Bearnes) [19:05:46] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.38 [19:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:40] 10Operations, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10Dzahn) nda and wmf are LDAP groups (that would be a separate phabricator tag, ldap-access-requests) while the other are shell groups. It's also eithe... [19:18:07] (03PS4) 10Dzahn: planet: replace system/user group with systemd-sysuser [puppet] - 10https://gerrit.wikimedia.org/r/606287 [19:24:36] 10Operations, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10calbon) [19:25:43] 10Operations, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10calbon) I don't understand everything Dzahn said but I removed the nda tag, I'm staff. [19:26:38] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10Dzahn) [19:27:41] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10Dzahn) @calbon Thanks, it's all good and that was right to do. I added another tag for being added to the LDAP group. And w... 
[19:29:45] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10Dzahn) [19:30:04] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10Dzahn) [19:30:43] !log repooling wdqs1007.eqiad.wmnet [19:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:03] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10Dzahn) I made some edits to the task description to clarify what is what kind of thing (LDAP / production shell / cloud (aka... [19:32:04] 10Operations, 10observability: improve cron spam visibility - https://phabricator.wikimedia.org/T84845 (10colewhite) [19:32:46] 10Operations, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Dzahn) [19:32:51] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Dzahn) 05Open→03Resolved Decided with Hash... [19:32:53] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Dzahn) [19:34:00] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar) Eventually I wanted to switch back to... 
[19:39:06] (03CR) 10Dzahn: [C: 04-1] "Could not find resource 'User[planet]' in parameter 'require'" [puppet] - 10https://gerrit.wikimedia.org/r/606287 (owner: 10Dzahn) [19:39:39] (03CR) 10Hashar: "The testsuite is broken, the spec run with the facts from the container OS instead of whatever distribution(s) we target :]" [puppet] - 10https://gerrit.wikimedia.org/r/607854 (owner: 10Dzahn) [19:40:46] (03PS5) 10Dzahn: planet: replace system/user group with systemd-sysuser [puppet] - 10https://gerrit.wikimedia.org/r/606287 [19:41:06] (03CR) 10jerkins-bot: [V: 04-1] planet: replace system/user group with systemd-sysuser [puppet] - 10https://gerrit.wikimedia.org/r/606287 (owner: 10Dzahn) [19:41:22] 10Operations, 10observability: improve cron spam visibility - https://phabricator.wikimedia.org/T84845 (10colewhite) An option we discussed recently was to ingest mail generated by the servers into Logstash by either pulling events from a mailbox or piping off events at the mail servers. Once in ES, queries c... [19:43:51] (03CR) 10Cwhite: [C: 03+1] "LGTM! Let us know when this is ready for deployment and we'll see it through." 
[puppet] - 10https://gerrit.wikimedia.org/r/606049 (https://phabricator.wikimedia.org/T255627) (owner: 10Krinkle) [19:47:51] (03CR) 10Dzahn: [C: 04-1] "achievement unlocked: "Illegal class reference"" [puppet] - 10https://gerrit.wikimedia.org/r/606287 (owner: 10Dzahn) [19:49:07] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:51:16] (03PS1) 10Hashar: zuul: set site/initsystem in rspec configuration [puppet] - 10https://gerrit.wikimedia.org/r/607867 [19:52:25] (03CR) 10Hashar: "The spec issue due to a missing initsystem should be fixed by https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/607867/ ." [puppet] - 10https://gerrit.wikimedia.org/r/607853 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [19:52:47] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:53:11] (03CR) 10Dzahn: [C: 03+2] zuul: set site/initsystem in rspec configuration [puppet] - 10https://gerrit.wikimedia.org/r/607867 (owner: 10Hashar) [19:53:45] (03CR) 10Krinkle: "It's ready :) Does today work?" [puppet] - 10https://gerrit.wikimedia.org/r/606049 (https://phabricator.wikimedia.org/T255627) (owner: 10Krinkle) [19:53:57] shdubsh: ^ :) [19:54:12] ack! 
[19:54:37] (CR) Dzahn: [V: +1 C: +2] zuul: set site/initsystem in rspec configuration [puppet] - https://gerrit.wikimedia.org/r/607867 (owner: Hashar) [19:55:15] (CR) Hashar: "we can dish out contint::composer since it needs php anyway ;) That is also one step toward stopping using that way of installing compose" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/607858 (owner: Dzahn) [19:55:36] (CR) Dzahn: [C: +1] mediawiki,logstash: Update type:parsoid-php -> type:mediawiki [puppet] - https://gerrit.wikimedia.org/r/606049 (https://phabricator.wikimedia.org/T255627) (owner: Krinkle) [19:57:43] (CR) Cwhite: [C: +2] mediawiki,logstash: Update type:parsoid-php -> type:mediawiki [puppet] - https://gerrit.wikimedia.org/r/606049 (https://phabricator.wikimedia.org/T255627) (owner: Krinkle) [20:00:34] (CR) Dzahn: [C: -1] planet: replace system/user group with systemd-sysuser (1 comment) [puppet] - https://gerrit.wikimedia.org/r/606287 (owner: Dzahn) [20:02:04] (PS6) Dzahn: planet: replace system/user group with systemd-sysuser [puppet] - https://gerrit.wikimedia.org/r/606287 [20:02:24] (CR) jerkins-bot: [V: -1] planet: replace system/user group with systemd-sysuser [puppet] - https://gerrit.wikimedia.org/r/606287 (owner: Dzahn) [20:13:32] Operations: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (Dzahn) [20:14:48] (PS1) Dzahn: DHCP: add logstash2030, logstash2031 [puppet] - https://gerrit.wikimedia.org/r/607872 (https://phabricator.wikimedia.org/T256139) [20:15:56] (CR) Dzahn: [C: +2] DHCP: add logstash2030, logstash2031 [puppet] - https://gerrit.wikimedia.org/r/607872 (https://phabricator.wikimedia.org/T256139) (owner: Dzahn) [20:23:13] Operations: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (Dzahn) recently fixed: icinga:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/606730 codesearch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/606735 dumps: https://gerrit.wikimedia.org/r... [20:25:30] Operations: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (Dzahn) [20:26:01] Operations: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (Dzahn) modules that still have a ferm::service as of today: acme_chief aptly base phabricator prometheus rsync scap service udp2log added check boxes [20:31:15] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [20:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:48] (PS1) Dzahn: partman: add logstash103[0-1] and logstash2003[0-1] [puppet] - https://gerrit.wikimedia.org/r/607875 (https://phabricator.wikimedia.org/T256139) [20:35:22] (CR) Dzahn: [C: +2] partman: add logstash103[0-1] and logstash2003[0-1] [puppet] - https://gerrit.wikimedia.org/r/607875 (https://phabricator.wikimedia.org/T256139) (owner: Dzahn) [20:40:11] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T255520 (Jclark-ctr) a:Jclark-ctr→Cmjohnson [20:40:51] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install 3 lightweight hadoop nodes - https://phabricator.wikimedia.org/T255518 (Jclark-ctr) a:elukey→Cmjohnson [20:41:25] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install 3 lightweight hadoop nodes - https://phabricator.wikimedia.org/T255518 (Jclark-ctr) added to netbox [20:42:00] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T255520 (Jclark-ctr) added to netbox [20:42:17] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 38.82 le 60
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:42:31] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T255520 (Jclark-ctr) [20:43:19] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install 3 lightweight hadoop nodes - https://phabricator.wikimedia.org/T255518 (Jclark-ctr) [20:46:47] Operations, ops-eqiad, DC-Ops, netops, cloud-services-team (Hardware): (Need By: 2020-06-12) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (Jclark-ctr) @ayounsi switches are cabled and powered waiting on configuration [20:50:53] PROBLEM - HTTPS-dbtree on dbmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [20:52:33] RECOVERY - HTTPS-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 92408 bytes in 1.756 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [20:54:19] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [20:54:57] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:57:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [20:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:19] Operations, LDAP-Access-Requests, SRE-Access-Requests: Onboarding Wolfgang Kandek - https://phabricator.wikimedia.org/T249352 (Dzahn) [21:31:38] @seen andre_ [21:31:38] mutante: Last time I saw andre_ they were leaving the channel #wikibooks-es at 3/20/2020 1:19:55 PM (97d8h11m42s ago) [21:31:46] @seen andre__ [21:31:46] mutante: Last time I saw andre__ they were quitting the network with reason: Quit: Out. N/A at 6/19/2020 5:55:09 PM (6d3h36m37s ago) [21:42:14] Operations, LDAP-Access-Requests, SRE-Access-Requests: Onboarding Wolfgang Kandek - https://phabricator.wikimedia.org/T249352 (CDanis) [21:57:51] (PS2) Dave Pifke: [WIP] arclamp: Deploy from scap [puppet] - https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) [22:00:37] PROBLEM - Host kubestagetcd1004 is DOWN: PING CRITICAL - Packet loss = 100% [22:12:36] (PS3) Dave Pifke: [WIP] arclamp: Deploy from scap [puppet] - https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) [22:25:50] !log puppetmaster - signing certs and initial run for logstash2030/2031 - no prod role yet [22:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:23] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [22:29:03] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [22:33:45] (PS1) Dzahn: site/DHCP: add
logstash[1]203[12] [puppet] - https://gerrit.wikimedia.org/r/607895 (https://phabricator.wikimedia.org/T256139) [22:37:02] (PS2) Dzahn: site/DHCP: add logstash[1]203[12] [puppet] - https://gerrit.wikimedia.org/r/607895 (https://phabricator.wikimedia.org/T256139) [22:41:14] (PS3) Dzahn: site/DHCP: add logstash[12]03[01] [puppet] - https://gerrit.wikimedia.org/r/607895 (https://phabricator.wikimedia.org/T256139) [22:46:41] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (Edtadros) [22:49:25] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:50:24] Operations, ops-codfw, SRE-swift-storage: 3 ms-be mgmt interfaces not back after mgmt switch maintenance - https://phabricator.wikimedia.org/T256436 (Dzahn) [22:51:22] Operations, ops-codfw, SRE-swift-storage: 3 ms-be mgmt interfaces not back after mgmt switch maintenance - https://phabricator.wikimedia.org/T256436 (Dzahn) {F31905905} [22:51:48] Operations, ops-codfw, SRE-swift-storage: 3 ms-be mgmt interfaces not back after mgmt switch maintenance - https://phabricator.wikimedia.org/T256436 (Dzahn) p:Triage→Medium [22:52:12] ACKNOWLEDGEMENT - Host ms-be2051.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T256436 [22:52:12] ACKNOWLEDGEMENT - Host ms-be2053.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T256436 [22:52:12] ACKNOWLEDGEMENT - Host ms-be2055.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T256436 [22:53:43] Operations, ops-codfw, SRE-swift-storage: 3 ms-be mgmt interfaces not back after mgmt switch maintenance -
https://phabricator.wikimedia.org/T256436 (Papaul) a:Papaul [22:54:11] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [22:54:57] ACKNOWLEDGEMENT - Host kubestagetcd1004 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T244530#6256018 [22:56:08] Operations, ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (Dzahn) kubestagetcd1004 is still down. not sure if desired. ACKed in Icinga. [22:57:11] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/23485/" [puppet] - https://gerrit.wikimedia.org/r/607895 (https://phabricator.wikimedia.org/T256139) (owner: Dzahn) [22:59:35] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200625T2300). [23:04:59] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:26:31] (PS4) Dave Pifke: [WIP] arclamp: Deploy from scap [puppet] - https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) [23:31:17] (PS5) Dave Pifke: [WIP] arclamp: Deploy from scap [puppet] - https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) [23:34:15] (PS6) Dave Pifke: [WIP] arclamp: Deploy from scap [puppet] - https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) [23:37:24] !log puppetmaster - signing certs and initial puppet run for logstash1030/logstash1031 - no prod role yet [23:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:48] Operations, Wikimedia-Logstash, observability, User-fgiunchedi: VM requests for additional Logstash capacity - https://phabricator.wikimedia.org/T256139 (Dzahn) @fgiunchedi 4 VMs have been created. OS has been installed. They have been added to puppet with the "insetup" role. IPv6 records ha... [23:46:55] (CR) Ryan Kemper: [C: +2] Consolidate query_service profile duplication [puppet] - https://gerrit.wikimedia.org/r/599146 (owner: EBernhardson) [23:47:28] (CR) Ryan Kemper: [C: +2] "Sorry for the delay in getting to this. PCC looks good: https://puppet-compiler.wmflabs.org/compiler1003/23486/wdqs1006.eqiad.wmnet/index."
[puppet] - https://gerrit.wikimedia.org/r/599146 (owner: EBernhardson) [23:52:43] Operations, Wikimedia-Logstash, observability, serviceops, User-fgiunchedi: VM requests for additional Logstash capacity - https://phabricator.wikimedia.org/T256139 (Dzahn) [23:53:00] Operations, Wikimedia-Logstash, observability, User-fgiunchedi: move 4 new logstash VMs into production - https://phabricator.wikimedia.org/T256443 (Dzahn) [23:54:46] Operations, Wikimedia-Logstash, observability, User-fgiunchedi: move 4 new logstash VMs into production - https://phabricator.wikimedia.org/T256443 (Dzahn) [23:55:22] Operations, Wikimedia-Logstash, observability, User-fgiunchedi: move 4 new logstash VMs into production - https://phabricator.wikimedia.org/T256443 (Dzahn) [23:55:36] Operations, Wikimedia-Logstash, observability, serviceops, User-fgiunchedi: VM requests for additional Logstash capacity - https://phabricator.wikimedia.org/T256139 (Dzahn) [23:55:41] Operations, Wikimedia-Logstash, observability, User-fgiunchedi: move 4 new logstash VMs into production - https://phabricator.wikimedia.org/T256443 (Dzahn) [23:55:43] Operations, Wikimedia-Logstash, observability, serviceops, User-fgiunchedi: VM requests for additional Logstash capacity - https://phabricator.wikimedia.org/T256139 (Dzahn) Open→Resolved [23:55:47] Operations, Wikimedia-Logstash, observability, Patch-For-Review, User-fgiunchedi: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 (Dzahn) [23:58:05] (CR) Krinkle: [WIP] arclamp: Deploy from scap (1 comment) [puppet] - https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) (owner: Dave Pifke) [23:58:12] Operations, Wikimedia-Logstash, observability, Patch-For-Review, User-fgiunchedi: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 (Dzahn) [23:58:14]
Operations, Wikimedia-Logstash, observability, User-fgiunchedi: move 4 new logstash VMs into production - https://phabricator.wikimedia.org/T256443 (Dzahn) [23:58:56] (CR) Ryan Kemper: [C: +2] "PCC looks good." [puppet] - https://gerrit.wikimedia.org/r/607542 (https://phabricator.wikimedia.org/T252913) (owner: Herron) [23:59:21] Operations, Wikimedia-Logstash, observability, Patch-For-Review, User-fgiunchedi: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 (Dzahn) There are 4 new ganeti VMs now, 2 in eqiad and 2 in codfw, in row D each. They are ready to be taken into product...