[00:13:57] (03PS1) 10Dave Pifke: arclamp: Run & scrape Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/613359 (https://phabricator.wikimedia.org/T256035) [00:16:13] (03PS1) 10Reedy: Remove line saying ldaplist will be removed 30 August 2016 [puppet] - 10https://gerrit.wikimedia.org/r/613360 [00:51:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:30:30] 10Operations, 10Performance-Team, 10serviceops, 10Sustainability (Incident Prevention): Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10aaron) We don't really need purges to go to the gutter cache, given the low TTL there. Lost purges during con... [03:15:27] RECOVERY - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [04:37:47] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:39:37] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:46:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1092', diff saved to https://phabricator.wikimedia.org/P11933 and previous config saved to /var/cache/conftool/dbconfig/20200717-044658-marostegui.json [04:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:47:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1111', diff saved to https://phabricator.wikimedia.org/P11934 and previous config saved to /var/cache/conftool/dbconfig/20200717-044748-marostegui.json [04:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:50:00] 10Operations, 10DBA, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10Marostegui) @jbond anything pending here or can this be closed? [04:58:53] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [05:00:39] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [05:01:55] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10Marostegui) a:03Kormat [05:03:15] (03PS1) 10Marostegui: db1124: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/613463 [05:04:01] (03CR) 10Marostegui: [C: 03+2] db1124: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/613463 (owner: 10Marostegui) [05:06:45] (03PS1) 10Marostegui: db1084: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/613466 (https://phabricator.wikimedia.org/T257540) [05:08:04] 
(03CR) 10Marostegui: [C: 03+2] db1084: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/613466 (https://phabricator.wikimedia.org/T257540) (owner: 10Marostegui) [05:15:22] 10Operations, 10DBA, 10Sustainability (Incident Prevention): Disallow 'weight: 0' for MW db config in dbctl - https://phabricator.wikimedia.org/T239901 (10Marostegui) 05Open→03Declined Going to close this as declined for now, as looks like we are not going to proceed with this so far. [05:24:25] 10Operations, 10DBA, 10Traffic, 10Patch-For-Review: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462 (10Marostegui) a:05Rduran→03None [05:26:19] 10Operations, 10DBA, 10Epic, 10Performance-Team (Radar), 10Sustainability (Incident Prevention): Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (10Marostegui) [06:14:06] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 59 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:18:20] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:19:56] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 46 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:20:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:28:19] !log rename msw1-eqiad interface range [06:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:52] !log rename msw1-codfw interface range [06:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:14] (03PS2) 10Ayounsi: Add msw interfaces support [homer/public] - 10https://gerrit.wikimedia.org/r/606375 [06:38:24] (03CR) 10Elukey: [C: 03+1] Add an-tool1008 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/613141 (owner: 10Muehlenhoff) [06:38:37] (03CR) 10Ayounsi: [C: 03+2] Add msw interfaces support [homer/public] - 10https://gerrit.wikimedia.org/r/606375 (owner: 10Ayounsi) [06:39:00] (03Merged) 10jenkins-bot: Add msw interfaces support [homer/public] - 10https://gerrit.wikimedia.org/r/606375 (owner: 10Ayounsi) [06:40:34] (03CR) 10Elukey: [C: 03+1] Add separate role for Yarn [puppet] - 10https://gerrit.wikimedia.org/r/613156 (https://phabricator.wikimedia.org/T258152) (owner: 10Muehlenhoff) [06:43:16] 10Operations, 10Analytics-Clusters, 10Discovery-Search, 10vm-requests: VM request for Analytics -> Elastic Search ML models update - https://phabricator.wikimedia.org/T258189 (10elukey) [06:47:25] (03CR) 10Muehlenhoff: debianization (032 comments) [debs/grafana-loki] (debian/sid) - 10https://gerrit.wikimedia.org/r/610864 (owner: 10Cwhite) [06:50:56] 10Operations, 10Analytics-Clusters, 10Discovery-Search, 10vm-requests: VM request for Analytics -> Elastic Search ML models update - https://phabricator.wikimedia.org/T258189 (10elukey) @RKemper I am following https://wikitech.wikimedia.org/wiki/Ganeti#Create_a_VM, I'll list all the steps/details etc.. in... [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200717T0700) [07:00:41] 10Operations, 10Analytics-Clusters, 10Discovery-Search, 10vm-requests: VM request for Analytics -> Elastic Search ML models update - https://phabricator.wikimedia.org/T258189 (10elukey) ` elukey@ganeti1011:~$ sudo gnt-group list Group Nodes Instances AllocPolicy NDParams row_A 4 39 preferred o... [07:03:54] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:03:58] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 50, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:04:20] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 130, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:04:47] this must be Telia's maintenance [07:05:17] yep seems so from the gcal [07:09:56] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:11:24] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:11:26] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:17:49] (03PS1) 10Elukey: Add PTR/A/AAAA records for search-loader[12]001 VMs [dns] - 10https://gerrit.wikimedia.org/r/613581 
(https://phabricator.wikimedia.org/T258189) [07:18:19] (03CR) 10Muehlenhoff: [C: 03+2] Add separate role for Yarn [puppet] - 10https://gerrit.wikimedia.org/r/613156 (https://phabricator.wikimedia.org/T258152) (owner: 10Muehlenhoff) [07:20:43] (03CR) 10Elukey: [C: 03+2] Add PTR/A/AAAA records for search-loader[12]001 VMs [dns] - 10https://gerrit.wikimedia.org/r/613581 (https://phabricator.wikimedia.org/T258189) (owner: 10Elukey) [07:23:17] (03CR) 10Alexandros Kosiaris: [C: 03+2] mobileapps: add request template for load.php requests to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/613237 (https://phabricator.wikimedia.org/T258186) (owner: 10Mholloway) [07:24:21] (03Merged) 10jenkins-bot: mobileapps: add request template for load.php requests to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/613237 (https://phabricator.wikimedia.org/T258186) (owner: 10Mholloway) [07:24:28] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:24:30] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:28:18] (03PS1) 10Muehlenhoff: Apply Yarn role to an-tool1008 [puppet] - 10https://gerrit.wikimedia.org/r/613585 [07:28:43] (03CR) 10jerkins-bot: [V: 04-1] Apply Yarn role to an-tool1008 [puppet] - 10https://gerrit.wikimedia.org/r/613585 (owner: 10Muehlenhoff) [07:30:28] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm [07:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:06] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . 
[07:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:20] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 130, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:33:20] (03PS2) 10Muehlenhoff: Apply Yarn role to an-tool1008 [puppet] - 10https://gerrit.wikimedia.org/r/613585 (https://phabricator.wikimedia.org/T258152) [07:33:34] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [07:33:34] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [07:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:12] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:34:23] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [07:34:24] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [07:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:14] (03CR) 10JMeybohm: [C: 03+1] "Nice. Thanks for working this out!" 
[puppet] - 10https://gerrit.wikimedia.org/r/609403 (owner: 10Alexandros Kosiaris) [07:43:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1111', diff saved to https://phabricator.wikimedia.org/P11935 and previous config saved to /var/cache/conftool/dbconfig/20200717-074335-marostegui.json [07:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:22] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 130, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:49:38] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 67 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:50:30] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:50:34] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:50:54] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:51:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1104', diff saved to https://phabricator.wikimedia.org/P11936 and previous config saved to /var/cache/conftool/dbconfig/20200717-075124-marostegui.json [07:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:20] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm 
(exit_code=0) [07:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:30] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 45 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:59:29] (03PS2) 10ZPapierski: add logout config for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/612681 (https://phabricator.wikimedia.org/T257314) (owner: 10Mstyles) [08:05:23] 10Operations, 10Analytics-Clusters, 10Discovery-Search, 10vm-requests, 10Patch-For-Review: VM request for Analytics -> Elastic Search ML models update - https://phabricator.wikimedia.org/T258189 (10elukey) ` elukey@cumin1001:~$ sudo cookbook sre.ganeti.makevm eqiad_D search-loader1001.eqiad.wmnet --vcpus... [08:05:53] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm [08:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:46] (03PS3) 10ZPapierski: add logout config for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/612681 (https://phabricator.wikimedia.org/T257314) (owner: 10Mstyles) [08:25:50] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Sigh, icinga1001 isn't picking up the changes (https://puppet-compiler.wmflabs.org/compiler1003/23917/) This needs some more work." 
[puppet] - 10https://gerrit.wikimedia.org/r/609403 (owner: 10Alexandros Kosiaris) [08:27:02] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:29:04] 10Operations, 10DBA, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10jbond) 05Open→03Resolved nothing pending on this task, resolving and thanks [08:29:06] 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: CAS Store U2f tokens in a database - https://phabricator.wikimedia.org/T256113 (10jbond) [08:29:28] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [08:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:39] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/613146 (https://phabricator.wikimedia.org/T247967) (owner: 10Muehlenhoff) [08:30:48] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:30:59] (03CR) 10Lucas Werkmeister (WMDE): "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611393 (owner: 10Lucas Werkmeister (WMDE)) [08:32:13] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/612514 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [08:33:26] 10Operations, 10Analytics-Clusters, 10Discovery-Search, 10vm-requests, 10Patch-For-Review: VM request for Analytics -> Elastic Search ML models update - https://phabricator.wikimedia.org/T258189 (10elukey) ` elukey@cumin1001:~$ sudo cookbook sre.ganeti.makevm codfw_D search-loader2001.codfw.wmnet --vcpus... [08:34:56] (03PS1) 10Elukey: Introduce search-loader[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/613599 (https://phabricator.wikimedia.org/T258189) [08:34:58] (03PS10) 10Jbond: profile::mediawiki::mcrouter_wancache: refactor [puppet] - 10https://gerrit.wikimedia.org/r/612514 (https://phabricator.wikimedia.org/T247956) [08:38:04] (03PS2) 10Elukey: Introduce search-loader[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/613599 (https://phabricator.wikimedia.org/T258189) [08:38:13] (03CR) 10Elukey: [C: 03+1] prometheus::memcached_exporter: fix arguments hiera call [puppet] - 10https://gerrit.wikimedia.org/r/612507 (owner: 10Jbond) [08:38:53] (03PS15) 10Jbond: P:mediawiki::mcrouter_wancache: refactor parameters [puppet] - 10https://gerrit.wikimedia.org/r/612523 (https://phabricator.wikimedia.org/T247956) [08:39:04] (03PS4) 10Ema: dnsdist: reload the certificates instead of restarting the service [puppet] - 10https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [08:39:17] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) (owner: 
10Ssingh) [08:39:45] (03PS1) 10Muehlenhoff: Bump changelog for buster rebuild [debs/prometheus-atlas-exporter] - 10https://gerrit.wikimedia.org/r/613600 [08:40:38] (03PS9) 10Jbond: mcrouter: store defaults in module not in hiera [puppet] - 10https://gerrit.wikimedia.org/r/612532 (https://phabricator.wikimedia.org/T247956) [08:40:50] (03CR) 10Elukey: [C: 03+2] Introduce search-loader[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/613599 (https://phabricator.wikimedia.org/T258189) (owner: 10Elukey) [08:41:27] (03CR) 10Jbond: [C: 03+2] prometheus::memcached_exporter: fix arguments hiera call [puppet] - 10https://gerrit.wikimedia.org/r/612507 (owner: 10Jbond) [08:46:38] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Bump changelog for buster rebuild [debs/prometheus-atlas-exporter] - 10https://gerrit.wikimedia.org/r/613600 (owner: 10Muehlenhoff) [08:48:28] !log imported prometheus-atlas-exporter 1.0+git20191204.ffafab7-2 to buster-wikimedia T247967 [08:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:33] T247967: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 [08:57:11] (03PS1) 10Jbond: debug_host: add entry point for debug_host script [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/613601 [08:58:10] (03CR) 10Jbond: [C: 03+2] debug_host: add entry point for debug_host script [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/613601 (owner: 10Jbond) [08:59:17] (03PS3) 10Muehlenhoff: Apply Yarn role to an-tool1008 [puppet] - 10https://gerrit.wikimedia.org/r/613585 (https://phabricator.wikimedia.org/T258152) [09:01:30] (03CR) 10Ema: dnsdist: reload the certificates instead of restarting the service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [09:02:04] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:03:54] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:04:13] (03PS6) 10Ema: varnish: handle checks at healthcheck.wm.org/varnish-fe too [puppet] - 10https://gerrit.wikimedia.org/r/610046 (https://phabricator.wikimedia.org/T255015) [09:04:39] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/610046 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [09:06:58] (03CR) 10Muehlenhoff: [C: 03+2] Apply Yarn role to an-tool1008 [puppet] - 10https://gerrit.wikimedia.org/r/613585 (https://phabricator.wikimedia.org/T258152) (owner: 10Muehlenhoff) [09:09:58] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:10:48] (03PS1) 10Addshore: Revert "admins: Setup home dir for addshore" [puppet] - 10https://gerrit.wikimedia.org/r/613602 [09:11:36] (03PS2) 10Addshore: Revert "admins: Setup home dir for addshore" [puppet] - 10https://gerrit.wikimedia.org/r/613602 [09:12:00] !log elukey@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw1280.eqiad.wmnet [09:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:42] this is a mistake sorry [09:14:46] q [09:15:02] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1280.eqiad.wmnet [09:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:34] RECOVERY - MediaWiki exceptions and fatals per 
minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:20:01] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10akosiaris) >>! In T244530#6313254, @Dzahn wrote: > @Jclark-ctr The mgmt interface of ganeti1008 just went down. Could you please check the cable? Looks ok now in icinga, so this was transient... [09:22:47] 10Operations, 10Analytics-Clusters, 10Discovery-Search, 10vm-requests, 10Patch-For-Review: VM request for Analytics -> Elastic Search ML models update - https://phabricator.wikimedia.org/T258189 (10elukey) VMs ready! [09:23:17] (03PS1) 10Muehlenhoff: Require profile::hadoop::common in the yarn_proxy profile [puppet] - 10https://gerrit.wikimedia.org/r/613603 (https://phabricator.wikimedia.org/T258152) [09:23:31] (03CR) 10Ema: [C: 03+2] varnish: handle checks at healthcheck.wm.org/varnish-fe too [puppet] - 10https://gerrit.wikimedia.org/r/610046 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [09:23:42] (03CR) 10Elukey: [C: 03+1] Require profile::hadoop::common in the yarn_proxy profile [puppet] - 10https://gerrit.wikimedia.org/r/613603 (https://phabricator.wikimedia.org/T258152) (owner: 10Muehlenhoff) [09:23:55] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10akosiaris) 05Open→03Resolved Host added back in the cluster, VMs being migrated back to it. @Jclark-ctr many thanks! 
[09:24:24] RECOVERY - Host etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 3.09 ms [09:24:42] RECOVERY - Host kubetcd1005 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [09:25:56] (03PS2) 10Muehlenhoff: Require profile::hadoop::common in the yarn_proxy profile [puppet] - 10https://gerrit.wikimedia.org/r/613603 (https://phabricator.wikimedia.org/T258152) [09:25:57] 10Operations, 10Analytics-Clusters, 10Discovery-Search, 10vm-requests, 10Patch-For-Review: VM request for Analytics -> Elastic Search ML models update - https://phabricator.wikimedia.org/T258189 (10elukey) 05Open→03Resolved a:03elukey [09:29:24] (03CR) 10Elukey: [C: 03+2] Require profile::hadoop::common in the yarn_proxy profile [puppet] - 10https://gerrit.wikimedia.org/r/613603 (https://phabricator.wikimedia.org/T258152) (owner: 10Muehlenhoff) [09:40:37] (03PS5) 10Elukey: Add custom ferm srange to Kafka Jumbo brokers [puppet] - 10https://gerrit.wikimedia.org/r/611168 [09:43:30] (03CR) 10Elukey: [C: 04-2] "Don't merge until https://phabricator.wikimedia.org/T258245 is done" [puppet] - 10https://gerrit.wikimedia.org/r/611168 (owner: 10Elukey) [09:44:58] (03PS3) 10Addshore: Update several Wikidata-related configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612918 (owner: 10Matěj Suchánek) [09:45:08] (03CR) 10Addshore: [C: 03+1] "Looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612918 (owner: 10Matěj Suchánek) [09:50:33] PROBLEM - Host db1145 is DOWN: PING CRITICAL - Packet loss = 100% [09:50:49] errr [09:50:49] what? [09:51:15] jynus: ^ that is a backupsource [09:51:50] (03PS1) 10Muehlenhoff: Switch over Yarn to an-tool1008 in ATS [puppet] - 10https://gerrit.wikimedia.org/r/613606 (https://phabricator.wikimedia.org/T258152) [09:51:54] It is rebooting [09:52:04] UEFI0079: One or more uncorrectable Memory errors occurred in the previous [09:52:04] boot. 
[09:52:04] Check the System Event Log (SEL) to identify the non-functional DIMM, and then [09:53:22] (03PS1) 10Jbond: DO NOT MERGE: testing [puppet] - 10https://gerrit.wikimedia.org/r/613607 [09:53:32] 10Operations, 10ops-eqiad, 10DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10Marostegui) [09:53:39] (03PS1) 10Arturo Borrero Gonzalez: cloud: introduce very basic nftables setup in cloudnet servers [puppet] - 10https://gerrit.wikimedia.org/r/613608 [09:54:11] (03CR) 10Jbond: [C: 04-2] "testing PCC" [puppet] - 10https://gerrit.wikimedia.org/r/613607 (owner: 10Jbond) [09:55:12] 10Operations, 10ops-eqiad, 10DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10Marostegui) More HW logs - we've got a broken DIMM apparently ` Record: 1 Date/Time: 03/30/2020 23:27:03 Source: system Severity: Ok Description: Log cleared. -----------------------... [09:55:56] (03PS1) 10Marostegui: db1145: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/613609 (https://phabricator.wikimedia.org/T258249) [09:56:23] (03CR) 10Marostegui: [C: 03+2] db1145: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/613609 (https://phabricator.wikimedia.org/T258249) (owner: 10Marostegui) [09:56:45] RECOVERY - Host db1145 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [09:58:28] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10Marostegui) I have hit F1, to continue its boot, so we can see the OS. The memory on the OS looks ok: ` root@db1145:~# free -g total used free... 
[09:58:48] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10Marostegui) p:05Triage→03Medium [09:59:23] PROBLEM - MariaDB read only s5 on db1145 is CRITICAL: Could not connect to localhost:3315 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [09:59:43] PROBLEM - mysqld processes on db1145 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [09:59:45] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10Marostegui) [10:00:41] (03Abandoned) 10Jbond: DO NOT MERGE: testing [puppet] - 10https://gerrit.wikimedia.org/r/613607 (owner: 10Jbond) [10:02:35] (03PS1) 10Jbond: DO NOT MERGE: Testing pcc [puppet] - 10https://gerrit.wikimedia.org/r/613610 [10:04:24] (03PS2) 10Arturo Borrero Gonzalez: cloud: introduce very basic nftables setup in cloudnet servers [puppet] - 10https://gerrit.wikimedia.org/r/613608 [10:06:52] (03PS1) 10Muehlenhoff: Disable backports on stretch [puppet] - 10https://gerrit.wikimedia.org/r/613611 (https://phabricator.wikimedia.org/T256877) [10:08:37] (03Abandoned) 10Muehlenhoff: Stop including backports on Stretch production hosts [puppet] - 10https://gerrit.wikimedia.org/r/612612 (https://phabricator.wikimedia.org/T256881) (owner: 10Muehlenhoff) [10:09:26] (03CR) 10Muehlenhoff: [C: 03+2] Install the fcgid package on Netmon [puppet] - 10https://gerrit.wikimedia.org/r/613146 (https://phabricator.wikimedia.org/T247967) (owner: 10Muehlenhoff) [10:11:34] (03CR) 10Muehlenhoff: "Superceded by https://gerrit.wikimedia.org/r/c/operations/puppet/+/612865 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/613146" [puppet] - 10https://gerrit.wikimedia.org/r/611318 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [10:15:36] (03PS3) 10Arturo Borrero Gonzalez: 
cloud: introduce very basic nftables setup in cloudnet servers [puppet] - 10https://gerrit.wikimedia.org/r/613608 [10:17:39] (03CR) 10Arturo Borrero Gonzalez: "PCC https://puppet-compiler.wmflabs.org/compiler1002/23949/" [puppet] - 10https://gerrit.wikimedia.org/r/613608 (owner: 10Arturo Borrero Gonzalez) [10:18:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: introduce very basic nftables setup in cloudnet servers [puppet] - 10https://gerrit.wikimedia.org/r/613608 (owner: 10Arturo Borrero Gonzalez) [10:27:34] (03PS1) 10Volans: setup.py: add upper limit to prospector version [software/homer] - 10https://gerrit.wikimedia.org/r/613612 [10:27:36] (03PS1) 10Volans: netbox: make Netbox errors surface through Jinja [software/homer] - 10https://gerrit.wikimedia.org/r/613613 [11:06:17] 10Operations, 10ops-eqiad, 10DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10jcrespo) I've started a restore process from yesterday's backups. [11:11:23] (03CR) 10Muehlenhoff: "Approach looks fine, sans the nits left by Ema and myself" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [11:19:45] (03CR) 10Ssingh: "> Patch Set 4:" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [11:21:13] (03PS5) 10Ssingh: dnsdist: reload the certificates instead of restarting the service [puppet] - 10https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) [11:24:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1104', diff saved to https://phabricator.wikimedia.org/P11937 and previous config saved to /var/cache/conftool/dbconfig/20200717-112413-marostegui.json [11:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:27] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1002/23950/malmok.wikimedia.org/index.html" [puppet] - 
10https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [11:27:47] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:27:56] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on icinga1001 is CRITICAL: 0.2188 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [11:28:07] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [11:28:53] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [11:29:46] I think it is the server I just repooled, the spike was too much for it [11:29:50] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on icinga1001 is OK: (C)0.3 lt (W)0.5 lt 0.8382 https://bit.ly/wmf-fpmsat 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [11:29:57] it is allready going down [11:29:59] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [11:30:05] yeah, the server is ok now [11:30:13] yea recovery in the graphs too [11:30:30] https://grafana.wikimedia.org/d/000000273/mysql?panelId=40&fullscreen&orgId=1&from=1594978415374&to=1594985374717&var-server=db1104&var-port=9104 [11:30:30] and there's te recovery page [11:30:41] marostegui: so php-fpm workers hanging on db conns? (just saw the alerts) [11:30:43] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [11:30:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1104', diff saved to https://phabricator.wikimedia.org/P11938 and previous config saved to /var/cache/conftool/dbconfig/20200717-113050-marostegui.json [11:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:59] going to depool it and then once it is fully fine, slowly repool it back [11:31:16] how do you slowly repool it? [11:31:28] with small weight changes [11:31:31] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:31:34] multiple changes, I see [11:31:35] ok [11:31:37] yep [11:32:33] hmm [11:33:10] it is ALWAYS the DBA [11:33:16] :( [11:33:20] <3 [11:38:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1104', diff saved to https://phabricator.wikimedia.org/P11939 and previous config saved to /var/cache/conftool/dbconfig/20200717-113800-marostegui.json [11:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:02] sure looks good from here so far [11:47:39] * volans here if needed [11:48:01] * addshore reads up [11:51:40] (03PS1) 10Elukey: profile::oozie::server: allow to specific jdbc host:port [puppet] - 10https://gerrit.wikimedia.org/r/613619 (https://phabricator.wikimedia.org/T257412) [11:51:48] 10Operations, 10MediaWiki-extensions-Babel: Two user pages using 304 and 291 Babel language boxes on meta can't be rendered: "request has exceeded memory limit" - https://phabricator.wikimedia.org/T231522 (10Ammarpad) [11:51:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1104', diff saved to https://phabricator.wikimedia.org/P11940 and previous config saved to /var/cache/conftool/dbconfig/20200717-115155-marostegui.json [11:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:36] (03CR) 10Effie Mouzeli: [C: 03+1] "Will merge on Monday" [puppet] - 10https://gerrit.wikimedia.org/r/612514 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [12:00:26] 10Operations, 10Arc-Lamp, 10Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (10jcrespo) free space: /srv 7134 MB / Usage: 98.2% [12:01:07] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:01:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1104', diff saved to https://phabricator.wikimedia.org/P11941 and previous config saved to /var/cache/conftool/dbconfig/20200717-120126-marostegui.json [12:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:33] (03PS1) 10Elukey: profile::hive::client: allow to specify jdbc host:port [puppet] - 10https://gerrit.wikimedia.org/r/613620 (https://phabricator.wikimedia.org/T257412) [12:02:59] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:03:48] (03CR) 10jerkins-bot: [V: 04-1] profile::hive::client: allow to specify jdbc host:port [puppet] - 10https://gerrit.wikimedia.org/r/613620 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [12:04:19] marostegui: what did you do to s8? [12:04:24] ;) [12:05:01] it wasn't me, it was the server! 
[12:09:39] (03PS2) 10Elukey: profile::hive::client: allow to specify jdbc host:port [puppet] - 10https://gerrit.wikimedia.org/r/613620 (https://phabricator.wikimedia.org/T257412) [12:11:10] (03CR) 10Elukey: [C: 03+2] profile::oozie::server: allow to specific jdbc host:port [puppet] - 10https://gerrit.wikimedia.org/r/613619 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [12:11:31] (03CR) 10Elukey: [C: 03+2] profile::hive::client: allow to specify jdbc host:port [puppet] - 10https://gerrit.wikimedia.org/r/613620 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [12:16:56] 10Operations, 10MediaWiki-extensions-Babel: Two user pages using 304 and 291 Babel language boxes on meta can't be rendered: "request has exceeded memory limit" - https://phabricator.wikimedia.org/T231522 (10Ammarpad) >>! In T231522#6313215, @Urbanecm wrote: > @ammarpad Does that mean it should work in theory?... [12:24:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1104', diff saved to https://phabricator.wikimedia.org/P11944 and previous config saved to /var/cache/conftool/dbconfig/20200717-122400-marostegui.json [12:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:22] (03PS1) 10Elukey: Deprecate pivot.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/613624 [12:24:25] moritzm: --^ [12:31:21] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/613624 (owner: 10Elukey) [12:32:22] (03CR) 10Elukey: [C: 03+2] Deprecate pivot.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/613624 (owner: 10Elukey) [12:36:09] (03CR) 10Elukey: "After a review of mariadb::config with some coffee and a fresh mind, I see two things:" [puppet] - 10https://gerrit.wikimedia.org/r/612821 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [12:37:03] (03CR) 10Elukey: [C: 03+2] profile::analytics::database::meta: enable TLS [puppet] - 10https://gerrit.wikimedia.org/r/612821 
(https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [12:39:26] (03PS1) 10Muehlenhoff: Switch Turnilo to CAS [puppet] - 10https://gerrit.wikimedia.org/r/613626 (https://phabricator.wikimedia.org/T159584) [12:46:03] \o/ \o/ \o/ [12:46:10] oohhh another one down [12:46:49] (03PS2) 10Jbond: DO NOT MERGE: Testing pcc [puppet] - 10https://gerrit.wikimedia.org/r/613610 [12:46:53] (03CR) 10Elukey: [C: 03+1] Switch Turnilo to CAS [puppet] - 10https://gerrit.wikimedia.org/r/613626 (https://phabricator.wikimedia.org/T159584) (owner: 10Muehlenhoff) [12:47:37] (03CR) 10jerkins-bot: [V: 04-1] DO NOT MERGE: Testing pcc [puppet] - 10https://gerrit.wikimedia.org/r/613610 (owner: 10Jbond) [12:48:06] (03PS3) 10Jbond: DO NOT MERGE: Testing pcc [puppet] - 10https://gerrit.wikimedia.org/r/613610 [12:48:45] (03CR) 10jerkins-bot: [V: 04-1] DO NOT MERGE: Testing pcc [puppet] - 10https://gerrit.wikimedia.org/r/613610 (owner: 10Jbond) [12:49:21] * elukey so tempted to +2 John's patch [12:49:29] :D lol [12:50:23] (03PS2) 10Muehlenhoff: Switch Turnilo to CAS [puppet] - 10https://gerrit.wikimedia.org/r/613626 (https://phabricator.wikimedia.org/T159584) [12:51:14] (03PS1) 10Jbond: Diffs: add ability to create a full diff report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/613629 (https://phabricator.wikimedia.org/T256448) [12:51:47] (03CR) 10jerkins-bot: [V: 04-1] Diffs: add ability to create a full diff report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/613629 (https://phabricator.wikimedia.org/T256448) (owner: 10Jbond) [12:59:59] (03PS3) 10Muehlenhoff: Switch Turnilo to CAS [puppet] - 10https://gerrit.wikimedia.org/r/613626 (https://phabricator.wikimedia.org/T159584) [13:00:15] (03CR) 10Ayounsi: [C: 03+1] "Tested and works as expected!" 
[software/homer] - 10https://gerrit.wikimedia.org/r/613613 (owner: 10Volans) [13:05:54] (03PS3) 10Privacybatm: Make transferpy configurable using a configuration file [software/transferpy] - 10https://gerrit.wikimedia.org/r/613128 (https://phabricator.wikimedia.org/T257602) [13:09:13] (03PS2) 10Jbond: Diffs: add ability to create a full diff report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/613629 (https://phabricator.wikimedia.org/T256448) [13:11:51] (03CR) 10Jbond: [C: 03+2] Diffs: add ability to create a full diff report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/613629 (https://phabricator.wikimedia.org/T256448) (owner: 10Jbond) [13:15:03] (03PS4) 10Jbond: DO NOT MERGE: Testing pcc [puppet] - 10https://gerrit.wikimedia.org/r/613610 [13:15:47] (03CR) 10jerkins-bot: [V: 04-1] DO NOT MERGE: Testing pcc [puppet] - 10https://gerrit.wikimedia.org/r/613610 (owner: 10Jbond) [13:19:19] (03PS5) 10Jbond: DO NOT MERGE: Testing pcc [puppet] - 10https://gerrit.wikimedia.org/r/613610 [13:20:09] (03CR) 10jerkins-bot: [V: 04-1] DO NOT MERGE: Testing pcc [puppet] - 10https://gerrit.wikimedia.org/r/613610 (owner: 10Jbond) [13:20:48] 10Operations, 10ops-eqiad, 10DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10jcrespo) a:03wiki_willy The service is back up from backups so the backup service continues uninterrupted during the weekend. @wiki_willy let us know what is the next step as mentioned by Maroste... [13:22:10] (03PS1) 10Privacybatm: Transferer.py: Resolve concurrency issue with checksum file names [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) [13:24:27] 10Operations, 10ops-eqiad, 10DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10Marostegui) @jcrespo remember I disabled notifications via puppet, I guess we should leave them disabled until the maintenance is done? 
[13:24:51] (03PS2) 10Privacybatm: Transferer.py: Resolve concurrency issue with checksum file names [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) [13:26:02] 10Operations, 10ops-eqiad, 10DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10jcrespo) Yes, I agree with that option. Thanks for creating the ticket and doing the initial triage! [13:26:32] 10Operations, 10ops-eqiad, 10DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10Marostegui) Cool! <3 [13:27:05] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/23959/" [puppet] - 10https://gerrit.wikimedia.org/r/613626 (https://phabricator.wikimedia.org/T159584) (owner: 10Muehlenhoff) [13:28:59] (03PS3) 10Privacybatm: Transferer.py: Resolve concurrency issue with checksum file names [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) [13:30:20] (03CR) 10Privacybatm: "Please give me some initial review on this." 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [13:30:22] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/613612 (owner: 10Volans) [13:37:14] (03PS1) 10JMeybohm: modules/systemd: Allow to define EnvironmentFile for timers [puppet] - 10https://gerrit.wikimedia.org/r/613634 (https://phabricator.wikimedia.org/T253843) [13:37:16] (03PS1) 10JMeybohm: charmuseum: Add systemd timer to package and push charts [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843) [13:41:02] (03PS14) 10Alexandros Kosiaris: Add discovery records for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609403 [13:42:19] (03CR) 10jerkins-bot: [V: 04-1] Add discovery records for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609403 (owner: 10Alexandros Kosiaris) [13:43:49] (03PS15) 10Alexandros Kosiaris: Add discovery records for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609403 [13:43:51] (03PS2) 10JMeybohm: charmuseum: Add systemd timer to package and push charts [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843) [13:45:05] PROBLEM - Host kubernetes2002 is DOWN: PING CRITICAL - Packet loss = 100% [13:45:54] (03CR) 10Alexandros Kosiaris: [C: 03+1] "minor comment in commit message, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [13:46:10] wait, what? kubernetes2002 down? 
[13:48:05] * akosiaris looking [13:49:05] (03PS3) 10JMeybohm: chartmuseum: Add systemd timer to package and push charts [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843) [13:49:53] kubernetes detected this btw ^, it's already scheduling pods in other nodes [13:51:17] (03PS4) 10JMeybohm: chartmuseum: Add systemd timer to package and push charts [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843) [13:51:40] PROBLEM - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - dewiki_content_1594699515[0](2020-07-14T04:05:15.243Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [13:52:59] (03PS1) 10ArielGlenn: start restructure of dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/613639 (https://phabricator.wikimedia.org/T254856) [13:53:04] console's empty [13:53:09] * akosiaris gonna force a reboot [13:53:23] (03CR) 10jerkins-bot: [V: 04-1] start restructure of dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/613639 (https://phabricator.wikimedia.org/T254856) (owner: 10ArielGlenn) [13:53:28] !log powercycle kubernetes2002 [13:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:05] RECOVERY - Host kubernetes2002 is UP: PING OK - Packet loss = 0%, RTA = 36.20 ms [13:56:07] (03PS5) 10JMeybohm: chartmuseum: Add systemd timer to package and push charts [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843) [14:08:21] (03PS6) 10JMeybohm: chartmuseum: Add systemd timer to package and push charts [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843) [14:13:12] (03PS1) 10Ayounsi: Routers interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/613641 [14:14:18] (03PS1) 10Ayounsi: Add routers interfaces support to wmf-netbox plugin [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/613642 [14:16:24] 
(03PS7) 10JMeybohm: chartmuseum: Add systemd timer to package and push charts [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843) [14:19:01] (03PS1) 10Ssingh: dnsdist: add a parameter to set the size of the ring buffers [puppet] - 10https://gerrit.wikimedia.org/r/613643 (https://phabricator.wikimedia.org/T252132) [14:19:08] (03CR) 10Alexandros Kosiaris: [C: 03+1] modules/systemd: Allow to define EnvironmentFile for timers [puppet] - 10https://gerrit.wikimedia.org/r/613634 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [14:20:55] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/23969/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/613643 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [14:21:06] (03PS16) 10Alexandros Kosiaris: Add discovery records for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609403 [14:23:05] (03CR) 10JMeybohm: "Thanks. There where still lot's of "puppet beginner" errors, syntax errors etc. 
😊" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [14:25:37] (03CR) 10Muehlenhoff: modules/systemd: Allow to define EnvironmentFile for timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/613634 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [14:25:55] (03PS1) 10Mholloway: Add mw_resource_loader_uri to Node.js service config vars [puppet] - 10https://gerrit.wikimedia.org/r/613645 (https://phabricator.wikimedia.org/T258186) [14:26:56] (03CR) 10Alexandros Kosiaris: "PCC at https://puppet-compiler.wmflabs.org/compiler1001/23970/, I 'll run a full fleet PCC as well" [puppet] - 10https://gerrit.wikimedia.org/r/609403 (owner: 10Alexandros Kosiaris) [14:29:16] (03CR) 10Alexandros Kosiaris: "https://puppet-compiler.wmflabs.org/compiler1003/23971/ full fleet PCC, will run for the next 2h or so" [puppet] - 10https://gerrit.wikimedia.org/r/609403 (owner: 10Alexandros Kosiaris) [14:32:06] (03CR) 10JMeybohm: modules/systemd: Allow to define EnvironmentFile for timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/613634 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [14:32:46] (03PS2) 10JMeybohm: modules/systemd: Allow to define EnvironmentFile for timers [puppet] - 10https://gerrit.wikimedia.org/r/613634 (https://phabricator.wikimedia.org/T253843) [14:32:48] (03PS8) 10JMeybohm: chartmuseum: Add systemd timer to package and push charts [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843) [14:34:33] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/613634 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [14:35:23] 10Operations, 10ops-codfw, 10RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (10Eevans) >>! 
In T256863#6313866, @wiki_willy wrote: > Hi @Eevans - it looks like this was originally scheduled to be refreshed this fiscal year during the annual CapEx planning, but then someone decided to... [14:45:31] (03PS2) 10ArielGlenn: start restructure of dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/613639 (https://phabricator.wikimedia.org/T254856) [14:54:04] (03CR) 10Ema: [C: 03+1] "ship it!" [puppet] - 10https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [14:58:40] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:08:21] (03PS3) 10ArielGlenn: start restructure of dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/613639 (https://phabricator.wikimedia.org/T254856) [15:09:56] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/613643 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:13:31] (03PS1) 10Hnowlan: api-gateway: Restrict unauthenticated write HTTP methods, permit read HTTP methods [deployment-charts] - 10https://gerrit.wikimedia.org/r/613650 (https://phabricator.wikimedia.org/T254906) [15:16:36] (03PS2) 10Hnowlan: api-gateway: Restrict unauthenticated write HTTP methods, permit read HTTP methods [deployment-charts] - 10https://gerrit.wikimedia.org/r/613650 (https://phabricator.wikimedia.org/T256769) [15:21:51] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:23:43] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:40:18] (03PS1) 10Elukey: profile::piwik::database: add TLS config for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/613651 (https://phabricator.wikimedia.org/T234826) [15:45:31] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/23972/matomo1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/613651 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [15:46:09] 10Operations, 10Epic, 10Goal: automatically collect network error reports from users' browsers - https://phabricator.wikimedia.org/T257527 (10CDanis) p:05Triage→03Medium a:03CDanis [16:03:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:08] (03CR) 10Ssingh: [C: 03+2] dnsdist: reload the certificates instead of restarting the service [puppet] - 10https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:06:06] (03CR) 10Andrew Bogott: [C: 03+2] toolforge: Set Strict-Transport-Security to 366 days [puppet] - 10https://gerrit.wikimedia.org/r/612947 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis) [16:08:01] (03CR) 10Ssingh: [C: 03+2] dnsdist: add a parameter to set the size of the ring buffers [puppet] - 10https://gerrit.wikimedia.org/r/613643 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:10:17] (03PS2) 10Ssingh: dnsdist: add a parameter to set the size of the ring buffers [puppet] - 10https://gerrit.wikimedia.org/r/613643 (https://phabricator.wikimedia.org/T252132) [16:10:17] (03CR) 10Ssingh: [V: 03+2 C: 03+2] dnsdist: add a parameter to set the size of the ring buffers [puppet] - 10https://gerrit.wikimedia.org/r/613643 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:17:06] 
10Operations, 10DNS, 10Traffic: Verify WikimediaFoundation.org ownership for Facebook - https://phabricator.wikimedia.org/T258284 (10Varnent) [16:21:05] (03PS1) 10CDanis: wikimediafoundation.org domain verification for Facebook [dns] - 10https://gerrit.wikimedia.org/r/613653 (https://phabricator.wikimedia.org/T258284) [16:29:47] (03CR) 10Hnowlan: api-gateway: Basic envoy chart WIP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [16:30:19] (03PS2) 10CDanis: wikimediafoundation.org domain verification for Facebook [dns] - 10https://gerrit.wikimedia.org/r/613653 (https://phabricator.wikimedia.org/T258284) [16:32:17] (03CR) 10Dzahn: [C: 03+2] Revert "admins: Setup home dir for addshore" [puppet] - 10https://gerrit.wikimedia.org/r/613602 (owner: 10Addshore) [16:32:49] (03CR) 10CDanis: [C: 03+2] wikimediafoundation.org domain verification for Facebook [dns] - 10https://gerrit.wikimedia.org/r/613653 (https://phabricator.wikimedia.org/T258284) (owner: 10CDanis) [16:34:52] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review: Verify WikimediaFoundation.org ownership for Facebook - https://phabricator.wikimedia.org/T258284 (10CDanis) 05Open→03Resolved a:03CDanis `✔️ cdanis@authdns1001.wikimedia.org ~ 🕧☕ host -t txt wikimediafoundation.org wikimediafoundation.org descripti... [16:41:19] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review: Verify WikimediaFoundation.org ownership for Facebook - https://phabricator.wikimedia.org/T258284 (10Varnent) Thank you @CDanis!! [16:43:15] 10Operations, 10ops-codfw, 10RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (10wiki_willy) >>! In T256863#6315360, @Eevans wrote: >>>! In T256863#6313866, @wiki_willy wrote: >> Hi @Eevans - it looks like this was originally scheduled to be refreshed this fiscal year during the annual... 
[16:46:42] (03CR) 10BryanDavis: [C: 03+1] "Generated nginx config tested on tools-proxy-06 by manually changing /etc/nginx/sites-available/proxy to match PCC output, restarting ngin" [puppet] - 10https://gerrit.wikimedia.org/r/612948 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis) [16:50:18] (03CR) 10Andrew Bogott: [C: 03+2] toolforge: Perform HTTPS redirects unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/612948 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis) [16:50:29] (03PS2) 10Andrew Bogott: toolforge: Perform HTTPS redirects unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/612948 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis) [16:56:26] 10Operations, 10Toolforge, 10Traffic, 10HTTPS, and 4 others: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 (10bd808) 05Open→03Resolved mean-time-to-implement: 5 years. Ouch. But it is done now! [17:00:45] (03PS4) 10ArielGlenn: start restructure of dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/613639 (https://phabricator.wikimedia.org/T254856) [17:05:54] (03PS1) 10Zfilipin: Add .gitreview file [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/613657 (https://phabricator.wikimedia.org/T255761) [17:07:09] (03PS1) 10Tks4Fish: Adding 'rollbacker' group for arzwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613658 (https://phabricator.wikimedia.org/T258100) [17:09:59] 10Operations, 10ops-eqiad, 10DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10wiki_willy) a:05wiki_willy→03Jclark-ctr @Jclark-ctr - can you check this one out when you're onsite next? It was only installed a few months ago, so we should be able to RMA the part pretty ea... 
[17:10:58] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [17:12:32] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [17:14:28] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10wiki_willy) @Marostegui - here are the details below on what Dell replaced. The DIMM A10, the SSD in slot 0, and the system board (though the board wasn't bad...it was just the CMOS... [17:19:13] (03CR) 10Ppchelko: api-gateway: Basic envoy chart WIP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [17:27:06] 10Operations, 10Toolforge, 10Traffic, 10HTTPS, and 4 others: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 (10bd808) Announced to community: https://lists.wikimedia.org/pipermail/cloud-announce/2020-July/000304.html [17:28:37] (03PS3) 10Dzahn: phabricator: create separate role/profile for aphlict (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/613205 (https://phabricator.wikimedia.org/T257617) [17:28:52] (03CR) 10jerkins-bot: [V: 04-1] phabricator: create separate role/profile for aphlict (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/613205 (https://phabricator.wikimedia.org/T257617) (owner: 10Dzahn) [17:32:21] (03PS4) 10Dzahn: phabricator: create separate role/profile for aphlict (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/613205 (https://phabricator.wikimedia.org/T257617) [17:33:35] (03CR) 10jerkins-bot: [V: 04-1] phabricator: create separate role/profile for aphlict (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/613205 (https://phabricator.wikimedia.org/T257617) (owner: 10Dzahn) [17:38:58] !log 
dpifke@deploy1001 Started deploy [performance/arc-lamp@a5d2fd3]: (no justification provided) [17:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:03] !log dpifke@deploy1001 Finished deploy [performance/arc-lamp@a5d2fd3]: (no justification provided) (duration: 00m 05s) [17:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:21] (03PS5) 10Dzahn: phabricator: create separate role/profile for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/613205 (https://phabricator.wikimedia.org/T257617) [17:43:54] 10Operations, 10ops-eqiad: please connect eqiad's RIPE Atlas anchor to one of the SCSes - https://phabricator.wikimedia.org/T258221 (10wiki_willy) a:03Cmjohnson [17:47:23] 10Operations, 10observability, 10Patch-For-Review: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10herron) [17:52:05] (03PS1) 10Herron: prometheus[123]001 assign role::prometheus, add to prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/613662 (https://phabricator.wikimedia.org/T243057) [17:53:34] (03PS2) 10Herron: prometheus[345]001 assign role::prometheus, add to prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/613662 (https://phabricator.wikimedia.org/T243057) [17:59:23] (03PS5) 10Cwhite: debianization [debs/grafana-loki] (debian/sid) - 10https://gerrit.wikimedia.org/r/610864 [18:01:01] 10Operations, 10Arc-Lamp, 10Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (10dpifke) Is there an easy way to grow this partition? Another 50-100GB would see us through until this data starts living in Swift later this quarter. (Patches to... 
[18:01:49] (03CR) 10Cwhite: "comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/613662 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [18:03:08] (03PS3) 10Herron: prometheus[123]001 assign role::prometheus, add to prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/613662 (https://phabricator.wikimedia.org/T243057) [18:04:52] 10Operations, 10Arc-Lamp, 10Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (10Dzahn) Adding a second virtual disk and mounting it is relatively easy. Resizing the existing partition is not so much. [18:05:44] (03CR) 10Cwhite: debianization (032 comments) [debs/grafana-loki] (debian/sid) - 10https://gerrit.wikimedia.org/r/610864 (owner: 10Cwhite) [18:07:28] (03CR) 10Herron: "good catch thanks 👓" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/613662 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [18:07:58] 10Operations, 10Arc-Lamp, 10Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (10dpifke) ArcLamp is a batch process, so brief downtime while we move the data over onto a new partition isn't a big deal. We could even shut down the instance if n... 
[18:16:56] (03PS1) 10Ryan Kemper: cirrussearch: Allow 2 dewiki->content shards per node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613669 [18:17:44] (03CR) 10jerkins-bot: [V: 04-1] cirrussearch: Allow 2 dewiki->content shards per node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613669 (owner: 10Ryan Kemper) [18:18:56] (03CR) 10Ebernhardson: [C: 03+1] cirrussearch: Allow 2 dewiki->content shards per node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613669 (owner: 10Ryan Kemper) [18:19:48] ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - dewiki_content_1594699515[0](2020-07-14T04:05:15.243Z) Ryan Kemper Failure to assign shard due to unforeseen implications of row awareness. Working on a patch for this upcoming monday https://wikitech.wikimedia.org/wiki/Search%23Administration [18:21:43] (03PS5) 10ArielGlenn: start restructure of dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/613639 (https://phabricator.wikimedia.org/T254856) [18:24:43] 10Operations, 10Arc-Lamp, 10Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (10Dzahn) I am adding a second virtual disk right now... in progress.... [18:29:33] (03CR) 10Dzahn: [C: 03+2] phabricator: create separate role/profile for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/613205 (https://phabricator.wikimedia.org/T257617) (owner: 10Dzahn) [18:34:37] (03PS1) 10Dzahn: aphlict: comment-out envoy TLS part for now [puppet] - 10https://gerrit.wikimedia.org/r/613672 [18:34:44] 10Operations, 10Phabricator, 10Security-Team: HTTP 500 error trying to access phabricator from my server - https://phabricator.wikimedia.org/T257507 (10Isarra) Heeey, it works. I have no idea what you did or what's going on so I cannot comment further. Good day, but thank you for whatever it was! (Sorry, ha... 
[18:34:51] 10Operations, 10User-jbond: update profile::waf::apache2::administrative to use the new abuse_networks hiera key - https://phabricator.wikimedia.org/T253632 (10Isarra) [18:34:53] 10Operations, 10Phabricator, 10Security-Team: HTTP 500 error trying to access phabricator from my server - https://phabricator.wikimedia.org/T257507 (10Isarra) 05Open→03Resolved [18:39:22] (03CR) 10Ebernhardson: [C: 03+1] cirrussearch: Allow 2 dewiki->content shards per node (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613669 (owner: 10Ryan Kemper) [18:41:39] (03CR) 10Dzahn: [C: 03+2] aphlict: comment-out envoy TLS part for now [puppet] - 10https://gerrit.wikimedia.org/r/613672 (owner: 10Dzahn) [18:45:43] 10Operations, 10ops-eqiad, 10DC-Ops, 10observability: eqiad: PDU Upgrade in C8 (July 14, 2pm-4pm UTC)) - https://phabricator.wikimedia.org/T257871 (10Jclark-ctr) updated netbox [18:46:01] (03CR) 10Volans: [C: 03+2] setup.py: add upper limit to prospector version [software/homer] - 10https://gerrit.wikimedia.org/r/613612 (owner: 10Volans) [18:46:29] (03CR) 10Volans: [C: 03+2] netbox: make Netbox errors surface through Jinja [software/homer] - 10https://gerrit.wikimedia.org/r/613613 (owner: 10Volans) [18:46:31] 10Operations, 10ops-eqiad, 10DC-Ops, 10observability: eqiad: PDU Upgrade in C8 (July 14, 2pm-4pm UTC)) - https://phabricator.wikimedia.org/T257871 (10Jclark-ctr) 05Open→03Resolved [18:46:33] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:TBD) rack/setup/install rows C and D new PDUs - https://phabricator.wikimedia.org/T253694 (10Jclark-ctr) [18:47:13] (03Merged) 10jenkins-bot: setup.py: add upper limit to prospector version [software/homer] - 10https://gerrit.wikimedia.org/r/613612 (owner: 10Volans) [18:47:37] (03Merged) 10jenkins-bot: netbox: make Netbox errors surface through Jinja [software/homer] - 10https://gerrit.wikimedia.org/r/613613 (owner: 10Volans) [18:49:49] (03PS1) 10Dzahn: aphlict: no longer require Phabricator class itself [puppet] - 
10https://gerrit.wikimedia.org/r/613674 [18:50:31] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10Jclark-ctr) Thanks willy Yes. DIMM A10 SSD slot 0 Main board [19:02:53] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/613662 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [19:06:25] (03PS1) 10Ammarpad: Switch $wgUrlShortenerDomainsWhitelist -> $wgUrlShortenerAllowedDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613682 (https://phabricator.wikimedia.org/T258134) [19:08:03] (03PS2) 10Ammarpad: Switch $wgUrlShortenerDomainsWhitelist -> $wgUrlShortenerAllowedDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613682 (https://phabricator.wikimedia.org/T255491) [19:10:31] (03CR) 10Ammarpad: "Hi @JdForester, can you comment whether it's OK to switch this directly like this?." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613682 (https://phabricator.wikimedia.org/T255491) (owner: 10Ammarpad) [19:15:32] (03PS1) 10Ammarpad: Remove wgPopupsPageBlacklist config setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613684 (https://phabricator.wikimedia.org/T254676) [19:20:04] (03CR) 10Jdlrobson: [C: 03+1] "I guess this will have to wait till Monday now? LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613684 (https://phabricator.wikimedia.org/T254676) (owner: 10Ammarpad) [19:24:41] (03CR) 10Ammarpad: "> I guess this will have to wait till Monday now? 
LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613684 (https://phabricator.wikimedia.org/T254676) (owner: 10Ammarpad) [19:27:45] (03CR) 10Dzahn: [C: 03+2] "noop in prod: https://puppet-compiler.wmflabs.org/compiler1002/23973/" [puppet] - 10https://gerrit.wikimedia.org/r/613674 (owner: 10Dzahn) [19:42:14] (03PS1) 10Dzahn: aphlict: use default basedir, not custom [puppet] - 10https://gerrit.wikimedia.org/r/613697 [20:12:57] (03CR) 10Dzahn: [C: 03+2] "noop in prod https://puppet-compiler.wmflabs.org/compiler1003/23974/" [puppet] - 10https://gerrit.wikimedia.org/r/613697 (owner: 10Dzahn) [20:18:12] arr, got disconnected while having puppet-merge open and now blocking myself "failed to lock" [20:24:43] mutante: I am curious what to do when that happens, so please share :P [20:26:42] you should be able to rm the file [20:27:17] although the script was written in such a way that it *should* have been cleaned up on any termination.. [20:27:56] ok, great. I would be very hesitant to rm anything on puppetmaster otherwise :) [20:28:38] sukhe: ok, update. i just could not get online at all for a while and now when i did it was back to normal by itself [20:28:48] and i could repeat puppet-merge normally [20:29:07] mutante: ah! so maybe it was the clean termination cdanis mentioned [20:29:22] i also had a process running to add a virtual disk on a ganeti VM and it wasn't in screen ..so let's see about that... [20:29:28] sukhe: looks like it, yes [20:31:22] well, it had to timeout [20:31:42] sure, that makes sense -- once the connection got reset, it probably cleaned up [20:32:07] *nod* [20:32:42] ok, now that puppetmaster is clean.. i gotta run and bbl, thx [21:16:16] !log Removing MongoDB packages and data from webperf1002. 
[21:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:45] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 46 probes of 649 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:35:23] (03PS1) 10Ladsgroup: osm: Rename "slave" to "replica" [puppet] - 10https://gerrit.wikimedia.org/r/613727 (https://phabricator.wikimedia.org/T254646) [21:44:31] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 8 probes of 649 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:48:51] (03PS5) 10Urbanecm: Initial configuration for lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: 10RhinosF1) [21:49:46] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: 10RhinosF1) [21:50:30] (03PS1) 10Ladsgroup: mariadb: Changing link to section from "slave" to "replica" [puppet] - 10https://gerrit.wikimedia.org/r/613732 (https://phabricator.wikimedia.org/T254646) [21:51:20] Ty Urbanecm [21:57:22] (03PS2) 10Urbanecm: Initial configuration for arywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611934 (https://phabricator.wikimedia.org/T257674) [21:58:15] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for arywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611934 (https://phabricator.wikimedia.org/T257674) (owner: 10Urbanecm) [22:00:50] (03PS3) 10Urbanecm: Initial configuration for arywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611934 (https://phabricator.wikimedia.org/T257674) [22:03:31] 
(03PS3) 10Urbanecm: Initial configuration for avkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) [22:05:12] (03PS6) 10Urbanecm: Initial configuration for lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: 10RhinosF1) [22:06:11] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: 10RhinosF1) [22:07:23] (03PS7) 10Urbanecm: Initial configuration for lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: 10RhinosF1) [22:32:35] 10Operations, 10Project-Admins: Rename #Operations Phab project to #WMF-SRE (or so) - https://phabricator.wikimedia.org/T258305 (10Aklapper) [22:40:44] RhinosF1: 3378 miraheze wikis, is that possible? [23:08:46] 10Operations, 10Arc-Lamp, 10Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (10Dzahn) I got disconnected while the command was still running and failed to run it in screen. And now "gnt-instance info" just hangs.. ugh... [23:16:21] dpifke: is it ok to gzip some older logs in /srv/xenon on webperf1002? [23:17:23] Not until https://gerrit.wikimedia.org/r/c/performance/arc-lamp/+/613740 lands. [23:17:45] Last I heard was that we were going to grow /srv - see https://phabricator.wikimedia.org/T257931 [23:19:24] dpifke: i tried to add a second disk but right now i can't even run the "info" command on it [23:19:59] how about moving some of the oldest files to somewhere in / ..hrmm [23:20:40] trying it again with a smaller disk and running the command in a screen [23:23:15] We could compress some old files so long as we leave zero-length placeholders with the same name so the SVGs don't get deleted. 
(It removes any SVG which doesn't have a corresponding log.) [23:23:36] That'll break anyone grepping through the logfiles over the weekend, but probably not a big deal. [23:25:01] dpifke: so if i'd add /srv2 with, for example 20 more Gigabytes, that would not even help because the /srv/xenon needs to be in a single place? [23:25:09] then let's do what you suggested [23:25:25] Correct (unless we get tricky with bind mounts). [23:26:01] If the problem adding a disk is that the Ganeti instance is running, it's OK to shut it down briefly. It'll catch up when it comes back. [23:26:56] i don't think the issue is that it's running [23:27:04] Also, a bunch of old files will expire when midnight UTC rolls around in ~30 min - we're just about at the peak for today. [23:27:10] creating the virtual disk should be separate [23:27:17] ah, that's good [23:27:31] about 5G left until then [23:32:44] dpifke: find /srv/xenon/logs/daily -mtime +30 -size +100M -exec gzip {} \; -exec touch {}; ? [23:33:16] well, another \ at the end. but the second -exec should only run if the first is succesful [23:36:22] Hmm... I just realized that's going to cause the SVGs to be regenerated against the empty logs. [23:36:46] (Because the timestamp will change, and we were using ctime instead of mtime.) [23:36:50] hmm, ok, if we can survive until midnight with those 5G that would be easiest then [23:37:03] We should definitely last another few days. [23:37:13] ok, then let's do .. nothing [23:37:45] for today [23:37:58] But I think I'm OK with expiring (deleting) 2-3 days of logs early, so we know for sure it's not going to run into problems over the weekend. [23:38:48] ok, want me do do anything or you got it? [23:40:13] meanwhile i tried to create just a 20G disk to see what happens and it's .. just sitting there. i'll check on it again later and if all fails ask Alex for help [23:41:19] I'll see how much is freed in 20 min, and delete a couple of additional days if it looks like not enough. 
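The compress-and-placeholder idea discussed above (gzip old logs, leave a zero-length file with the same name so the matching SVG is not removed) can be sketched as follows. This is only an illustration, not the ArcLamp code: the function names, the age/size thresholds, and the directory argument are hypothetical, and the age test only roughly mirrors `find -mtime +30 -size +100M`. The sketch restores the original mtime on the placeholder with `os.utime`, but ctime cannot be set that way, so anything that compares ctime (as mentioned at 23:36:46) would still see the file as changed.

```python
import gzip
import os
import shutil
import time
from pathlib import Path
from typing import List


def compress_with_placeholder(path: Path) -> Path:
    """Gzip `path` to `<name>.gz`, then leave an empty file under the original name."""
    st = path.stat()
    gz_path = path.with_name(path.name + ".gz")
    with path.open("rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # Truncate the original to zero length so it acts as the placeholder.
    path.write_bytes(b"")
    # Restore atime/mtime on both files. ctime is updated by the kernel
    # and cannot be restored from user space.
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))
    os.utime(gz_path, ns=(st.st_atime_ns, st.st_mtime_ns))
    return gz_path


def compress_old_logs(root: Path, min_age_days: float, min_size_bytes: int) -> List[Path]:
    """Compress regular files directly under `root` that are old and large enough."""
    cutoff = time.time() - min_age_days * 86400
    compressed = []
    for path in sorted(root.iterdir()):
        # Skip directories and files that are already compressed.
        if not path.is_file() or path.suffix == ".gz":
            continue
        st = path.stat()
        if st.st_mtime < cutoff and st.st_size > min_size_bytes:
            compressed.append(compress_with_placeholder(path))
    return compressed
```

A hypothetical call would be `compress_old_logs(Path("/srv/xenon/logs/daily"), 30, 100 * 1024 * 1024)`. Given the ctime caveat raised in the channel, this would likely still trigger SVG regeneration against the empty placeholders, which is why doing nothing until the midnight expiry was the simpler choice.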
[23:42:12] oh right, midnight UTC :) ack [23:44:29] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:46:21] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:58:59] 10Operations, 10Pywikibot, 10cloud-services-team (Kanban): http://pywikibot.org/ is displaying Wikimedia error page - https://phabricator.wikimedia.org/T257536 (10Dzahn) Currently I don't see any NS record for pywikbot.org ` dig NS pywikibot.org ... ; <<>> DiG 9.11.5-P4-5.1+deb10u1-Debian <<>> NS pywikibot...