[00:13:57] <wikibugs>	 (PS1) Dave Pifke: arclamp: Run & scrape Prometheus exporter [puppet] - https://gerrit.wikimedia.org/r/613359 (https://phabricator.wikimedia.org/T256035)
[00:16:13] <wikibugs>	 (PS1) Reedy: Remove line saying ldaplist will be removed 30 August 2016 [puppet] - https://gerrit.wikimedia.org/r/613360
[00:51:17] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:30:30] <wikibugs>	 Operations, Performance-Team, serviceops, Sustainability (Incident Prevention): Test gutter pool failover in production  and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (aaron) We don't really need purges to go to the gutter cache, given the low TTL there.   Lost purges during con...
[03:15:27] <icinga-wm>	 RECOVERY - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[04:37:47] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:39:37] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:46:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1092', diff saved to https://phabricator.wikimedia.org/P11933 and previous config saved to /var/cache/conftool/dbconfig/20200717-044658-marostegui.json
[04:47:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:47:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1111', diff saved to https://phabricator.wikimedia.org/P11934 and previous config saved to /var/cache/conftool/dbconfig/20200717-044748-marostegui.json
[04:47:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:50:00] <wikibugs>	 Operations, DBA, CAS-SSO, Patch-For-Review, User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (Marostegui) @jbond anything pending here or can this be closed?
[04:58:53] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[05:00:39] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[05:01:55] <wikibugs>	 Operations, DBA, Patch-For-Review, User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (Marostegui) a:Kormat
[05:03:15] <wikibugs>	 (PS1) Marostegui: db1124: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/613463
[05:04:01] <wikibugs>	 (CR) Marostegui: [C: +2] db1124: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/613463 (owner: Marostegui)
[05:06:45] <wikibugs>	 (PS1) Marostegui: db1084: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/613466 (https://phabricator.wikimedia.org/T257540)
[05:08:04] <wikibugs>	 (CR) Marostegui: [C: +2] db1084: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/613466 (https://phabricator.wikimedia.org/T257540) (owner: Marostegui)
[05:15:22] <wikibugs>	 Operations, DBA, Sustainability (Incident Prevention): Disallow 'weight: 0' for MW db config in dbctl - https://phabricator.wikimedia.org/T239901 (Marostegui) Open→Declined Going to close this as declined for now, as looks like we are not going to proceed with this so far.
[05:24:25] <wikibugs>	 Operations, DBA, Traffic, Patch-For-Review: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462 (Marostegui) a:Rduran→None
[05:26:19] <wikibugs>	 Operations, DBA, Epic, Performance-Team (Radar), Sustainability (Incident Prevention): Decide how to improve  parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (Marostegui)
[06:14:06] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 59 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:18:20] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:19:56] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 46 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:20:14] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:28:19] <XioNoX>	 !log rename msw1-eqiad interface range
[06:28:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:52] <XioNoX>	 !log rename msw1-codfw interface range
[06:30:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:14] <wikibugs>	 (PS2) Ayounsi: Add msw interfaces support [homer/public] - https://gerrit.wikimedia.org/r/606375
[06:38:24] <wikibugs>	 (CR) Elukey: [C: +1] Add an-tool1008 to site.pp [puppet] - https://gerrit.wikimedia.org/r/613141 (owner: Muehlenhoff)
[06:38:37] <wikibugs>	 (CR) Ayounsi: [C: +2] Add msw interfaces support [homer/public] - https://gerrit.wikimedia.org/r/606375 (owner: Ayounsi)
[06:39:00] <wikibugs>	 (Merged) jenkins-bot: Add msw interfaces support [homer/public] - https://gerrit.wikimedia.org/r/606375 (owner: Ayounsi)
[06:40:34] <wikibugs>	 (CR) Elukey: [C: +1] Add separate role for Yarn [puppet] - https://gerrit.wikimedia.org/r/613156 (https://phabricator.wikimedia.org/T258152) (owner: Muehlenhoff)
[06:43:16] <wikibugs>	 Operations, Analytics-Clusters, Discovery-Search, vm-requests: VM request for Analytics -> Elastic Search ML models update - https://phabricator.wikimedia.org/T258189 (elukey)
[06:47:25] <wikibugs>	 (CR) Muehlenhoff: debianization (2 comments) [debs/grafana-loki] (debian/sid) - https://gerrit.wikimedia.org/r/610864 (owner: Cwhite)
[06:50:56] <wikibugs>	 Operations, Analytics-Clusters, Discovery-Search, vm-requests: VM request for Analytics -> Elastic Search ML models update - https://phabricator.wikimedia.org/T258189 (elukey) @RKemper I am following https://wikitech.wikimedia.org/wiki/Ganeti#Create_a_VM, I'll list all the steps/details etc.. in...
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200717T0700)
[07:00:41] <wikibugs>	 Operations, Analytics-Clusters, Discovery-Search, vm-requests: VM request for Analytics -> Elastic Search ML models update - https://phabricator.wikimedia.org/T258189 (elukey) ` elukey@ganeti1011:~$ sudo gnt-group list Group Nodes Instances AllocPolicy NDParams row_A     4        39 preferred   o...
[07:03:54] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:03:58] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 50, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:04:20] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 130, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:04:47] <elukey>	 this must be Telia's maintenance
[07:05:17] <elukey>	 yep seems so from the gcal
[07:09:56] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:11:24] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:11:26] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:17:49] <wikibugs>	 (PS1) Elukey: Add PTR/A/AAAA records for search-loader[12]001 VMs [dns] - https://gerrit.wikimedia.org/r/613581 (https://phabricator.wikimedia.org/T258189)
[07:18:19] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add separate role for Yarn [puppet] - https://gerrit.wikimedia.org/r/613156 (https://phabricator.wikimedia.org/T258152) (owner: Muehlenhoff)
[07:20:43] <wikibugs>	 (CR) Elukey: [C: +2] Add PTR/A/AAAA records for search-loader[12]001 VMs [dns] - https://gerrit.wikimedia.org/r/613581 (https://phabricator.wikimedia.org/T258189) (owner: Elukey)
[07:23:17] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] mobileapps: add request template for load.php requests to config [deployment-charts] - https://gerrit.wikimedia.org/r/613237 (https://phabricator.wikimedia.org/T258186) (owner: Mholloway)
[07:24:21] <wikibugs>	 (Merged) jenkins-bot: mobileapps: add request template for load.php requests to config [deployment-charts] - https://gerrit.wikimedia.org/r/613237 (https://phabricator.wikimedia.org/T258186) (owner: Mholloway)
[07:24:28] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:24:30] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:28:18] <wikibugs>	 (PS1) Muehlenhoff: Apply Yarn role to an-tool1008 [puppet] - https://gerrit.wikimedia.org/r/613585
[07:28:43] <wikibugs>	 (CR) jerkins-bot: [V: -1] Apply Yarn role to an-tool1008 [puppet] - https://gerrit.wikimedia.org/r/613585 (owner: Muehlenhoff)
[07:30:28] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm
[07:30:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:06] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[07:32:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:20] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 130, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:33:20] <wikibugs>	 (PS2) Muehlenhoff: Apply Yarn role to an-tool1008 [puppet] - https://gerrit.wikimedia.org/r/613585 (https://phabricator.wikimedia.org/T258152)
[07:33:34] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
[07:33:34] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[07:33:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:12] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:34:23] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[07:34:24] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
[07:34:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:14] <wikibugs>	 (CR) JMeybohm: [C: +1] "Nice. Thanks for working this out!" [puppet] - https://gerrit.wikimedia.org/r/609403 (owner: Alexandros Kosiaris)
[07:43:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1111', diff saved to https://phabricator.wikimedia.org/P11935 and previous config saved to /var/cache/conftool/dbconfig/20200717-074335-marostegui.json
[07:43:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:22] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 130, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:49:38] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 67 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:50:30] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:50:34] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:50:54] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:51:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1104', diff saved to https://phabricator.wikimedia.org/P11936 and previous config saved to /var/cache/conftool/dbconfig/20200717-075124-marostegui.json
[07:51:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:20] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[07:54:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:30] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 45 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:59:29] <wikibugs>	 (PS2) ZPapierski: add logout config for wcqs [puppet] - https://gerrit.wikimedia.org/r/612681 (https://phabricator.wikimedia.org/T257314) (owner: Mstyles)
[08:05:23] <wikibugs>	 Operations, Analytics-Clusters, Discovery-Search, vm-requests, Patch-For-Review: VM request for Analytics -> Elastic Search ML models update - https://phabricator.wikimedia.org/T258189 (elukey) ` elukey@cumin1001:~$ sudo cookbook sre.ganeti.makevm eqiad_D search-loader1001.eqiad.wmnet --vcpus...
[08:05:53] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm
[08:05:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:46] <wikibugs>	 (PS3) ZPapierski: add logout config for wcqs [puppet] - https://gerrit.wikimedia.org/r/612681 (https://phabricator.wikimedia.org/T257314) (owner: Mstyles)
[08:25:50] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] "Sigh, icinga1001 isn't picking up the changes (https://puppet-compiler.wmflabs.org/compiler1003/23917/) This needs some more work." [puppet] - https://gerrit.wikimedia.org/r/609403 (owner: Alexandros Kosiaris)
[08:27:02] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:29:04] <wikibugs>	 Operations, DBA, CAS-SSO, Patch-For-Review, User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (jbond) Open→Resolved nothing pending on this task, resolving and thanks
[08:29:06] <wikibugs>	 Operations, CAS-SSO, Patch-For-Review, User-jbond: CAS Store U2f tokens in a database - https://phabricator.wikimedia.org/T256113 (jbond)
[08:29:28] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[08:29:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:39] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/613146 (https://phabricator.wikimedia.org/T247967) (owner: Muehlenhoff)
[08:30:48] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:30:59] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "> Patch Set 3:" [mediawiki-config] - https://gerrit.wikimedia.org/r/611393 (owner: Lucas Werkmeister (WMDE))
[08:32:13] <wikibugs>	 (CR) Jbond: "ready for review" [puppet] - https://gerrit.wikimedia.org/r/612514 (https://phabricator.wikimedia.org/T247956) (owner: Jbond)
[08:33:26] <wikibugs>	 Operations, Analytics-Clusters, Discovery-Search, vm-requests, Patch-For-Review: VM request for Analytics -> Elastic Search ML models update - https://phabricator.wikimedia.org/T258189 (elukey) ` elukey@cumin1001:~$ sudo cookbook sre.ganeti.makevm codfw_D search-loader2001.codfw.wmnet --vcpus...
[08:34:56] <wikibugs>	 (PS1) Elukey: Introduce search-loader[12]001 [puppet] - https://gerrit.wikimedia.org/r/613599 (https://phabricator.wikimedia.org/T258189)
[08:34:58] <wikibugs>	 (PS10) Jbond: profile::mediawiki::mcrouter_wancache: refactor [puppet] - https://gerrit.wikimedia.org/r/612514 (https://phabricator.wikimedia.org/T247956)
[08:38:04] <wikibugs>	 (PS2) Elukey: Introduce search-loader[12]001 [puppet] - https://gerrit.wikimedia.org/r/613599 (https://phabricator.wikimedia.org/T258189)
[08:38:13] <wikibugs>	 (CR) Elukey: [C: +1] prometheus::memcached_exporter: fix arguments hiera call [puppet] - https://gerrit.wikimedia.org/r/612507 (owner: Jbond)
[08:38:53] <wikibugs>	 (PS15) Jbond: P:mediawiki::mcrouter_wancache: refactor parameters [puppet] - https://gerrit.wikimedia.org/r/612523 (https://phabricator.wikimedia.org/T247956)
[08:39:04] <wikibugs>	 (PS4) Ema: dnsdist: reload the certificates instead of restarting the service [puppet] - https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh)
[08:39:17] <wikibugs>	 (CR) Ema: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh)
[08:39:45] <wikibugs>	 (PS1) Muehlenhoff: Bump changelog for buster rebuild [debs/prometheus-atlas-exporter] - https://gerrit.wikimedia.org/r/613600
[08:40:38] <wikibugs>	 (PS9) Jbond: mcrouter: store defaults in module not in hiera [puppet] - https://gerrit.wikimedia.org/r/612532 (https://phabricator.wikimedia.org/T247956)
[08:40:50] <wikibugs>	 (CR) Elukey: [C: +2] Introduce search-loader[12]001 [puppet] - https://gerrit.wikimedia.org/r/613599 (https://phabricator.wikimedia.org/T258189) (owner: Elukey)
[08:41:27] <wikibugs>	 (CR) Jbond: [C: +2] prometheus::memcached_exporter: fix arguments hiera call [puppet] - https://gerrit.wikimedia.org/r/612507 (owner: Jbond)
[08:46:38] <wikibugs>	 (CR) Muehlenhoff: [V: +2 C: +2] Bump changelog for buster rebuild [debs/prometheus-atlas-exporter] - https://gerrit.wikimedia.org/r/613600 (owner: Muehlenhoff)
[08:48:28] <moritzm>	 !log imported prometheus-atlas-exporter 1.0+git20191204.ffafab7-2 to buster-wikimedia T247967
[08:48:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:33] <stashbot>	 T247967: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967
[08:57:11] <wikibugs>	 (PS1) Jbond: debug_host: add entry point for debug_host script [software/puppet-compiler] - https://gerrit.wikimedia.org/r/613601
[08:58:10] <wikibugs>	 (CR) Jbond: [C: +2] debug_host: add entry point for debug_host script [software/puppet-compiler] - https://gerrit.wikimedia.org/r/613601 (owner: Jbond)
[08:59:17] <wikibugs>	 (PS3) Muehlenhoff: Apply Yarn role to an-tool1008 [puppet] - https://gerrit.wikimedia.org/r/613585 (https://phabricator.wikimedia.org/T258152)
[09:01:30] <wikibugs>	 (CR) Ema: dnsdist: reload the certificates instead of restarting the service (2 comments) [puppet] - https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh)
[09:02:04] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:03:54] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:04:13] <wikibugs>	 (PS6) Ema: varnish: handle checks at healthcheck.wm.org/varnish-fe too [puppet] - https://gerrit.wikimedia.org/r/610046 (https://phabricator.wikimedia.org/T255015)
[09:04:39] <wikibugs>	 (CR) Ema: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/610046 (https://phabricator.wikimedia.org/T255015) (owner: Ema)
[09:06:58] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Apply Yarn role to an-tool1008 [puppet] - https://gerrit.wikimedia.org/r/613585 (https://phabricator.wikimedia.org/T258152) (owner: Muehlenhoff)
[09:09:58] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:10:48] <wikibugs>	 (PS1) Addshore: Revert "admins: Setup home dir for addshore" [puppet] - https://gerrit.wikimedia.org/r/613602
[09:11:36] <wikibugs>	 (PS2) Addshore: Revert "admins: Setup home dir for addshore" [puppet] - https://gerrit.wikimedia.org/r/613602
[09:12:00] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw1280.eqiad.wmnet
[09:12:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:42] <elukey>	 this is a mistake sorry
[09:14:46] <jbond42>	 q
[09:15:02] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1280.eqiad.wmnet
[09:15:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:34] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:20:01] <wikibugs>	 Operations, ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (akosiaris) >>! In T244530#6313254, @Dzahn wrote: > @Jclark-ctr The mgmt interface of ganeti1008 just went down. Could you please check the cable?  Looks ok now in icinga, so this was transient...
[09:22:47] <wikibugs>	 Operations, Analytics-Clusters, Discovery-Search, vm-requests, Patch-For-Review: VM request for Analytics -> Elastic Search ML models update - https://phabricator.wikimedia.org/T258189 (elukey) VMs ready!
[09:23:17] <wikibugs>	 (PS1) Muehlenhoff: Require profile::hadoop::common in the yarn_proxy profile [puppet] - https://gerrit.wikimedia.org/r/613603 (https://phabricator.wikimedia.org/T258152)
[09:23:31] <wikibugs>	 (CR) Ema: [C: +2] varnish: handle checks at healthcheck.wm.org/varnish-fe too [puppet] - https://gerrit.wikimedia.org/r/610046 (https://phabricator.wikimedia.org/T255015) (owner: Ema)
[09:23:42] <wikibugs>	 (CR) Elukey: [C: +1] Require profile::hadoop::common in the yarn_proxy profile [puppet] - https://gerrit.wikimedia.org/r/613603 (https://phabricator.wikimedia.org/T258152) (owner: Muehlenhoff)
[09:23:55] <wikibugs>	 Operations, ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (akosiaris) Open→Resolved Host added back in the cluster, VMs being migrated back to it.   @Jclark-ctr many thanks!
[09:24:24] <icinga-wm>	 RECOVERY - Host etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 3.09 ms
[09:24:42] <icinga-wm>	 RECOVERY - Host kubetcd1005 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms
[09:25:56] <wikibugs>	 (PS2) Muehlenhoff: Require profile::hadoop::common in the yarn_proxy profile [puppet] - https://gerrit.wikimedia.org/r/613603 (https://phabricator.wikimedia.org/T258152)
[09:25:57] <wikibugs>	 Operations, Analytics-Clusters, Discovery-Search, vm-requests, Patch-For-Review: VM request for Analytics -> Elastic Search ML models update - https://phabricator.wikimedia.org/T258189 (elukey) Open→Resolved a:elukey
[09:29:24] <wikibugs>	 (CR) Elukey: [C: +2] Require profile::hadoop::common in the yarn_proxy profile [puppet] - https://gerrit.wikimedia.org/r/613603 (https://phabricator.wikimedia.org/T258152) (owner: Muehlenhoff)
[09:40:37] <wikibugs>	 (PS5) Elukey: Add custom ferm srange to Kafka Jumbo brokers [puppet] - https://gerrit.wikimedia.org/r/611168
[09:43:30] <wikibugs>	 (CR) Elukey: [C: -2] "Don't merge until https://phabricator.wikimedia.org/T258245 is done" [puppet] - https://gerrit.wikimedia.org/r/611168 (owner: Elukey)
[09:44:58] <wikibugs>	 (PS3) Addshore: Update several Wikidata-related configs [mediawiki-config] - https://gerrit.wikimedia.org/r/612918 (owner: Matěj Suchánek)
[09:45:08] <wikibugs>	 (CR) Addshore: [C: +1] "Looks good" [mediawiki-config] - https://gerrit.wikimedia.org/r/612918 (owner: Matěj Suchánek)
[09:50:33] <icinga-wm>	 PROBLEM - Host db1145 is DOWN: PING CRITICAL - Packet loss = 100%
[09:50:49] <marostegui>	 errr
[09:50:49] <marostegui>	 what?
[09:51:15] <marostegui>	 jynus: ^ that is a backupsource
[09:51:50] <wikibugs>	 (PS1) Muehlenhoff: Switch over Yarn to an-tool1008 in ATS [puppet] - https://gerrit.wikimedia.org/r/613606 (https://phabricator.wikimedia.org/T258152)
[09:51:54] <marostegui>	 It is rebooting
[09:52:04] <marostegui>	 UEFI0079: One or more uncorrectable Memory errors occurred in the previous
[09:52:04] <marostegui>	 boot.
[09:52:04] <marostegui>	 Check the System Event Log (SEL) to identify the non-functional DIMM, and then
[09:53:22] <wikibugs>	 (PS1) Jbond: DO NOT MERGE: testing [puppet] - https://gerrit.wikimedia.org/r/613607
[09:53:32] <wikibugs>	 Operations, ops-eqiad, DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (Marostegui)
[09:53:39] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloud: introduce very basic nftables setup in cloudnet servers [puppet] - https://gerrit.wikimedia.org/r/613608
[09:54:11] <wikibugs>	 (CR) Jbond: [C: -2] "testing PCC" [puppet] - https://gerrit.wikimedia.org/r/613607 (owner: Jbond)
[09:55:12] <wikibugs>	 Operations, ops-eqiad, DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (Marostegui) More HW logs - we've got a broken DIMM apparently ` Record:      1 Date/Time:   03/30/2020 23:27:03 Source:      system Severity:    Ok Description: Log cleared. -----------------------...
[09:55:56] <wikibugs>	 (PS1) Marostegui: db1145: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/613609 (https://phabricator.wikimedia.org/T258249)
[09:56:23] <wikibugs>	 (CR) Marostegui: [C: +2] db1145: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/613609 (https://phabricator.wikimedia.org/T258249) (owner: Marostegui)
[09:56:45] <icinga-wm>	 RECOVERY - Host db1145 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[09:58:28] <wikibugs>	 Operations, ops-eqiad, DBA, Patch-For-Review: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (Marostegui) I have hit F1, to continue its boot, so we can see the OS. The memory on the OS looks ok: ` root@db1145:~# free -g               total        used        free...
[09:58:48] <wikibugs>	 Operations, ops-eqiad, DBA, Patch-For-Review: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (Marostegui) p:Triage→Medium
[09:59:23] <icinga-wm>	 PROBLEM - MariaDB read only s5 on db1145 is CRITICAL: Could not connect to localhost:3315 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:59:43] <icinga-wm>	 PROBLEM - mysqld processes on db1145 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[09:59:45] <wikibugs>	 Operations, ops-eqiad, DBA, Patch-For-Review: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (Marostegui)
[10:00:41] <wikibugs>	 (Abandoned) Jbond: DO NOT MERGE: testing [puppet] - https://gerrit.wikimedia.org/r/613607 (owner: Jbond)
[10:02:35] <wikibugs>	 (PS1) Jbond: DO NOT MERGE: Testing pcc [puppet] - https://gerrit.wikimedia.org/r/613610
[10:04:24] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: cloud: introduce very basic nftables setup in cloudnet servers [puppet] - https://gerrit.wikimedia.org/r/613608
[10:06:52] <wikibugs>	 (PS1) Muehlenhoff: Disable backports on stretch [puppet] - https://gerrit.wikimedia.org/r/613611 (https://phabricator.wikimedia.org/T256877)
[10:08:37] <wikibugs>	 (Abandoned) Muehlenhoff: Stop including backports on Stretch production hosts [puppet] - https://gerrit.wikimedia.org/r/612612 (https://phabricator.wikimedia.org/T256881) (owner: Muehlenhoff)
[10:09:26] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Install the fcgid package on Netmon [puppet] - https://gerrit.wikimedia.org/r/613146 (https://phabricator.wikimedia.org/T247967) (owner: Muehlenhoff)
[10:11:34] <wikibugs>	 (03CR) 10Muehlenhoff: "Superceded by https://gerrit.wikimedia.org/r/c/operations/puppet/+/612865 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/613146" [puppet] - 10https://gerrit.wikimedia.org/r/611318 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi)
[10:15:36] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: cloud: introduce very basic nftables setup in cloudnet servers [puppet] - 10https://gerrit.wikimedia.org/r/613608
[10:17:39] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "PCC https://puppet-compiler.wmflabs.org/compiler1002/23949/" [puppet] - 10https://gerrit.wikimedia.org/r/613608 (owner: 10Arturo Borrero Gonzalez)
[10:18:20] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: introduce very basic nftables setup in cloudnet servers [puppet] - 10https://gerrit.wikimedia.org/r/613608 (owner: 10Arturo Borrero Gonzalez)
[10:27:34] <wikibugs>	 (03PS1) 10Volans: setup.py: add upper limit to prospector version [software/homer] - 10https://gerrit.wikimedia.org/r/613612
[10:27:36] <wikibugs>	 (03PS1) 10Volans: netbox: make Netbox errors surface through Jinja [software/homer] - 10https://gerrit.wikimedia.org/r/613613
[11:06:17] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10jcrespo) I've started a restore process from yesterday's backups.
[11:11:23] <wikibugs>	 (03CR) 10Muehlenhoff: "Approach looks fine, sans the nits left by Ema and myself" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh)
[11:19:45] <wikibugs>	 (03CR) 10Ssingh: "> Patch Set 4:" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh)
[11:21:13] <wikibugs>	 (03PS5) 10Ssingh: dnsdist: reload the certificates instead of restarting the service [puppet] - 10https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132)
[11:24:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1104', diff saved to https://phabricator.wikimedia.org/P11937 and previous config saved to /var/cache/conftool/dbconfig/20200717-112413-marostegui.json
[11:24:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:27] <wikibugs>	 (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1002/23950/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh)
[11:27:47] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:27:56] <icinga-wm>	 PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on icinga1001 is CRITICAL: 0.2188 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver
[11:28:07] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[11:28:53] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[11:29:46] <marostegui>	 I think it is the server I just repooled, the spike was too much for it
[11:29:50] <icinga-wm>	 RECOVERY - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on icinga1001 is OK: (C)0.3 lt (W)0.5 lt 0.8382 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver
[11:29:57] <jbond42>	 it is already going down
[11:29:59] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[11:30:05] <marostegui>	 yeah, the server is ok now
[11:30:13] <apergos>	 yea recovery in the graphs too
[11:30:30] <marostegui>	 https://grafana.wikimedia.org/d/000000273/mysql?panelId=40&fullscreen&orgId=1&from=1594978415374&to=1594985374717&var-server=db1104&var-port=9104
[11:30:30] <apergos>	 and there's the recovery page
[11:30:41] <elukey>	 marostegui: so php-fpm workers hanging on db conns? (just saw the alerts)
[11:30:43] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[11:30:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1104', diff saved to https://phabricator.wikimedia.org/P11938 and previous config saved to /var/cache/conftool/dbconfig/20200717-113050-marostegui.json
[11:30:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:59] <marostegui>	 going to depool it and then once it is fully fine, slowly repool it back
[11:31:16] <apergos>	 how do you slowly repool it?
[11:31:28] <marostegui>	 with small weight changes
[11:31:31] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:31:34] <apergos>	 multiple changes, I see
[11:31:35] <apergos>	 ok
[11:31:37] <marostegui>	 yep
[11:32:33] <effie>	 hmm
[11:33:10] <effie>	 it is ALWAYS the DBA 
[11:33:16] <marostegui>	 :(
[11:33:20] <effie>	 <3
[11:38:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1104', diff saved to https://phabricator.wikimedia.org/P11939 and previous config saved to /var/cache/conftool/dbconfig/20200717-113800-marostegui.json
[11:38:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:02] <apergos>	 sure looks good from here so far
[11:47:39] * volans here if needed
[11:48:01] * addshore reads up
[11:51:40] <wikibugs>	 (03PS1) 10Elukey: profile::oozie::server: allow to specific jdbc host:port [puppet] - 10https://gerrit.wikimedia.org/r/613619 (https://phabricator.wikimedia.org/T257412)
[11:51:48] <wikibugs>	 10Operations, 10MediaWiki-extensions-Babel: Two user pages using 304 and 291 Babel language boxes on meta can't be rendered: "request has exceeded memory limit" - https://phabricator.wikimedia.org/T231522 (10Ammarpad)
[11:51:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1104', diff saved to https://phabricator.wikimedia.org/P11940 and previous config saved to /var/cache/conftool/dbconfig/20200717-115155-marostegui.json
[11:51:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:36] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] "Will merge on Monday" [puppet] - 10https://gerrit.wikimedia.org/r/612514 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond)
[12:00:26] <wikibugs>	 10Operations, 10Arc-Lamp, 10Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (10jcrespo) free space: /srv 7134 MB / Usage: 98.2%
[12:01:07] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:01:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1104', diff saved to https://phabricator.wikimedia.org/P11941 and previous config saved to /var/cache/conftool/dbconfig/20200717-120126-marostegui.json
[12:01:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:33] <wikibugs>	 (03PS1) 10Elukey: profile::hive::client: allow to specify jdbc host:port [puppet] - 10https://gerrit.wikimedia.org/r/613620 (https://phabricator.wikimedia.org/T257412)
[12:02:59] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:03:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile::hive::client: allow to specify jdbc host:port [puppet] - 10https://gerrit.wikimedia.org/r/613620 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey)
[12:04:19] <cdanis>	 marostegui: what did you do to s8?
[12:04:24] <cdanis>	 ;)
[12:05:01] <marostegui>	 it wasn't me, it was the server!
[12:09:39] <wikibugs>	 (03PS2) 10Elukey: profile::hive::client: allow to specify jdbc host:port [puppet] - 10https://gerrit.wikimedia.org/r/613620 (https://phabricator.wikimedia.org/T257412)
[12:11:10] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::oozie::server: allow to specific jdbc host:port [puppet] - 10https://gerrit.wikimedia.org/r/613619 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey)
[12:11:31] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::hive::client: allow to specify jdbc host:port [puppet] - 10https://gerrit.wikimedia.org/r/613620 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey)
[12:16:56] <wikibugs>	 10Operations, 10MediaWiki-extensions-Babel: Two user pages using 304 and 291 Babel language boxes on meta can't be rendered: "request has exceeded memory limit" - https://phabricator.wikimedia.org/T231522 (10Ammarpad) >>! In T231522#6313215, @Urbanecm wrote: > @ammarpad Does that mean it should work in theory?...
[12:24:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1104', diff saved to https://phabricator.wikimedia.org/P11944 and previous config saved to /var/cache/conftool/dbconfig/20200717-122400-marostegui.json
[12:24:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:22] <wikibugs>	 (03PS1) 10Elukey: Deprecate pivot.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/613624
[12:24:25] <elukey>	 moritzm: --^
[12:31:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/613624 (owner: 10Elukey)
[12:32:22] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Deprecate pivot.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/613624 (owner: 10Elukey)
[12:36:09] <wikibugs>	 (03CR) 10Elukey: "After a review of mariadb::config with some coffee and a fresh mind, I see two things:" [puppet] - 10https://gerrit.wikimedia.org/r/612821 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey)
[12:37:03] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::analytics::database::meta: enable TLS [puppet] - 10https://gerrit.wikimedia.org/r/612821 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey)
[12:39:26] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch Turnilo to CAS [puppet] - 10https://gerrit.wikimedia.org/r/613626 (https://phabricator.wikimedia.org/T159584)
[12:46:03] <elukey>	 \o/ \o/ \o/
[12:46:10] <apergos>	 oohhh another one down
[12:46:49] <wikibugs>	 (03PS2) 10Jbond: DO NOT MERGE: Testing pcc [puppet] - 10https://gerrit.wikimedia.org/r/613610
[12:46:53] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Switch Turnilo to CAS [puppet] - 10https://gerrit.wikimedia.org/r/613626 (https://phabricator.wikimedia.org/T159584) (owner: 10Muehlenhoff)
[12:47:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] DO NOT MERGE: Testing pcc [puppet] - 10https://gerrit.wikimedia.org/r/613610 (owner: 10Jbond)
[12:48:06] <wikibugs>	 (03PS3) 10Jbond: DO NOT MERGE: Testing pcc [puppet] - 10https://gerrit.wikimedia.org/r/613610
[12:48:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] DO NOT MERGE: Testing pcc [puppet] - 10https://gerrit.wikimedia.org/r/613610 (owner: 10Jbond)
[12:49:21] * elukey so tempted to +2 John's patch
[12:49:29] <jbond42>	 :D lol
[12:50:23] <wikibugs>	 (03PS2) 10Muehlenhoff: Switch Turnilo to CAS [puppet] - 10https://gerrit.wikimedia.org/r/613626 (https://phabricator.wikimedia.org/T159584)
[12:51:14] <wikibugs>	 (03PS1) 10Jbond: Diffs: add ability to create a full diff report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/613629 (https://phabricator.wikimedia.org/T256448)
[12:51:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Diffs: add ability to create a full diff report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/613629 (https://phabricator.wikimedia.org/T256448) (owner: 10Jbond)
[12:59:59] <wikibugs>	 (03PS3) 10Muehlenhoff: Switch Turnilo to CAS [puppet] - 10https://gerrit.wikimedia.org/r/613626 (https://phabricator.wikimedia.org/T159584)
[13:00:15] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "Tested and works as expected!" [software/homer] - 10https://gerrit.wikimedia.org/r/613613 (owner: 10Volans)
[13:05:54] <wikibugs>	 (03PS3) 10Privacybatm: Make transferpy configurable using a configuration file [software/transferpy] - 10https://gerrit.wikimedia.org/r/613128 (https://phabricator.wikimedia.org/T257602)
[13:09:13] <wikibugs>	 (03PS2) 10Jbond: Diffs: add ability to create a full diff report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/613629 (https://phabricator.wikimedia.org/T256448)
[13:11:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Diffs: add ability to create a full diff report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/613629 (https://phabricator.wikimedia.org/T256448) (owner: 10Jbond)
[13:15:03] <wikibugs>	 (03PS4) 10Jbond: DO NOT MERGE: Testing pcc [puppet] - 10https://gerrit.wikimedia.org/r/613610
[13:15:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] DO NOT MERGE: Testing pcc [puppet] - 10https://gerrit.wikimedia.org/r/613610 (owner: 10Jbond)
[13:19:19] <wikibugs>	 (03PS5) 10Jbond: DO NOT MERGE: Testing pcc [puppet] - 10https://gerrit.wikimedia.org/r/613610
[13:20:09] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] DO NOT MERGE: Testing pcc [puppet] - 10https://gerrit.wikimedia.org/r/613610 (owner: 10Jbond)
[13:20:48] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10jcrespo) a:03wiki_willy The service is back up from backups so the backup service continues uninterrupted during the weekend. @wiki_willy let us know what is the next step as mentioned by Maroste...
[13:22:10] <wikibugs>	 (03PS1) 10Privacybatm: Transferer.py: Resolve concurrency issue with checksum file names [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450)
[13:24:27] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10Marostegui) @jcrespo remember I disabled notifications via puppet, I guess we should leave them disabled until the maintenance is done?
[13:24:51] <wikibugs>	 (03PS2) 10Privacybatm: Transferer.py: Resolve concurrency issue with checksum file names [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450)
[13:26:02] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10jcrespo) Yes, I agree with that option. Thanks for creating the ticket and doing the initial triage!
[13:26:32] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10Marostegui) Cool! <3
[13:27:05] <wikibugs>	 (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/23959/" [puppet] - 10https://gerrit.wikimedia.org/r/613626 (https://phabricator.wikimedia.org/T159584) (owner: 10Muehlenhoff)
[13:28:59] <wikibugs>	 (03PS3) 10Privacybatm: Transferer.py: Resolve concurrency issue with checksum file names [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450)
[13:30:20] <wikibugs>	 (03CR) 10Privacybatm: "Please give me some initial review on this." [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm)
[13:30:22] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/613612 (owner: 10Volans)
[13:37:14] <wikibugs>	 (03PS1) 10JMeybohm: modules/systemd: Allow to define EnvironmentFile for timers [puppet] - 10https://gerrit.wikimedia.org/r/613634 (https://phabricator.wikimedia.org/T253843)
[13:37:16] <wikibugs>	 (03PS1) 10JMeybohm: charmuseum: Add systemd timer to package and push charts [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843)
[13:41:02] <wikibugs>	 (03PS14) 10Alexandros Kosiaris: Add discovery records for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609403
[13:42:19] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add discovery records for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609403 (owner: 10Alexandros Kosiaris)
[13:43:49] <wikibugs>	 (03PS15) 10Alexandros Kosiaris: Add discovery records for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609403
[13:43:51] <wikibugs>	 (03PS2) 10JMeybohm: charmuseum: Add systemd timer to package and push charts [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843)
[13:45:05] <icinga-wm>	 PROBLEM - Host kubernetes2002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:45:54] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] "minor comment in commit message, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm)
[13:46:10] <akosiaris>	 wait, what? kubernetes2002 down?
[13:48:05] * akosiaris looking
[13:49:05] <wikibugs>	 (03PS3) 10JMeybohm: chartmuseum: Add systemd timer to package and push charts [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843)
[13:49:53] <akosiaris>	 kubernetes detected this btw ^, it's already scheduling pods in other nodes
[13:51:17] <wikibugs>	 (03PS4) 10JMeybohm: chartmuseum: Add systemd timer to package and push charts [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843)
[13:51:40] <icinga-wm>	 PROBLEM - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - dewiki_content_1594699515[0](2020-07-14T04:05:15.243Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[13:52:59] <wikibugs>	 (03PS1) 10ArielGlenn: start restructure of dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/613639 (https://phabricator.wikimedia.org/T254856)
[13:53:04] <akosiaris>	 console's empty
[13:53:09] * akosiaris gonna force a reboot
[13:53:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] start restructure of dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/613639 (https://phabricator.wikimedia.org/T254856) (owner: 10ArielGlenn)
[13:53:28] <akosiaris>	 !log powercycle kubernetes2002
[13:53:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:05] <icinga-wm>	 RECOVERY - Host kubernetes2002 is UP: PING OK - Packet loss = 0%, RTA = 36.20 ms
[13:56:07] <wikibugs>	 (03PS5) 10JMeybohm: chartmuseum: Add systemd timer to package and push charts [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843)
[14:08:21] <wikibugs>	 (03PS6) 10JMeybohm: chartmuseum: Add systemd timer to package and push charts [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843)
[14:13:12] <wikibugs>	 (03PS1) 10Ayounsi: Routers interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/613641
[14:14:18] <wikibugs>	 (03PS1) 10Ayounsi: Add routers interfaces support to wmf-netbox plugin [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/613642
[14:16:24] <wikibugs>	 (03PS7) 10JMeybohm: chartmuseum: Add systemd timer to package and push charts [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843)
[14:19:01] <wikibugs>	 (03PS1) 10Ssingh: dnsdist: add a parameter to set the size of the ring buffers [puppet] - 10https://gerrit.wikimedia.org/r/613643 (https://phabricator.wikimedia.org/T252132)
[14:19:08] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] modules/systemd: Allow to define EnvironmentFile for timers [puppet] - 10https://gerrit.wikimedia.org/r/613634 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm)
[14:20:55] <wikibugs>	 (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/23969/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/613643 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh)
[14:21:06] <wikibugs>	 (03PS16) 10Alexandros Kosiaris: Add discovery records for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609403
[14:23:05] <wikibugs>	 (03CR) 10JMeybohm: "Thanks. There where still lot's of "puppet beginner" errors, syntax errors etc. 😊" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm)
[14:25:37] <wikibugs>	 (03CR) 10Muehlenhoff: modules/systemd: Allow to define EnvironmentFile for timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/613634 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm)
[14:25:55] <wikibugs>	 (03PS1) 10Mholloway: Add mw_resource_loader_uri to Node.js service config vars [puppet] - 10https://gerrit.wikimedia.org/r/613645 (https://phabricator.wikimedia.org/T258186)
[14:26:56] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "PCC at https://puppet-compiler.wmflabs.org/compiler1001/23970/, I 'll run a full fleet PCC as well" [puppet] - 10https://gerrit.wikimedia.org/r/609403 (owner: 10Alexandros Kosiaris)
[14:29:16] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "https://puppet-compiler.wmflabs.org/compiler1003/23971/ full fleet PCC, will run for the next 2h or so" [puppet] - 10https://gerrit.wikimedia.org/r/609403 (owner: 10Alexandros Kosiaris)
[14:32:06] <wikibugs>	 (03CR) 10JMeybohm: modules/systemd: Allow to define EnvironmentFile for timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/613634 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm)
[14:32:46] <wikibugs>	 (03PS2) 10JMeybohm: modules/systemd: Allow to define EnvironmentFile for timers [puppet] - 10https://gerrit.wikimedia.org/r/613634 (https://phabricator.wikimedia.org/T253843)
[14:32:48] <wikibugs>	 (03PS8) 10JMeybohm: chartmuseum: Add systemd timer to package and push charts [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843)
[14:34:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/613634 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm)
[14:35:23] <wikibugs>	 10Operations, 10ops-codfw, 10RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (10Eevans) >>! In T256863#6313866, @wiki_willy wrote: > Hi @Eevans - it looks like this was originally scheduled to be refreshed this fiscal year during the annual CapEx planning, but then someone decided to...
[14:45:31] <wikibugs>	 (03PS2) 10ArielGlenn: start restructure of dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/613639 (https://phabricator.wikimedia.org/T254856)
[14:54:04] <wikibugs>	 (03CR) 10Ema: [C: 03+1] "ship it!" [puppet] - 10https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh)
[14:58:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh)
[15:08:21] <wikibugs>	 (03PS3) 10ArielGlenn: start restructure of dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/613639 (https://phabricator.wikimedia.org/T254856)
[15:09:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/613643 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh)
[15:13:31] <wikibugs>	 (03PS1) 10Hnowlan: api-gateway: Restrict unauthenticated write HTTP methods, permit read HTTP methods [deployment-charts] - 10https://gerrit.wikimedia.org/r/613650 (https://phabricator.wikimedia.org/T254906)
[15:16:36] <wikibugs>	 (03PS2) 10Hnowlan: api-gateway: Restrict unauthenticated write HTTP methods, permit read HTTP methods [deployment-charts] - 10https://gerrit.wikimedia.org/r/613650 (https://phabricator.wikimedia.org/T256769)
[15:21:51] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:23:43] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:40:18] <wikibugs>	 (03PS1) 10Elukey: profile::piwik::database: add TLS config for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/613651 (https://phabricator.wikimedia.org/T234826)
[15:45:31] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/23972/matomo1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/613651 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey)
[15:46:09] <wikibugs>	 10Operations, 10Epic, 10Goal: automatically collect network error reports from users' browsers - https://phabricator.wikimedia.org/T257527 (10CDanis) p:05Triage→03Medium a:03CDanis
[16:03:53] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:04:08] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] dnsdist: reload the certificates instead of restarting the service [puppet] - 10https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh)
[16:06:06] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] toolforge: Set Strict-Transport-Security to 366 days [puppet] - 10https://gerrit.wikimedia.org/r/612947 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis)
[16:08:01] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] dnsdist: add a parameter to set the size of the ring buffers [puppet] - 10https://gerrit.wikimedia.org/r/613643 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh)
[16:10:17] <wikibugs>	 (03PS2) 10Ssingh: dnsdist: add a parameter to set the size of the ring buffers [puppet] - 10https://gerrit.wikimedia.org/r/613643 (https://phabricator.wikimedia.org/T252132)
[16:10:17] <wikibugs>	 (03CR) 10Ssingh: [V: 03+2 C: 03+2] dnsdist: add a parameter to set the size of the ring buffers [puppet] - 10https://gerrit.wikimedia.org/r/613643 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh)
[16:17:06] <wikibugs>	 10Operations, 10DNS, 10Traffic: Verify WikimediaFoundation.org ownership for Facebook - https://phabricator.wikimedia.org/T258284 (10Varnent)
[16:21:05] <wikibugs>	 (03PS1) 10CDanis: wikimediafoundation.org domain verification for Facebook [dns] - 10https://gerrit.wikimedia.org/r/613653 (https://phabricator.wikimedia.org/T258284)
[16:29:47] <wikibugs>	 (03CR) 10Hnowlan: api-gateway: Basic envoy chart WIP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan)
[16:30:19] <wikibugs>	 (03PS2) 10CDanis: wikimediafoundation.org domain verification for Facebook [dns] - 10https://gerrit.wikimedia.org/r/613653 (https://phabricator.wikimedia.org/T258284)
[16:32:17] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Revert "admins: Setup home dir for addshore" [puppet] - 10https://gerrit.wikimedia.org/r/613602 (owner: 10Addshore)
[16:32:49] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] wikimediafoundation.org domain verification for Facebook [dns] - 10https://gerrit.wikimedia.org/r/613653 (https://phabricator.wikimedia.org/T258284) (owner: 10CDanis)
[16:34:52] <wikibugs>	 10Operations, 10DNS, 10Traffic, 10Patch-For-Review: Verify WikimediaFoundation.org ownership for Facebook - https://phabricator.wikimedia.org/T258284 (10CDanis) 05Open→03Resolved a:03CDanis `✔️ cdanis@authdns1001.wikimedia.org ~ 🕧☕ host -t txt wikimediafoundation.org wikimediafoundation.org descripti...
[16:41:19] <wikibugs>	 10Operations, 10DNS, 10Traffic, 10Patch-For-Review: Verify WikimediaFoundation.org ownership for Facebook - https://phabricator.wikimedia.org/T258284 (10Varnent) Thank you @CDanis!!
[16:43:15] <wikibugs>	 10Operations, 10ops-codfw, 10RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (10wiki_willy) >>! In T256863#6315360, @Eevans wrote: >>>! In T256863#6313866, @wiki_willy wrote: >> Hi @Eevans - it looks like this was originally scheduled to be refreshed this fiscal year during the annual...
[16:46:42] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+1] "Generated nginx config tested on tools-proxy-06 by manually changing /etc/nginx/sites-available/proxy to match PCC output, restarting ngin" [puppet] - 10https://gerrit.wikimedia.org/r/612948 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis)
[16:50:18] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] toolforge: Perform HTTPS redirects unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/612948 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis)
[16:50:29] <wikibugs>	 (03PS2) 10Andrew Bogott: toolforge: Perform HTTPS redirects unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/612948 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis)
[16:56:26] <wikibugs>	 10Operations, 10Toolforge, 10Traffic, 10HTTPS, and 4 others: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 (10bd808) 05Open→03Resolved mean-time-to-implement: 5 years. Ouch. But it is done now!
[17:00:45] <wikibugs>	 (03PS4) 10ArielGlenn: start restructure of dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/613639 (https://phabricator.wikimedia.org/T254856)
[17:05:54] <wikibugs>	 (03PS1) 10Zfilipin: Add .gitreview file [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/613657 (https://phabricator.wikimedia.org/T255761)
[17:07:09] <wikibugs>	 (03PS1) 10Tks4Fish: Adding 'rollbacker' group for arzwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613658 (https://phabricator.wikimedia.org/T258100)
[17:09:59] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10wiki_willy) a:05wiki_willy→03Jclark-ctr @Jclark-ctr - can you check this one out when you're onsite next?  It was only installed a few months ago, so we should be able to RMA the part pretty ea...
[17:10:58] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[17:12:32] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[17:14:28] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10wiki_willy) @Marostegui - here are the details below on what Dell replaced.  The DIMM A10, the SSD in slot 0, and the system board (though the board wasn't bad...it was just the CMOS...
[17:19:13] <wikibugs>	 (03CR) 10Ppchelko: api-gateway: Basic envoy chart WIP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan)
[17:27:06] <wikibugs>	 10Operations, 10Toolforge, 10Traffic, 10HTTPS, and 4 others: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 (10bd808) Announced to community: https://lists.wikimedia.org/pipermail/cloud-announce/2020-July/000304.html
[17:28:37] <wikibugs>	 (03PS3) 10Dzahn: phabricator: create separate role/profile for aphlict (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/613205 (https://phabricator.wikimedia.org/T257617)
[17:28:52] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] phabricator: create separate role/profile for aphlict (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/613205 (https://phabricator.wikimedia.org/T257617) (owner: 10Dzahn)
[17:32:21] <wikibugs>	 (03PS4) 10Dzahn: phabricator: create separate role/profile for aphlict (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/613205 (https://phabricator.wikimedia.org/T257617)
[17:33:35] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] phabricator: create separate role/profile for aphlict (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/613205 (https://phabricator.wikimedia.org/T257617) (owner: 10Dzahn)
[17:38:58] <logmsgbot>	 !log dpifke@deploy1001 Started deploy [performance/arc-lamp@a5d2fd3]: (no justification provided)
[17:39:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:39:03] <logmsgbot>	 !log dpifke@deploy1001 Finished deploy [performance/arc-lamp@a5d2fd3]: (no justification provided) (duration: 00m 05s)
[17:39:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:21] <wikibugs>	 (03PS5) 10Dzahn: phabricator: create separate role/profile for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/613205 (https://phabricator.wikimedia.org/T257617)
[17:43:54] <wikibugs>	 10Operations, 10ops-eqiad: please connect eqiad's RIPE Atlas anchor to one of the SCSes - https://phabricator.wikimedia.org/T258221 (10wiki_willy) a:03Cmjohnson
[17:47:23] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10herron)
[17:52:05] <wikibugs>	 (03PS1) 10Herron: prometheus[123]001 assign role::prometheus, add to prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/613662 (https://phabricator.wikimedia.org/T243057)
[17:53:34] <wikibugs>	 (03PS2) 10Herron: prometheus[345]001 assign role::prometheus, add to prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/613662 (https://phabricator.wikimedia.org/T243057)
[17:59:23] <wikibugs>	 (03PS5) 10Cwhite: debianization [debs/grafana-loki] (debian/sid) - 10https://gerrit.wikimedia.org/r/610864
[18:01:01] <wikibugs>	 10Operations, 10Arc-Lamp, 10Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (10dpifke) Is there an easy way to grow this partition?  Another 50-100GB would see us through until this data starts living in Swift later this quarter.  (Patches to...
[18:01:49] <wikibugs>	 (03CR) 10Cwhite: "comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/613662 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron)
[18:03:08] <wikibugs>	 (03PS3) 10Herron: prometheus[123]001 assign role::prometheus, add to prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/613662 (https://phabricator.wikimedia.org/T243057)
[18:04:52] <wikibugs>	 10Operations, 10Arc-Lamp, 10Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (10Dzahn) Adding a second virtual disk and mounting it is relatively easy. Resizing the existing partition is not so much.
[18:05:44] <wikibugs>	 (03CR) 10Cwhite: debianization (032 comments) [debs/grafana-loki] (debian/sid) - 10https://gerrit.wikimedia.org/r/610864 (owner: 10Cwhite)
[18:07:28] <wikibugs>	 (03CR) 10Herron: "good catch thanks 👓" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/613662 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron)
[18:07:58] <wikibugs>	 10Operations, 10Arc-Lamp, 10Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (10dpifke) ArcLamp is a batch process, so brief downtime while we move the data over onto a new partition isn't a big deal.  We could even shut down the instance if n...
[18:16:56] <wikibugs>	 (03PS1) 10Ryan Kemper: cirrussearch: Allow 2 dewiki->content shards per node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613669
[18:17:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cirrussearch: Allow 2 dewiki->content shards per node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613669 (owner: 10Ryan Kemper)
[18:18:56] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+1] cirrussearch: Allow 2 dewiki->content shards per node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613669 (owner: 10Ryan Kemper)
[18:19:48] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - dewiki_content_1594699515[0](2020-07-14T04:05:15.243Z) Ryan Kemper Failure to assign shard due to unforeseen implications of row awareness. Working on a patch for this upcoming monday https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:21:43] <wikibugs>	 (03PS5) 10ArielGlenn: start restructure of dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/613639 (https://phabricator.wikimedia.org/T254856)
[18:24:43] <wikibugs>	 10Operations, 10Arc-Lamp, 10Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (10Dzahn) I am adding a second virtual disk right now... in progress....
[18:29:33] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phabricator: create separate role/profile for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/613205 (https://phabricator.wikimedia.org/T257617) (owner: 10Dzahn)
[18:34:37] <wikibugs>	 (03PS1) 10Dzahn: aphlict: comment-out envoy TLS part for now [puppet] - 10https://gerrit.wikimedia.org/r/613672
[18:34:44] <wikibugs>	 10Operations, 10Phabricator, 10Security-Team: HTTP 500 error trying to access phabricator from my server - https://phabricator.wikimedia.org/T257507 (10Isarra) Heeey, it works. I have no idea what you did or what's going on so I cannot comment further. Good day, but thank you for whatever it was!  (Sorry, ha...
[18:34:51] <wikibugs>	 10Operations, 10User-jbond: update profile::waf::apache2::administrative to use the new abuse_networks hiera key - https://phabricator.wikimedia.org/T253632 (10Isarra)
[18:34:53] <wikibugs>	 10Operations, 10Phabricator, 10Security-Team: HTTP 500 error trying to access phabricator from my server - https://phabricator.wikimedia.org/T257507 (10Isarra) 05Open→03Resolved
[18:39:22] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+1] cirrussearch: Allow 2 dewiki->content shards per node (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613669 (owner: 10Ryan Kemper)
[18:41:39] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] aphlict: comment-out envoy TLS part for now [puppet] - 10https://gerrit.wikimedia.org/r/613672 (owner: 10Dzahn)
[18:45:43] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10observability: eqiad: PDU Upgrade in C8 (July 14, 2pm-4pm UTC)) - https://phabricator.wikimedia.org/T257871 (10Jclark-ctr) updated netbox
[18:46:01] <wikibugs>	 (03CR) 10Volans: [C: 03+2] setup.py: add upper limit to prospector version [software/homer] - 10https://gerrit.wikimedia.org/r/613612 (owner: 10Volans)
[18:46:29] <wikibugs>	 (03CR) 10Volans: [C: 03+2] netbox: make Netbox errors surface through Jinja [software/homer] - 10https://gerrit.wikimedia.org/r/613613 (owner: 10Volans)
[18:46:31] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10observability: eqiad: PDU Upgrade in C8 (July 14, 2pm-4pm UTC)) - https://phabricator.wikimedia.org/T257871 (10Jclark-ctr) 05Open→03Resolved
[18:46:33] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:TBD) rack/setup/install rows C and D new PDUs - https://phabricator.wikimedia.org/T253694 (10Jclark-ctr)
[18:47:13] <wikibugs>	 (03Merged) 10jenkins-bot: setup.py: add upper limit to prospector version [software/homer] - 10https://gerrit.wikimedia.org/r/613612 (owner: 10Volans)
[18:47:37] <wikibugs>	 (03Merged) 10jenkins-bot: netbox: make Netbox errors surface through Jinja [software/homer] - 10https://gerrit.wikimedia.org/r/613613 (owner: 10Volans)
[18:49:49] <wikibugs>	 (03PS1) 10Dzahn: aphlict: no longer require Phabricator class itself [puppet] - 10https://gerrit.wikimedia.org/r/613674
[18:50:31] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10Jclark-ctr) Thanks willy   Yes.  DIMM A10 SSD slot 0  Main board
[19:02:53] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/613662 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron)
[19:06:25] <wikibugs>	 (03PS1) 10Ammarpad: Switch  $wgUrlShortenerDomainsWhitelist -> $wgUrlShortenerAllowedDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613682 (https://phabricator.wikimedia.org/T258134)
[19:08:03] <wikibugs>	 (03PS2) 10Ammarpad: Switch  $wgUrlShortenerDomainsWhitelist -> $wgUrlShortenerAllowedDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613682 (https://phabricator.wikimedia.org/T255491)
[19:10:31] <wikibugs>	 (03CR) 10Ammarpad: "Hi @JdForester, can you comment whether it's OK to switch this directly like this?." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613682 (https://phabricator.wikimedia.org/T255491) (owner: 10Ammarpad)
[19:15:32] <wikibugs>	 (03PS1) 10Ammarpad: Remove wgPopupsPageBlacklist config setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613684 (https://phabricator.wikimedia.org/T254676)
[19:20:04] <wikibugs>	 (03CR) 10Jdlrobson: [C: 03+1] "I guess this will have to wait till Monday now? LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613684 (https://phabricator.wikimedia.org/T254676) (owner: 10Ammarpad)
[19:24:41] <wikibugs>	 (03CR) 10Ammarpad: "> I guess this will have to wait till Monday now? LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613684 (https://phabricator.wikimedia.org/T254676) (owner: 10Ammarpad)
[19:27:45] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "noop in prod: https://puppet-compiler.wmflabs.org/compiler1002/23973/" [puppet] - 10https://gerrit.wikimedia.org/r/613674 (owner: 10Dzahn)
[19:42:14] <wikibugs>	 (03PS1) 10Dzahn: aphlict: use default basedir, not custom [puppet] - 10https://gerrit.wikimedia.org/r/613697
[20:12:57] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "noop in prod https://puppet-compiler.wmflabs.org/compiler1003/23974/" [puppet] - 10https://gerrit.wikimedia.org/r/613697 (owner: 10Dzahn)
[20:18:12] <mutante>	 arr, got disconnected while having puppet-merge open and now blocking myself "failed to lock"
[20:24:43] <sukhe>	 mutante: I am curious what to do when that happens, so please share :P
[20:26:42] <cdanis>	 you should be able to rm the file
[20:27:17] <cdanis>	 although the script was written in such a way that it *should* have been cleaned up on any termination..
[20:27:56] <sukhe>	 ok, great. I would be very hesitant to rm anything on puppetmaster otherwise :)
[20:28:38] <mutante>	 sukhe: ok, update. i just could not get online at all for a while and now when i did it was back to normal by itself
[20:28:48] <mutante>	 and i could repeat puppet-merge normally
[20:29:07] <sukhe>	 mutante: ah! so maybe it was the clean termination cdanis mentioned
[20:29:22] <mutante>	 i also had a process running to add a virtual disk on a ganeti VM and it wasn't in screen ..so let's see about that...
[20:29:28] <mutante>	 sukhe: looks like it, yes
[20:31:22] <mutante>	 well, it had to timeout 
[20:31:42] <cdanis>	 sure, that makes sense -- once the connection got reset, it probably cleaned up
[20:32:07] <mutante>	 *nod* 
[20:32:42] <mutante>	 ok, now that puppetmaster is clean.. i gotta run and bbl, thx
[21:16:16] <dpifke>	 !log Removing MongoDB packages and data from webperf1002. 
[21:16:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:45] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 46 probes of 649 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:35:23] <wikibugs>	 (03PS1) 10Ladsgroup: osm: Rename "slave" to "replica" [puppet] - 10https://gerrit.wikimedia.org/r/613727 (https://phabricator.wikimedia.org/T254646)
[21:44:31] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 8 probes of 649 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:48:51] <wikibugs>	 (03PS5) 10Urbanecm: Initial configuration for lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: 10RhinosF1)
[21:49:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: 10RhinosF1)
[21:50:30] <wikibugs>	 (03PS1) 10Ladsgroup: mariadb: Changing link to section from "slave" to "replica" [puppet] - 10https://gerrit.wikimedia.org/r/613732 (https://phabricator.wikimedia.org/T254646)
[21:51:20] <RhinosF1>	 Ty Urbanecm
[21:57:22] <wikibugs>	 (03PS2) 10Urbanecm: Initial configuration for arywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611934 (https://phabricator.wikimedia.org/T257674)
[21:58:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for arywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611934 (https://phabricator.wikimedia.org/T257674) (owner: 10Urbanecm)
[22:00:50] <wikibugs>	 (03PS3) 10Urbanecm: Initial configuration for arywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611934 (https://phabricator.wikimedia.org/T257674)
[22:03:31] <wikibugs>	 (03PS3) 10Urbanecm: Initial configuration for avkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943)
[22:05:12] <wikibugs>	 (03PS6) 10Urbanecm: Initial configuration for lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: 10RhinosF1)
[22:06:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: 10RhinosF1)
[22:07:23] <wikibugs>	 (03PS7) 10Urbanecm: Initial configuration for lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: 10RhinosF1)
[22:32:35] <wikibugs>	 10Operations, 10Project-Admins: Rename #Operations Phab project to #WMF-SRE (or so) - https://phabricator.wikimedia.org/T258305 (10Aklapper)
[22:40:44] <mutante>	 RhinosF1: 3378 miraheze wikis, is that possible?
[23:08:46] <wikibugs>	 10Operations, 10Arc-Lamp, 10Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (10Dzahn) I got disconnected while the command was still running and failed to run it in screen.   And now "gnt-instance info" just hangs.. ugh...
[23:16:21] <mutante>	 dpifke: is it ok to gzip some older logs in /srv/xenon on webperf1002?
[23:17:23] <dpifke>	 Not until https://gerrit.wikimedia.org/r/c/performance/arc-lamp/+/613740 lands.
[23:17:45] <dpifke>	 Last I heard was that we were going to grow /srv - see https://phabricator.wikimedia.org/T257931
[23:19:24] <mutante>	 dpifke: i tried to add a second disk but right now i can't even run the "info" command on it
[23:19:59] <mutante>	 how about moving some of the oldest files to somewhere in / ..hrmm
[23:20:40] <mutante>	 trying it again with a smaller disk and running the command in a screen
[23:23:15] <dpifke>	 We could compress some old files so long as we leave zero-length placeholders with the same name so the SVGs don't get deleted.  (It removes any SVG which doesn't have a corresponding log.)
[23:23:36] <dpifke>	 That'll break anyone grepping through the logfiles over the weekend, but probably not a big deal.
[23:25:01] <mutante>	 dpifke: so if i'd add /srv2 with, for example 20 more Gigabytes, that would not even help because the /srv/xenon needs to be in a single place?
[23:25:09] <mutante>	 then let's do what you suggested 
[23:25:25] <dpifke>	 Correct (unless we get tricky with bind mounts).
[23:26:01] <dpifke>	 If the problem adding a disk is that the Ganeti instance is running, it's OK to shut it down briefly.  It'll catch up when it comes back.
[23:26:56] <mutante>	 i don't think the issue is that it's running
[23:27:04] <dpifke>	 Also, a bunch of old files will expire when midnight UTC rolls around in ~30 min - we're just about at the peak for today.
[23:27:10] <mutante>	 creating the virtual disk should be separate 
[23:27:17] <mutante>	 ah, that's good
[23:27:31] <mutante>	 about 5G left until then
[23:32:44] <mutante>	 dpifke: find /srv/xenon/logs/daily -mtime +30 -size +100M -exec gzip {} \; -exec touch {};  ?
[23:33:16] <mutante>	 well, another \ at the end. but the second -exec should only run if the first is successful
[23:36:22] <dpifke>	 Hmm... I just realized that's going to cause the SVGs to be regenerated against the empty logs.
[23:36:46] <dpifke>	 (Because the timestamp will change, and we were using ctime instead of mtime.)
[23:36:50] <mutante>	 hmm, ok, if we can survive until midnight with those 5G that would be easiest then
[23:37:03] <dpifke>	 We should definitely last another few days.
[23:37:13] <mutante>	 ok, then let's do .. nothing 
[23:37:45] <mutante>	 for today 
[23:37:58] <dpifke>	 But I think I'm OK with expiring (deleting) 2-3 days of logs early, so we know for sure it's not going to run into problems over the weekend.
[23:38:48] <mutante>	 ok, want me to do anything or you got it?
[23:40:13] <mutante>	 meanwhile i tried to create just a 20G disk to see what happens and it's .. just sitting there. i'll check on it again later and if all fails ask Alex for help
[23:41:19] <dpifke>	 I'll see how much is freed in 20 min, and delete a couple of additional days if it looks like not enough.
[23:42:12] <mutante>	 oh right, midnight UTC :) ack
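Note on the compress-and-placeholder idea discussed between 23:23 and 23:36: it can be sketched roughly as below. This is a minimal illustration, not what was run on webperf1002. The function name `compress_with_placeholder` and its parameters are hypothetical, and the age and size checks only approximate `find -mtime +30 -size +100M`. The sketch restores the original mtime on the placeholder, but ctime cannot be set from user space, so a consumer that compares ctime (as dpifke describes) would still see the file as changed.

```python
import gzip
import os
import shutil
import time
from pathlib import Path


def compress_with_placeholder(log_dir, min_age_days=30, min_size=100 * 1024 * 1024):
    """Gzip old, large logs and leave a zero-length file under the original name.

    Hypothetical helper: approximates `find -mtime +30 -size +100M`.
    Returns the list of original paths that were compressed.
    """
    cutoff = time.time() - min_age_days * 86400
    compressed = []
    # Materialize the listing first, because new .gz files are created below.
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file() or path.suffix == ".gz":
            continue
        st = path.stat()
        if st.st_mtime >= cutoff or st.st_size <= min_size:
            continue
        gz_path = path.with_name(path.name + ".gz")
        with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        # Truncate in place so the name survives as a zero-length placeholder.
        with open(path, "wb"):
            pass
        # Restore atime/mtime on both files; ctime still changes here.
        os.utime(path, (st.st_atime, st.st_mtime))
        os.utime(gz_path, (st.st_atime, st.st_mtime))
        compressed.append(path)
    return compressed
```

Since restoring mtime does not help a tool that keys on ctime, waiting for the midnight UTC expiry or deleting a few days of logs early, as decided in the channel, avoids the regeneration problem entirely.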
[23:44:29] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:46:21] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:58:59] <wikibugs>	 10Operations, 10Pywikibot, 10cloud-services-team (Kanban): http://pywikibot.org/ is displaying Wikimedia error page - https://phabricator.wikimedia.org/T257536 (10Dzahn) Currently I don't see any NS record for pywikbot.org   ` dig NS pywikibot.org ... ; <<>> DiG 9.11.5-P4-5.1+deb10u1-Debian <<>> NS pywikibot...