[00:00:00] 10Operations, 10Upstream, 10cloud-services-team (Kanban): New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS - https://phabricator.wikimedia.org/T169290 (10Bstorm) [00:07:05] (03PS1) 10Bstorm: cloud NFS: remove the nfsiostat diammond collector [puppet] - 10https://gerrit.wikimedia.org/r/604931 (https://phabricator.wikimedia.org/T210993) [00:10:18] !log BACON is done [00:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:27] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 49 probes of 573 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:10:54] Amir1: have we got a logo? https://www.24a11y.com/wp-content/uploads/push-button-receive-bacon.png [00:11:00] (03CR) 10Bstorm: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/23180/" [puppet] - 10https://gerrit.wikimedia.org/r/604931 (https://phabricator.wikimedia.org/T210993) (owner: 10Bstorm) [00:11:35] Reedy: Yes, it's after death of scap pig: https://phabricator.wikimedia.org/tag/scap/ [00:11:57] pigs may fly (into the bacon machine) [00:12:12] lol [00:14:37] (03PS1) 10Papaul: DHCP PARTMAN add alert2001 [puppet] - 10https://gerrit.wikimedia.org/r/604933 (https://phabricator.wikimedia.org/T255070) [00:14:51] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By:TBD) rack/setup/install alert2001 - https://phabricator.wikimedia.org/T255070 (10Papaul) [00:18:53] 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Retire and remove module labs_debrepo - https://phabricator.wikimedia.org/T153612 (10Bstorm) Adding WMCS to see if this is actually done or not. [00:38:01] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:09] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:39:11] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 59 probes of 573 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:39:49] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:39:59] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:44:57] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 48 probes of 573 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:22:41] (03PS1) 10BrandonXLF: Drop simplewiki mainpage special casing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604973 (https://phabricator.wikimedia.org/T32405) [01:36:10] (03CR) 10Papaul: [C: 03+2] DHCP PARTMAN add alert2001 [puppet] - 10https://gerrit.wikimedia.org/r/604933 (https://phabricator.wikimedia.org/T255070) (owner: 10Papaul) [02:36:22] (03PS1) 10Krinkle: private: Add documentation for PrivateSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605005 [02:37:08] (03PS2) 10Krinkle: private: Add documentation for PrivateSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605005 [02:43:06] (03PS3) 10Krinkle: private: Add documentation for PrivateSettings.php [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/605005 [02:46:07] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:47:57] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:59:29] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Jidanni) I guess I will have to upload indirectly via phabricator until this is fixed... {F31862433} [04:14:56] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Jidanni) OK. Made. https://commons.wikimedia.org/wiki/File:Bugu.ogg Idea: Perhaps make a page, that combines phabricator's upload form, and... 
[04:54:28] !log Deploy schema change on s6 codfw - T250066 [04:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:32] T250066: text table still has old_* fields and indexes on some hosts - https://phabricator.wikimedia.org/T250066 [05:02:03] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [05:03:11] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:03:44] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [05:04:21] (03PS1) 10Marostegui: mariadb: Reimage db2086 [puppet] - 10https://gerrit.wikimedia.org/r/605058 (https://phabricator.wikimedia.org/T250666) [05:04:59] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:05:14] (03CR) 10Marostegui: [C: 03+2] mariadb: Reimage db2086 [puppet] - 10https://gerrit.wikimedia.org/r/605058 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [05:23:30] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/604715 (https://phabricator.wikimedia.org/T245114) (owner: 10Volans) [05:28:31] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [05:33:05] (03PS1) 10Muehlenhoff: ntp: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/605071 [05:40:06] !log installing buster kernel security updates (no reboots yet) [05:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:52] !log installing stretch kernel security updates (no reboots yet) [05:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:13] (03PS2) 10Muehlenhoff: ntp: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/605071 [05:51:34] (03CR) 10jerkins-bot: [V: 04-1] ntp: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/605071 (owner: 10Muehlenhoff) [05:53:14] (03PS3) 10Muehlenhoff: ntp: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/605071 [05:57:23] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/23182/" [puppet] - 10https://gerrit.wikimedia.org/r/605071 (owner: 10Muehlenhoff) [06:37:59] (03PS1) 10Elukey: Replace kafka[12]00[123] with kafka-main* in analyitcs-in4/6 filters [homer/public] - 10https://gerrit.wikimedia.org/r/605098 (https://phabricator.wikimedia.org/T252675) [06:40:45] (03PS1) 10Elukey: Add archiva1002 IPs to analytics-in4/6 filters [homer/public] - 10https://gerrit.wikimedia.org/r/605136 
(https://phabricator.wikimedia.org/T252767) [06:42:11] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:44:20] (03CR) 10Elukey: [C: 03+1] "kafka part looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/602459 (https://phabricator.wikimedia.org/T252913) (owner: 10Herron) [06:45:12] (03CR) 10Elukey: [C: 03+2] Add archiva-new.wikimedia.org as CNAME to archiva1002 [dns] - 10https://gerrit.wikimedia.org/r/604734 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [06:45:16] (03PS2) 10Elukey: Add archiva-new.wikimedia.org as CNAME to archiva1002 [dns] - 10https://gerrit.wikimedia.org/r/604734 (https://phabricator.wikimedia.org/T252767) [06:51:21] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200612T0700) [07:02:15] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:04:43] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (10Dzahn) a:03Dzahn [07:05:51] !log installing intel-microcode security updates (regressions have been sorted out) [07:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:47] most of the errors seems for wikidata on mw1384 [07:07:51] !log depool/scap pull/pool mw1384 [07:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:03] probably it will not change much but let's see [07:08:52] !log Reimage db2086 [07:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:59] (03CR) 10Dzahn: [C: 04-1] "The mandatory parameter notes_url is missing. Please add it with a value of a Wikitech URL that describes what this monitoring check is ab" [puppet] - 10https://gerrit.wikimedia.org/r/604850 (https://phabricator.wikimedia.org/T255189) (owner: 10Gilles) [07:11:15] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 53 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:11:17] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:12:40] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/602459 (https://phabricator.wikimedia.org/T252913) (owner: 10Herron) [07:13:52] (03CR) 10Dzahn: [C: 04-1] "also, do you want to set the contact_group parameter? who should be notified about this?" [puppet] - 10https://gerrit.wikimedia.org/r/604850 (https://phabricator.wikimedia.org/T255189) (owner: 10Gilles) [07:17:05] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:18:57] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:19:59] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:24:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [07:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:23] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:26:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:26] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [07:31:44] elukey: also took a look at errors. seems they are coming from testwiki. and "The content model 'CollaborationHubContent' is not registered on this wiki." ? [07:32:04] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:32:06] since it's testwiki and compared to other spike in last 24h it does not seem that critical [07:32:24] mutante: that's known [07:32:27] i'll make a ticket i guess [07:32:30] RhinosF1: ah! 
[07:32:50] mutante: https://phabricator.wikimedia.org/T255107 [07:32:54] RhinosF1: thanks [07:33:04] James_F and Urbanecm already dealt with it [07:33:42] (PS3) Gilles: Add monitoring for ASO ranking report [puppet] - https://gerrit.wikimedia.org/r/604850 (https://phabricator.wikimedia.org/T255189) [07:33:56] RhinosF1: well, except it does seem to be happening right now [07:34:08] maybe i should reopen that [07:34:34] (PS4) Gilles: Add monitoring for ASO ranking report [puppet] - https://gerrit.wikimedia.org/r/604850 (https://phabricator.wikimedia.org/T255189) [07:34:50] mutante: in https://phabricator.wikimedia.org/T255107#6215380 they say the noise is accepted [07:35:00] As long as the page causing it is deleted [07:35:02] (PS1) Marostegui: db2086: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/605150 [07:35:42] (CR) jerkins-bot: [V: -1] Add monitoring for ASO ranking report [puppet] - https://gerrit.wikimedia.org/r/604850 (https://phabricator.wikimedia.org/T255189) (owner: Gilles) [07:36:29] (CR) Marostegui: [C: +2] db2086: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/605150 (owner: Marostegui) [07:37:59] RhinosF1: ok. and actually, seems like that type of error stopped yesterday indeed. [07:38:28] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/602459 (https://phabricator.wikimedia.org/T252913) (owner: Herron) [07:38:39] mutante: ack. 
[07:38:43] as well as the current one [07:38:49] (03PS5) 10Gilles: Add monitoring for ASO ranking report [puppet] - 10https://gerrit.wikimedia.org/r/604850 (https://phabricator.wikimedia.org/T255189) [07:38:51] * RhinosF1 is actually starting to get sql now [07:44:28] (03CR) 10Dzahn: [C: 04-1] "compiler output is looking good now: https://puppet-compiler.wmflabs.org/compiler1003/23186/stat1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/604850 (https://phabricator.wikimedia.org/T255189) (owner: 10Gilles) [07:45:20] gilles: was the new check supposed to be on stat1004? [07:49:31] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:51:37] hrmm. yes, Wikibase Special:EntityData [07:51:52] it is the same as the last alert [07:52:00] mutante: I think that in practice all stat machines have that mount [07:52:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2084 from s4 and s5', diff saved to https://phabricator.wikimedia.org/P11476 and previous config saved to /var/cache/conftool/dbconfig/20200612-075202-marostegui.json [07:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:51] gilles: ah, should it be a separate check on each of the servers or would that be duplicate? [07:52:52] mutante: so it shouldn’t matter on which one the check runs, or if it runs on multiple stat hosts [07:53:10] mutante: doesn’t matter as long as it runs on at least one host [07:53:20] gilles: we would like to avoid duplicate checks though unless it actually makes sense to monitor on all of them [07:53:36] gilles: so far the file is not in that location. is that expected? 
[07:54:05] cd: /srv/published-datasets/performance: No such file or directory [07:54:16] (03PS1) 10Marostegui: mariadb: Move db2084 to s8 [puppet] - 10https://gerrit.wikimedia.org/r/605154 [07:54:51] mutante: it's not mounted on stat1004? [07:55:44] gilles: nope, there is no "performance" under /srv/published-datasets there. i picked stat1004 because the compiler picked it when i said "on whatever uses the profile::analytics::asoranking class [07:55:45] mutante: in that case I guess it does need to run on stat1007 [07:56:13] (03PS1) 10Awight: [beta] Enable a test survey to exercise new features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605155 (https://phabricator.wikimedia.org/T254322) [07:56:19] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:56:21] gilles: alright, so we need to find a way in puppet to limit it to just stat1007 while the profile is also used on multiple machines.. let's see [07:56:28] !log depool mw1384 [07:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:32] yeah I don't know how to do that [07:57:30] gilles, mutante - let's chat one second about alarms on stat boxes [07:57:38] (03CR) 10Awight: [C: 03+2] "Beta-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605155 (https://phabricator.wikimedia.org/T254322) (owner: 10Awight) [07:58:28] (03Merged) 10jenkins-bot: [beta] Enable a test survey to exercise new features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605155 (https://phabricator.wikimedia.org/T254322) (owner: 10Awight) [07:58:30] what is the check that is needed? [07:58:49] also, why are we adding alarms to client nodes? 
[07:58:49] that my cron job on stat1007 to generate a report once a month succeeded [07:58:51] elukey: thanks for the depool, notable is that server is all alone in D8 [07:58:52] D8: Add basic .arclint that will handle pep8 and pylint checks - https://phabricator.wikimedia.org/D8 [07:59:11] it broke previously and we didn't notice for a few months [07:59:14] elukey: the check is for the age of a file autonomoussystems/latest.tsv [07:59:47] gilles: ok can we possibly do it using a systemd timer that fails if the return code is non-zero, raising an icinga alert? [08:00:06] if you have an example, sure [08:00:33] can that be done inside crontab? is that a command that wraps the actual script I want to run? [08:00:40] gilles: you can look for systemd::timer::job [08:00:47] gilles: it replaces crontab [08:00:50] (CR) Kormat: [C: +1] mariadb: Move db2084 to s8 [puppet] - https://gerrit.wikimedia.org/r/605154 (owner: Marostegui) [08:01:11] we'll still need that to target the one host the script is deployed on [08:01:23] but at the moment all the stat100x hosts have the same role, the idea was to move them to pure client nodes [08:01:59] there are some exceptions, like profile::statistics::explorer::misc_jobs [08:02:01] (CR) Marostegui: [C: +2] mariadb: Move db2084 to s8 [puppet] - https://gerrit.wikimedia.org/r/605154 (owner: Marostegui) [08:02:27] one way to do it is to have some kind of "active_host" or whatever key in Hiera and then do an "if $active_server == $::fqdn" in the puppet profile [08:02:28] gilles: yes, so at the moment the repo is deployed on all stats [08:02:40] mutante: yes see profile::statistics::explorer::misc_jobs [08:03:06] but again, those are client nodes, this is borderline [08:03:35] anyway, one possibility is to move the ranking repo + a systemd timer to profile::statistics::explorer::misc_jobs [08:03:53] $hosts_with_jobs ? 
[08:04:41] yes it is poorly named, currently it is stat1007 [08:04:47] the only one with jobs running [08:04:56] the idea was to gradually remove all of them [08:05:06] should I turn my stuff into a class like statistics::wmde? [08:05:09] i see, it already has if $::hostname in $hosts_with_jobs { [08:05:19] so it would just have to be added to that section [08:05:42] gilles: looks like you should add it to profile::statistics::explorer::misc_jobs [08:05:58] that's where the class I've just mentioned is included [08:06:32] elukey: it needs kerberos, does that mean I should use kerberos::systemd_timer ? [08:07:12] gilles: correct [08:07:12] like in statistics::discovery [08:07:15] ok [08:07:42] elukey: so should I do something like statistics::discovery ? and move the scap definition in my new class as well? [08:07:59] statistics::performance [08:08:03] gilles: could be an idea yes [08:08:12] note: if it's done this way it will alert on any systemd issue on the host, not just that one specific timer [08:08:13] ok, I'll give that a shot today [08:08:31] mutante: ? [08:08:33] so doing custom contactgroups might or might not be working [08:08:59] if the timer fails it will raise a specific icinga alarm for that timer [08:09:01] elukey: if the systemd timer fails it will show up as systemd failed unit [08:09:10] yes [08:09:22] the generic systemd status from base will trigger [08:09:32] but that one will not have a custom contactgroup [08:09:40] it will just alert admins, not performance-team [08:09:50] sure this is the downside, we already do it for analytics [08:10:04] the previous method would have been more eh.. "dedicated" [08:10:13] ok, ack, just wanted to mention it [08:10:15] elukey: I'll put this together after lunch and assign the patch to you for review [08:10:20] ok! 
[08:10:37] my long term plan is to create a small vm and move all the recurrent non-analytics-team jobs there [08:10:46] because stat100x should be pure client nodes [08:10:53] (as FYI) [08:11:14] (03Abandoned) 10Gilles: Add monitoring for ASO ranking report [puppet] - 10https://gerrit.wikimedia.org/r/604850 (https://phabricator.wikimedia.org/T255189) (owner: 10Gilles) [08:11:18] (03PS6) 10Ema: purged: make Kafka cluster name configurable [puppet] - 10https://gerrit.wikimedia.org/r/604743 (https://phabricator.wikimedia.org/T254844) [08:11:59] sounds good [08:12:17] it always felt hacky to me to have it run on there [08:13:01] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/604743 (https://phabricator.wikimedia.org/T254844) (owner: 10Ema) [08:14:29] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [08:14:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2084 from s4 and s5', diff saved to https://phabricator.wikimedia.org/P11477 and previous config saved to /var/cache/conftool/dbconfig/20200612-081455-marostegui.json [08:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:28] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [08:20:16] so back to mw1384, exceptions are gone [08:20:29] but it is not clear to me what happened [08:21:15] mutante: ah I have just read the comment about D8, I didn't see it [08:21:15] D8: Add basic .arclint that will handle pep8 and pylint checks - https://phabricator.wikimedia.org/D8 [08:21:23] (didn't see the comment in the chan I mean) [08:22:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [08:22:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:29] (03CR) 10Ema: [C: 03+2] purged: make Kafka cluster name configurable [puppet] - 10https://gerrit.wikimedia.org/r/604743 (https://phabricator.wikimedia.org/T254844) (owner: 10Ema) [08:24:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:33] elukey: yea, just happened to notice it is the lone appserver in that rack.. but i can't say i see much else. except it was wikibase related. [08:29:19] 10Operations, 10Wikimedia-Logstash, 10observability: Increase Logstash ingestion capacity - https://phabricator.wikimedia.org/T255243 (10fgiunchedi) [08:32:00] (03CR) 10Ayounsi: [C: 03+1] Add archiva1002 IPs to analytics-in4/6 filters [homer/public] - 10https://gerrit.wikimedia.org/r/605136 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [08:32:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2080 to clone db2084', diff saved to https://phabricator.wikimedia.org/P11478 and previous config saved to /var/cache/conftool/dbconfig/20200612-083231-marostegui.json [08:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:56] (03CR) 10Ayounsi: [C: 03+1] Replace kafka[12]00[123] with kafka-main* in analyitcs-in4/6 filters [homer/public] - 10https://gerrit.wikimedia.org/r/605098 (https://phabricator.wikimedia.org/T252675) (owner: 10Elukey) [08:36:47] !log Clone db2084 from db2080 [08:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:03] XioNoX: o/ since you are in review mode :) https://gerrit.wikimedia.org/r/#/c/operations/homer/public/+/604810/ [08:37:11] (ok if I deploy those afterwards?) 
[08:37:53] ACKNOWLEDGEMENT - MariaDB Slave Lag: s8 on db2080 is CRITICAL: CRITICAL slave_sql_lag could not connect Marostegui known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [08:37:53] ACKNOWLEDGEMENT - MariaDB read only s8 on db2080 is CRITICAL: Could not connect to localhost:3306 Marostegui known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:37:53] ACKNOWLEDGEMENT - mysqld processes on db2080 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Marostegui known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [08:40:38] (03CR) 10Ayounsi: [C: 03+1] Add kafka-jumbo100[7-9] to analytics-in4 and analytics-in6 filters [homer/public] - 10https://gerrit.wikimedia.org/r/604810 (https://phabricator.wikimedia.org/T252675) (owner: 10Elukey) [08:40:59] elukey: yep! [08:43:28] thanks! [08:43:42] (03CR) 10Elukey: [C: 03+2] Add kafka-jumbo100[7-9] to analytics-in4 and analytics-in6 filters [homer/public] - 10https://gerrit.wikimedia.org/r/604810 (https://phabricator.wikimedia.org/T252675) (owner: 10Elukey) [08:43:50] (03CR) 10Elukey: [C: 03+2] Replace kafka[12]00[123] with kafka-main* in analyitcs-in4/6 filters [homer/public] - 10https://gerrit.wikimedia.org/r/605098 (https://phabricator.wikimedia.org/T252675) (owner: 10Elukey) [08:43:56] (03PS1) 10Marostegui: db2092: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/605161 (https://phabricator.wikimedia.org/T254462) [08:43:58] (03CR) 10Elukey: [C: 03+2] Add archiva1002 IPs to analytics-in4/6 filters [homer/public] - 10https://gerrit.wikimedia.org/r/605136 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [08:44:21] !log Compress InnoDB on db2092 - T254462 [08:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:25] T254462: Compress enwiki InnoDB tables - https://phabricator.wikimedia.org/T254462 [08:44:31] (03CR) 10Marostegui: [C: 03+2] db2092: Disable 
notifications [puppet] - 10https://gerrit.wikimedia.org/r/605161 (https://phabricator.wikimedia.org/T254462) (owner: 10Marostegui) [08:45:01] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 119 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [08:46:56] 10Operations, 10Core Platform Team, 10Traffic, 10Patch-For-Review: Configure purged in deployment-prep - https://phabricator.wikimedia.org/T254844 (10ema) 05Open→03Resolved a:03ema Both deployment-cache-text06 and deployment-cache-upload06 are now reading purges from Kafka. Closing! [08:48:05] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=gerrit site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:48:37] !log update cr1/cr2 analyitics filters for T252767 and T252675 [08:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:42] T252675: Add new kafka brokers kafka-jumbo100[789] to the jumbo-eqiad Kafka cluster - https://phabricator.wikimedia.org/T252675 [08:48:42] T252767: Move Archiva to Debian Buster - https://phabricator.wikimedia.org/T252767 [08:49:27] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [08:49:38] wow [08:50:06] gerrit unreachable for me [08:50:10] yeah, same [08:50:17] ugh..is somebody restarting gerrit to work on that ticket about the missing dashboard plugin? [08:50:36] active (running) since Wed 2020-05-13 18:54:10 UTC; 4 weeks 1 days ago [08:50:49] can I restart it? 
[08:51:01] yes, please [08:51:35] !log restart gerrit on gerrit1001 [08:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:38] https://grafana.wikimedia.org/d/Bw2mQ3iWz/gerrit-javamelody?panelId=16&fullscreen&orgId=1 is also indicating a rise in active threads [08:52:52] (the restart command is still stuck) [08:53:09] ok should be up now [08:53:13] it's normal so far [08:53:18] coming back in a moment [08:53:24] yeah, works for me now [08:53:25] there it is [08:53:37] ugh yea, that is a huge spike in threads there [08:53:50] IIRC there was a recurrent issue about it [08:53:51] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27995 bytes in 0.118 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [08:54:39] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, the base diamond collector directory does recurse/purge, so this gets automatically cleaned up on affected instances." [puppet] - 10https://gerrit.wikimedia.org/r/604931 (https://phabricator.wikimedia.org/T210993) (owner: 10Bstorm) [08:54:51] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:14:06] (03CR) 10Dzahn: "password updated for host/gerrit1002 in private hieradata" [puppet] - 10https://gerrit.wikimedia.org/r/604343 (https://phabricator.wikimedia.org/T254516) (owner: 10Dzahn) [09:18:40] (03CR) 10Dzahn: [C: 03+2] "change on gerrit1002, noop on gerrit1001/gerrit2001: https://puppet-compiler.wmflabs.org/compiler1001/23184/" [puppet] - 10https://gerrit.wikimedia.org/r/604343 (https://phabricator.wikimedia.org/T254516) (owner: 10Dzahn) [09:23:56] (03CR) 10Jbond: [C: 03+2] labtestpuppet: puppet::servers [puppet] - 10https://gerrit.wikimedia.org/r/604696 (https://phabricator.wikimedia.org/T254491) (owner: 10Jbond) [09:25:22] (03CR) 10Dzahn: "on gerrit1002:" [puppet] - 10https://gerrit.wikimedia.org/r/604343 (https://phabricator.wikimedia.org/T254516) (owner: 10Dzahn) [09:33:36] (03CR) 10Jbond: [C: 03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/592246 (owner: 10Ayounsi) [09:34:18] (03CR) 10Jcrespo: [C: 03+1] "Tested ok." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/604379 (https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [09:34:34] (03CR) 10Jbond: [C: 03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/592251 (owner: 10Ayounsi) [09:37:53] (03CR) 10Jbond: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/592938 (https://phabricator.wikimedia.org/T191667) (owner: 10Ayounsi) [09:40:00] (03PS2) 10Volans: scripts: complete interface automation generation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/601877 (https://phabricator.wikimedia.org/T233183) [09:41:23] 10Operations, 10SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (10jbond) >>! 
In T254939#6216238, @AndrewKuznetsov wrote: > Sorry about the confusion with the username, I log i... [09:44:43] (03CR) 10Jcrespo: "A few minor wording/meaning corrections, once those are done this can go." (033 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/602618 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [09:46:42] 10Operations, 10fundraising-tech-ops, 10netops, 10WMF-NDA: Deploy pfw policy 1591901800 for T122104 - https://phabricator.wikimedia.org/T255185 (10jbond) p:05Triage→03Medium [09:48:03] 10Operations, 10Release-Engineering-Team-TODO, 10Scap, 10Sustainability (Incident Prevention), 10User-brennen: scap's logstash_checker.py is blissfully unaware of any logstash indexing latency - https://phabricator.wikimedia.org/T255197 (10jbond) p:05Triage→03Medium [09:48:43] 10Operations, 10Wikimedia-Logstash, 10observability: Increase Logstash ingestion capacity - https://phabricator.wikimedia.org/T255243 (10jbond) p:05Triage→03Medium [09:51:10] (03CR) 10Marostegui: [C: 03+2] check_mariadb.py: Add check for the event_scheduler [puppet] - 10https://gerrit.wikimedia.org/r/604379 (https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [09:52:52] (03PS1) 10Dzahn: add IPs for releases1002/releases2002 [dns] - 10https://gerrit.wikimedia.org/r/605176 (https://phabricator.wikimedia.org/T247652) [09:54:26] (03PS2) 10Dzahn: add IPs for releases1002/releases2002 [dns] - 10https://gerrit.wikimedia.org/r/605176 (https://phabricator.wikimedia.org/T247652) [09:54:31] (03PS4) 10Jcrespo: mariadb-backups: Disable transfer.py logging to systemd [puppet] - 10https://gerrit.wikimedia.org/r/602636 [09:58:17] (03PS2) 10Filippo Giunchedi: swift: remove swift-container-sharder unit [puppet] - 10https://gerrit.wikimedia.org/r/604623 (https://phabricator.wikimedia.org/T252186) [09:58:20] (03PS1) 10Filippo Giunchedi: prometheus: enable thanos upload in ops eqsin/ulsfo/codfw [puppet] - 10https://gerrit.wikimedia.org/r/605177 
(https://phabricator.wikimedia.org/T252186) [09:58:25] (03PS1) 10Filippo Giunchedi: prometheus: enable thanos upload in ops eqiad [puppet] - 10https://gerrit.wikimedia.org/r/605178 (https://phabricator.wikimedia.org/T252186) [09:58:35] !log roll-restart thanos-fe / thanos-be for microcode updates [09:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Include db2084 in dbctl, depooled', diff saved to https://phabricator.wikimedia.org/P11480 and previous config saved to /var/cache/conftool/dbconfig/20200612-095855-marostegui.json [09:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:34] !log jmm@cumin1001 START - Cookbook sre.hosts.downtime [10:01:36] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single [10:01:36] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [10:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:41] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:20] !log filippo@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [10:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:07] (03PS1) 10Dzahn: mediawiki/php: use new data type for PHP version [puppet] - 10https://gerrit.wikimedia.org/r/605179 [10:10:31] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Disable transfer.py logging to systemd [puppet] - 10https://gerrit.wikimedia.org/r/602636 (owner: 10Jcrespo) [10:13:49] PROBLEM - Thanos compact has not run on icinga1001 is CRITICAL: 4.422e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts 
https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [10:15:06] (03PS1) 10Jbond: profile::archiva::proxy: Use ip addresses instead of localhost [puppet] - 10https://gerrit.wikimedia.org/r/605183 [10:16:58] (03PS2) 10Jbond: profile::archiva::proxy: Use ip addresses instead of localhost [puppet] - 10https://gerrit.wikimedia.org/r/605183 [10:19:36] (03PS3) 10Jbond: profile::archiva::proxy: Use ip addresses instead of localhost [puppet] - 10https://gerrit.wikimedia.org/r/605183 [10:19:47] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/605183 (owner: 10Jbond) [10:21:31] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [10:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:27] (03PS3) 10Volans: scripts: complete interface automation generation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/601877 (https://phabricator.wikimedia.org/T233183) [10:25:32] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020): CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10Elitre) >>! In T244808#6196490, @RLazarus wrote: > Thanks for checking -- not sure yet, but as we're planning out Q1 on our side too, I'm starting to take... [10:26:31] (03CR) 10Volans: "Update according to our IRC chat. 
It can be tested on af-netbox, I've put there two scripts to test both scenarios:" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/601877 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [10:29:41] (03PS1) 10Elukey: archiva: assign archiva-new.wikimedia.org to archiva1002 [puppet] - 10https://gerrit.wikimedia.org/r/605186 (https://phabricator.wikimedia.org/T252767) [10:33:07] !log rolling restart of the ulsfo ganeti cluster [10:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:42] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/23190/archiva1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/605186 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [10:36:46] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single [10:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:49] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:36] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (10akosiaris) [10:43:37] PROBLEM - Check systemd state on ganeti4003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:44:24] (03PS2) 10Arturo Borrero Gonzalez: wmcs: kubeadm: haproxy: introduce support for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/604789 (https://phabricator.wikimedia.org/T195217) [10:45:31] (03CR) 10jerkins-bot: [V: 04-1] wmcs: kubeadm: haproxy: introduce support for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/604789 (https://phabricator.wikimedia.org/T195217) (owner: 10Arturo Borrero Gonzalez) [10:47:28] (03PS3) 10Arturo Borrero Gonzalez: wmcs: kubeadm: haproxy: introduce support for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/604789 (https://phabricator.wikimedia.org/T195217) [10:51:59] (03PS1) 10Elukey: Add archiva-new configuration for Acme Chief [puppet] - 10https://gerrit.wikimedia.org/r/605187 (https://phabricator.wikimedia.org/T252767) [10:52:30] vgutierrez: --^ (if you have a moment) [10:53:05] (03PS1) 10Kormat: mariadb: Detect lag spikes [puppet] - 10https://gerrit.wikimedia.org/r/605188 (https://phabricator.wikimedia.org/T253120) [10:54:03] (03CR) 10Muehlenhoff: profile::archiva::proxy: Use ip addresses instead of localhost (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605183 (owner: 10Jbond) [10:54:45] (03CR) 10Vgutierrez: [C: 03+1] Add archiva-new configuration for Acme Chief [puppet] - 10https://gerrit.wikimedia.org/r/605187 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [10:54:54] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/605176 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn) [10:56:05] RECOVERY - Check systemd state on ganeti4003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:57:54] (03CR) 10Elukey: profile::archiva::proxy: Use ip addresses instead of localhost (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605183 (owner: 10Jbond) [10:58:03] !log jmm@cumin1001 START 
- Cookbook sre.hosts.reboot-single [10:58:04] (03CR) 10Elukey: [C: 03+2] Add archiva-new configuration for Acme Chief [puppet] - 10https://gerrit.wikimedia.org/r/605187 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [10:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:43] (03CR) 10Jbond: profile::archiva::proxy: Use ip addresses instead of localhost (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605183 (owner: 10Jbond) [11:02:08] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:41] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single [11:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:47] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:09] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605183 (owner: 10Jbond) [11:14:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db2080 and db2084 into s8 T253217', diff saved to https://phabricator.wikimedia.org/P11481 and previous config saved to /var/cache/conftool/dbconfig/20200612-111422-marostegui.json [11:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:26] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 [11:14:33] PROBLEM - HTTPS on archiva1002 is CRITICAL: SSL CRITICAL - failed to verify archiva.wikimedia.org against archiva-new.wikimedia.org https://wikitech.wikimedia.org/wiki/Analytics/Systems/Archiva [11:15:01] ah! 
of course --^ [11:15:04] !log failover ganeti master in ulsfo to ganeti4003 [11:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:19] (lunch but it'll get fixed asap) [11:15:43] (03PS1) 10Marostegui: db2084: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/605193 (https://phabricator.wikimedia.org/T253217) [11:16:14] (03PS1) 10Gilles: Add monitoring for ASO ranking report [puppet] - 10https://gerrit.wikimedia.org/r/605194 (https://phabricator.wikimedia.org/T255189) [11:17:20] (03CR) 10jerkins-bot: [V: 04-1] Add monitoring for ASO ranking report [puppet] - 10https://gerrit.wikimedia.org/r/605194 (https://phabricator.wikimedia.org/T255189) (owner: 10Gilles) [11:17:39] (03PS2) 10Gilles: Convert ASO ranking report into a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/605194 (https://phabricator.wikimedia.org/T255189) [11:18:45] (03PS3) 10Gilles: Convert ASO ranking report into a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/605194 (https://phabricator.wikimedia.org/T255189) [11:18:48] (03CR) 10jerkins-bot: [V: 04-1] Convert ASO ranking report into a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/605194 (https://phabricator.wikimedia.org/T255189) (owner: 10Gilles) [11:18:55] PROBLEM - ganeti-mond running on ganeti4002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:19:04] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single [11:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:05] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [11:22:03] (03CR) 10Marostegui: [C: 03+2] db2084: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/605193 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [11:23:09] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:25] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:26:09] PROBLEM - ganeti-mond running on ganeti4002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:27:13] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:28:53] the ganeti4002 alert is a monitoring blip, Icinga hasn't realised yet that 4002 is no longer the Ganeti master [11:32:23] RECOVERY - ganeti-mond running on ganeti4002 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [11:37:25] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:38:35] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:39:21] (03PS1) 10Arturo Borrero Gonzalez: openstack: keystone: add paws-dns-manager to the safelist [puppet] - 10https://gerrit.wikimedia.org/r/605198 (https://phabricator.wikimedia.org/T255252) [11:43:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: keystone: add paws-dns-manager to the safelist [puppet] - 10https://gerrit.wikimedia.org/r/605198 (https://phabricator.wikimedia.org/T255252) (owner: 10Arturo Borrero Gonzalez) [11:45:15] RECOVERY - Thanos compact has not run on icinga1001 is OK: (C)24 ge (W)12 ge 0.2031 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [11:48:57] (03PS1) 10Marostegui: install_server: Reimage db2099 [puppet] - 10https://gerrit.wikimedia.org/r/605199 (https://phabricator.wikimedia.org/T253217) [11:49:32] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db2099 [puppet] - 10https://gerrit.wikimedia.org/r/605199 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [11:52:28] !log filippo@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [11:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:23] (03PS2) 10Kormat: mariadb: Detect lag spikes [puppet] - 10https://gerrit.wikimedia.org/r/605188 (https://phabricator.wikimedia.org/T253120) [12:01:13] (03PS3) 10Kormat: mariadb: Detect lag spikes [puppet] - 10https://gerrit.wikimedia.org/r/605188 (https://phabricator.wikimedia.org/T253120) [12:02:23] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Detect lag spikes [puppet] - 10https://gerrit.wikimedia.org/r/605188 (https://phabricator.wikimedia.org/T253120) (owner: 10Kormat) [12:04:49] (03PS6) 10Privacybatm: transferpy: Remove wmfmariadbpy package [software/transferpy] - 10https://gerrit.wikimedia.org/r/602618 
(https://phabricator.wikimedia.org/T248256) [12:07:05] (03PS1) 10Elukey: profile::archiva::proxy: use certificate_name for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/605203 (https://phabricator.wikimedia.org/T252767) [12:08:54] 10Operations, 10DBA, 10Sustainability (Incident Prevention): Batch db1074-db1079 hosts having BBU issues - https://phabricator.wikimedia.org/T233569 (10Marostegui) 05Open→03Declined Declining as these hosts will be refreshed next FY [12:08:57] 10Operations, 10ops-eqiad, 10DBA, 10Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (10Marostegui) [12:09:02] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Rack and Initial setup db1074-79 - https://phabricator.wikimedia.org/T128753 (10Marostegui) [12:10:33] (03CR) 10Privacybatm: "> Patch Set 5:" (033 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/602618 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [12:12:20] (03CR) 10Ayounsi: [C: 03+2] "No diff locally." [homer/public] - 10https://gerrit.wikimedia.org/r/592246 (owner: 10Ayounsi) [12:12:39] (03PS4) 10Kormat: mariadb: Detect lag spikes [puppet] - 10https://gerrit.wikimedia.org/r/605188 (https://phabricator.wikimedia.org/T253120) [12:12:41] PROBLEM - Check systemd state on thanos-be1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:12:43] PROBLEM - Host thanos-be2004 is DOWN: PING CRITICAL - Packet loss = 100% [12:13:01] PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:13:02] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/23196/" [puppet] - 10https://gerrit.wikimedia.org/r/605203 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [12:13:41] PROBLEM - Check systemd state on thanos-be2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:14:12] PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:14:13] PROBLEM - Check systemd state on thanos-be1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:14:53] RECOVERY - Host thanos-be2004 is UP: PING OK - Packet loss = 0%, RTA = 36.23 ms [12:15:23] PROBLEM - Host thanos-be2002 is DOWN: PING CRITICAL - Packet loss = 100% [12:16:04] (03CR) 10Elukey: "Looks very good! Some comments to double check some doubts." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/605194 (https://phabricator.wikimedia.org/T255189) (owner: 10Gilles) [12:16:31] PROBLEM - Check systemd state on thanos-be1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:12] (03PS4) 10Gilles: Convert ASO ranking report into a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/605194 (https://phabricator.wikimedia.org/T255189) [12:18:15] (03CR) 10Gilles: Convert ASO ranking report into a systemd timer (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/605194 (https://phabricator.wikimedia.org/T255189) (owner: 10Gilles) [12:18:19] godog: ^ [12:18:21] RECOVERY - Host thanos-be2002 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms [12:18:36] PROBLEM - Check systemd state on thanos-be2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:54] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [12:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:15] XioNoX: thanks, side effect of roll restart and expired downtime I think [12:19:30] will fix, FWIW none of it is in production yet [12:21:39] RECOVERY - Check systemd state on thanos-be2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:45] RECOVERY - Check systemd state on thanos-be2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:53] RECOVERY - Check systemd state on thanos-be1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:22:03] RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:23:43] (03CR) 10Elukey: Convert ASO ranking report into a systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605194 (https://phabricator.wikimedia.org/T255189) 
(owner: 10Gilles) [12:24:29] RECOVERY - Check systemd state on thanos-be1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:25:45] (03PS4) 10Arturo Borrero Gonzalez: wmcs: kubeadm: haproxy: introduce support for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/604789 (https://phabricator.wikimedia.org/T195217) [12:31:03] (03PS5) 10Arturo Borrero Gonzalez: wmcs: kubeadm: haproxy: introduce support for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/604789 (https://phabricator.wikimedia.org/T195217) [12:31:59] (03PS5) 10Gilles: Convert ASO ranking report into a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/605194 (https://phabricator.wikimedia.org/T255189) [12:32:04] (03CR) 10Gilles: Convert ASO ranking report into a systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605194 (https://phabricator.wikimedia.org/T255189) (owner: 10Gilles) [12:32:38] (03PS1) 10Kormat: mariadb: Fix mariadb::instance comment [puppet] - 10https://gerrit.wikimedia.org/r/605206 [12:33:11] RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:35:51] (03PS6) 10Arturo Borrero Gonzalez: wmcs: kubeadm: haproxy: introduce support for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/604789 (https://phabricator.wikimedia.org/T195217) [12:37:45] (03CR) 10Marostegui: [C: 03+1] mariadb: Fix mariadb::instance comment [puppet] - 10https://gerrit.wikimedia.org/r/605206 (owner: 10Kormat) [12:38:45] (03PS7) 10Arturo Borrero Gonzalez: wmcs: kubeadm: haproxy: introduce support for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/604789 (https://phabricator.wikimedia.org/T195217) [12:38:47] (03CR) 10Kormat: [C: 03+2] mariadb: Fix mariadb::instance comment [puppet] - 10https://gerrit.wikimedia.org/r/605206 (owner: 10Kormat) [12:39:15] 
marostegui: i'll take your +1 as an admission of guilt ;) [12:40:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1088 for schema change', diff saved to https://phabricator.wikimedia.org/P11482 and previous config saved to /var/cache/conftool/dbconfig/20200612-124015-marostegui.json [12:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:52] (03PS1) 10Cmjohnson: Removing asset dns entries for decom'd mw1221-1258 [dns] - 10https://gerrit.wikimedia.org/r/605209 (https://phabricator.wikimedia.org/T253856) [12:41:00] (03CR) 10jerkins-bot: [V: 04-1] Removing asset dns entries for decom'd mw1221-1258 [dns] - 10https://gerrit.wikimedia.org/r/605209 (https://phabricator.wikimedia.org/T253856) (owner: 10Cmjohnson) [12:41:07] (03PS8) 10Arturo Borrero Gonzalez: wmcs: kubeadm: haproxy: introduce support for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/604789 (https://phabricator.wikimedia.org/T195217) [12:43:34] (03PS2) 10Cmjohnson: Removing asset dns entries for decom'd mw1221-1258 [dns] - 10https://gerrit.wikimedia.org/r/605209 (https://phabricator.wikimedia.org/T253856) [12:44:15] RECOVERY - Check systemd state on thanos-be1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:44:23] (03CR) 10Cmjohnson: [C: 03+2] Removing asset dns entries for decom'd mw1221-1258 [dns] - 10https://gerrit.wikimedia.org/r/605209 (https://phabricator.wikimedia.org/T253856) (owner: 10Cmjohnson) [12:47:45] (03PS9) 10Arturo Borrero Gonzalez: wmcs: kubeadm: haproxy: introduce support for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/604789 (https://phabricator.wikimedia.org/T195217) [12:48:09] RECOVERY - HTTPS on archiva1002 is OK: SSL OK - Certificate archiva-new.wikimedia.org valid until 2020-09-10 10:00:20 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Analytics/Systems/Archiva [12:49:28] 10Operations, 10ops-eqiad, 10DC-Ops, 
10serviceops, 10Patch-For-Review: decom 36 old appservers in eqiad (onsite, dcops) - https://phabricator.wikimedia.org/T253856 (10Cmjohnson) [12:50:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: decom 36 old appservers in eqiad (onsite, dcops) - https://phabricator.wikimedia.org/T253856 (10Cmjohnson) [12:50:32] (03CR) 10Elukey: "puppet compiler happy, last thing that I noticed is the log dir, let me know your thoughts.. (sorry for the back and forth but I didn't se" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605194 (https://phabricator.wikimedia.org/T255189) (owner: 10Gilles) [12:50:35] (03PS1) 10Hnowlan: changeprop: Bump memory limits for pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/605213 [12:50:49] 10Operations, 10serviceops: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10Cmjohnson) [12:50:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: decom 36 old appservers in eqiad (onsite, dcops) - https://phabricator.wikimedia.org/T253856 (10Cmjohnson) 05Open→03Resolved removed from networks switch, all dns entries removed, removed from rack and netbox has been updated to o... [12:50:54] 10Operations, 10ops-eqiad, 10DC-Ops: Audit down ports - https://phabricator.wikimedia.org/T218751 (10Cmjohnson) [12:52:04] (03PS10) 10Arturo Borrero Gonzalez: wmcs: kubeadm: haproxy: introduce support for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/604789 (https://phabricator.wikimedia.org/T195217) [12:53:45] (03PS11) 10Arturo Borrero Gonzalez: wmcs: kubeadm: haproxy: introduce support for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/604789 (https://phabricator.wikimedia.org/T195217) [12:54:22] 10Operations, 10ops-eqiad: Degraded RAID on restbase-dev1004 - https://phabricator.wikimedia.org/T253607 (10Cmjohnson) @wiki_willy I completely forgot but restbases have ssds that were purchased separately from the servers. 
I believe this is the task of the original purchase. T158795. Also, this is a past... [12:56:24] (03PS12) 10Arturo Borrero Gonzalez: wmcs: kubeadm: haproxy: introduce support for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/604789 (https://phabricator.wikimedia.org/T195217) [12:57:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC noop for Toolforge: https://puppet-compiler.wmflabs.org/compiler1003/23209/" [puppet] - 10https://gerrit.wikimedia.org/r/604789 (https://phabricator.wikimedia.org/T195217) (owner: 10Arturo Borrero Gonzalez) [13:11:37] (03PS1) 10Arturo Borrero Gonzalez: paws: haproxy: fix some small issues [puppet] - 10https://gerrit.wikimedia.org/r/605217 (https://phabricator.wikimedia.org/T195217) [13:12:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1088 after schema change', diff saved to https://phabricator.wikimedia.org/P11483 and previous config saved to /var/cache/conftool/dbconfig/20200612-131205-marostegui.json [13:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] paws: haproxy: fix some small issues [puppet] - 10https://gerrit.wikimedia.org/r/605217 (https://phabricator.wikimedia.org/T195217) (owner: 10Arturo Borrero Gonzalez) [13:18:13] (03PS1) 10Ottomata: otto .bashrc - prompt for kerberos login on kerberos host if no kerberos ticket is active [puppet] - 10https://gerrit.wikimedia.org/r/605219 [13:18:38] 10Operations, 10vm-requests, 10Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10MoritzMuehlenhoff) @Dzahn: @akosiaris configured public interfaces on the ganeti hosts and after the Ganeti clusters are rebooted (which I'm currently han... 
[13:19:01] (03Abandoned) 10Vgutierrez: ATS: Increase to 2 the number of accept_threads in cp3052 and cp3056 [puppet] - 10https://gerrit.wikimedia.org/r/587516 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [13:19:03] (03CR) 10jerkins-bot: [V: 04-1] otto .bashrc - prompt for kerberos login on kerberos host if no kerberos ticket is active [puppet] - 10https://gerrit.wikimedia.org/r/605219 (owner: 10Ottomata) [13:20:36] (03PS6) 10Gilles: Convert ASO ranking report into a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/605194 (https://phabricator.wikimedia.org/T255189) [13:20:40] (03CR) 10Gilles: Convert ASO ranking report into a systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605194 (https://phabricator.wikimedia.org/T255189) (owner: 10Gilles) [13:22:18] (03PS1) 10Alexandros Kosiaris: Changeprop: Bump CPU limits by 100% [deployment-charts] - 10https://gerrit.wikimedia.org/r/605220 [13:24:38] (03PS2) 10Alexandros Kosiaris: Changeprop: Bump CPU limits by 100% [deployment-charts] - 10https://gerrit.wikimedia.org/r/605220 [13:24:44] (03PS2) 10Ottomata: otto .bashrc - prompt for kerberos login on kerberos host if no kerberos ticket is active [puppet] - 10https://gerrit.wikimedia.org/r/605219 [13:25:26] (03CR) 10Hnowlan: [C: 03+1] Changeprop: Bump CPU limits by 100% [deployment-charts] - 10https://gerrit.wikimedia.org/r/605220 (owner: 10Alexandros Kosiaris) [13:25:31] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Changeprop: Bump CPU limits by 100% [deployment-charts] - 10https://gerrit.wikimedia.org/r/605220 (owner: 10Alexandros Kosiaris) [13:25:33] (03CR) 10jerkins-bot: [V: 04-1] otto .bashrc - prompt for kerberos login on kerberos host if no kerberos ticket is active [puppet] - 10https://gerrit.wikimedia.org/r/605219 (owner: 10Ottomata) [13:25:41] (03PS3) 10Ottomata: otto - prompt for kerberos login on kerberos host if no kerberos ticket is active [puppet] - 10https://gerrit.wikimedia.org/r/605219 [13:26:28] (03CR) 
10jerkins-bot: [V: 04-1] otto - prompt for kerberos login on kerberos host if no kerberos ticket is active [puppet] - 10https://gerrit.wikimedia.org/r/605219 (owner: 10Ottomata) [13:31:44] (03CR) 10Alexandros Kosiaris: [C: 03+2] Changeprop: Bump CPU limits by 100% [deployment-charts] - 10https://gerrit.wikimedia.org/r/605220 (owner: 10Alexandros Kosiaris) [13:32:11] (03Merged) 10jenkins-bot: Changeprop: Bump CPU limits by 100% [deployment-charts] - 10https://gerrit.wikimedia.org/r/605220 (owner: 10Alexandros Kosiaris) [13:33:32] (03PS4) 10Ottomata: otto - prompt for kerberos login on kerberos host [puppet] - 10https://gerrit.wikimedia.org/r/605219 [13:34:08] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . [13:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:23] !log update changeprop in eqiad+codfw for higher CPU limits [13:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:50] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' . [13:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:05] 10Operations, 10Performance-Team, 10Thumbor, 10serviceops, 10Patch-For-Review: Build python-thumbor-wikimedia 2.9 Debian package and upload to apt.wikimedia.org - https://phabricator.wikimedia.org/T254845 (10Gilles) [13:39:06] (03CR) 1020after4: "The admin port probably doesn't need https termination since those connections are coming from localhost." 
[puppet] - 10https://gerrit.wikimedia.org/r/603895 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [13:39:48] (03CR) 10Ottomata: [C: 03+2] otto - prompt for kerberos login on kerberos host [puppet] - 10https://gerrit.wikimedia.org/r/605219 (owner: 10Ottomata) [13:40:38] (03CR) 1020after4: phabricator: add envoy TLS terminator for aphlict (DO NOT MERGE) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/603895 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [13:41:15] (03CR) 10Reedy: private: Add documentation for PrivateSettings.php (0312 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605005 (owner: 10Krinkle) [13:50:59] (03PS1) 10Elukey: Add AAAA/PTR records for kafka-jumbo100[7-9] [dns] - 10https://gerrit.wikimedia.org/r/605225 (https://phabricator.wikimedia.org/T252675) [13:51:59] (03CR) 10Elukey: [C: 03+1] profile::archiva::proxy: Use ip addresses instead of localhost [puppet] - 10https://gerrit.wikimedia.org/r/605183 (owner: 10Jbond) [13:52:28] (03CR) 10Jbond: [C: 03+2] profile::archiva::proxy: Use ip addresses instead of localhost [puppet] - 10https://gerrit.wikimedia.org/r/605183 (owner: 10Jbond) [13:52:39] (03PS4) 10Jbond: profile::archiva::proxy: Use ip addresses instead of localhost [puppet] - 10https://gerrit.wikimedia.org/r/605183 [13:53:08] (03CR) 10Ottomata: [C: 03+1] Add AAAA/PTR records for kafka-jumbo100[7-9] [dns] - 10https://gerrit.wikimedia.org/r/605225 (https://phabricator.wikimedia.org/T252675) (owner: 10Elukey) [13:56:50] (03CR) 10Elukey: [C: 03+2] Add AAAA/PTR records for kafka-jumbo100[7-9] [dns] - 10https://gerrit.wikimedia.org/r/605225 (https://phabricator.wikimedia.org/T252675) (owner: 10Elukey) [14:00:20] (03PS5) 10Kormat: [WIP] mariadb: Refactor puppetness. 
[puppet] - 10https://gerrit.wikimedia.org/r/605188 [14:02:56] (03PS1) 10DCausse: [wdqs] add an option to skolemize blank nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/605231 [14:04:13] (03PS2) 10DCausse: [wdqs] add an option to skolemize blank nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/605231 [14:04:38] (03PS1) 10Alexandros Kosiaris: kubernetes: Add more nodes to the 2 clusters [puppet] - 10https://gerrit.wikimedia.org/r/605233 (https://phabricator.wikimedia.org/T241850) [14:06:15] (03CR) 10jerkins-bot: [V: 04-1] [wdqs] add an option to skolemize blank nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/605231 (owner: 10DCausse) [14:09:53] (03PS3) 10DCausse: [wdqs] add an option to skolemize blank nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/605231 [14:12:12] (03CR) 10jerkins-bot: [V: 04-1] [wdqs] add an option to skolemize blank nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/605231 (owner: 10DCausse) [14:15:52] (03PS4) 10DCausse: [wdqs] add an option to skolemize blank nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/605231 [14:17:48] (03PS1) 10Vgutierrez: acme_chief,x509: Provide .crt.key file support [software/acme-chief] - 10https://gerrit.wikimedia.org/r/605237 (https://phabricator.wikimedia.org/T255249) [14:18:42] (03CR) 10jerkins-bot: [V: 04-1] [wdqs] add an option to skolemize blank nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/605231 (owner: 10DCausse) [14:20:15] (03PS5) 10DCausse: [wdqs] add an option to skolemize blank nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/605231 [14:21:33] (03CR) 10jerkins-bot: [V: 04-1] acme_chief,x509: Provide .crt.key file support [software/acme-chief] - 10https://gerrit.wikimedia.org/r/605237 (https://phabricator.wikimedia.org/T255249) (owner: 10Vgutierrez) [14:23:50] (03PS6) 10Kormat: [WIP] mariadb: Refactor puppetness. 
[puppet] - 10https://gerrit.wikimedia.org/r/605188 [14:24:38] (03PS1) 10Alexandros Kosiaris: Changeprop: Bump CPU limits by 50% more [deployment-charts] - 10https://gerrit.wikimedia.org/r/605239 [14:26:29] (03PS1) 10Muehlenhoff: package_builder: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/605240 [14:28:46] (03CR) 10Alexandros Kosiaris: [C: 03+2] Changeprop: Bump CPU limits by 50% more [deployment-charts] - 10https://gerrit.wikimedia.org/r/605239 (owner: 10Alexandros Kosiaris) [14:29:18] (03Merged) 10jenkins-bot: Changeprop: Bump CPU limits by 50% more [deployment-charts] - 10https://gerrit.wikimedia.org/r/605239 (owner: 10Alexandros Kosiaris) [14:29:59] (03PS4) 10Jbond: facter: cpu_details add flags, model, bugs and family [puppet] - 10https://gerrit.wikimedia.org/r/604381 [14:30:40] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . [14:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:59] !log bump cpu limits for changeprop another 50% [14:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:41] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' . [14:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:29] (03PS7) 10Kormat: [WIP] mariadb: Refactor puppetness. [puppet] - 10https://gerrit.wikimedia.org/r/605188 [14:46:55] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/23212/stat1007.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/605194 (https://phabricator.wikimedia.org/T255189) (owner: 10Gilles) [14:47:42] 10Operations, 10Wikimedia-Logstash, 10observability: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 (10fgiunchedi) [14:49:32] gilles: deployed! Shall I do one test run? 
[14:50:20] elukey: sure, it should simply overwrite the May report I already generate manually [14:50:21] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:50:33] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:50:46] (03CR) 10Alexandros Kosiaris: [C: 03+2] kubernetes: Add more nodes to the 2 clusters [puppet] - 10https://gerrit.wikimedia.org/r/605233 (https://phabricator.wikimedia.org/T241850) (owner: 10Alexandros Kosiaris) [14:51:44] !log repool mw1384 as test [14:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:39] (03PS1) 10Papaul: site: Add alert2001 with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/605248 (https://phabricator.wikimedia.org/T255070) [14:54:36] (03CR) 10Papaul: [C: 03+2] site: Add alert2001 with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/605248 (https://phabricator.wikimedia.org/T255070) (owner: 10Papaul) [14:54:46] (03PS1) 10Elukey: statistics::performance: add syslog id to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/605249 [14:54:50] (03PS2) 10Papaul: site: Add alert2001 with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/605248 (https://phabricator.wikimedia.org/T255070) [14:54:54] (03CR) 10Papaul: [V: 03+2 C: 03+2] site: Add alert2001 with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/605248 (https://phabricator.wikimedia.org/T255070) (owner: 10Papaul) [14:55:28] the zayo transport is down [14:56:15] but I don't see any sign of maintenance or emails [14:56:49] ah no cdanis is of course already on top of it :) [14:57:05] goooood [14:57:25] elukey: I typed 'og' into Gmail's search bar and it offered me the autocomplete 'ogyx/120003' [14:58:31] cdanis: "I am 
pretty sure you are going to call again Zayo, lemme anticipate you" [14:59:02] (03CR) 10Alexandros Kosiaris: [C: 03+1] remove obsoleted/unused wmflib::service::lvs_icinga() [puppet] - 10https://gerrit.wikimedia.org/r/604149 (owner: 10CDanis) [14:59:16] (03CR) 10Elukey: [C: 03+2] statistics::performance: add syslog id to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/605249 (owner: 10Elukey) [15:01:02] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By:TBD) rack/setup/install alert2001 - https://phabricator.wikimedia.org/T255070 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` alert2001.codfw.wmnet ` The log can be found in `/var/l... [15:09:18] 10Operations, 10ops-eqiad, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on cumin1001.eqiad.wmnet for hosts: ` ['kubernetes1007.eqiad.wm... 
[15:13:46] (03PS1) 10Andrew-WMDE: TwoColConflict: Talk page small deployment InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605253 (https://phabricator.wikimedia.org/T254458) [15:14:32] (03PS2) 10Vgutierrez: acme_chief,x509: Provide .crt.key file support [software/acme-chief] - 10https://gerrit.wikimedia.org/r/605237 (https://phabricator.wikimedia.org/T255249) [15:14:34] (03PS1) 10Vgutierrez: api: Allow acme-chief clients to fetch .crt.key files [software/acme-chief] - 10https://gerrit.wikimedia.org/r/605254 (https://phabricator.wikimedia.org/T255249) [15:16:02] (03CR) 10Cwhite: [C: 03+2] profile::base: add hardware_monitoring option and set for out-of-warranty nodes [puppet] - 10https://gerrit.wikimedia.org/r/602751 (owner: 10Cwhite) [15:17:46] (03PS1) 10Andrew-WMDE: TwoColConflict: Talk page small deployment CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605255 (https://phabricator.wikimedia.org/T254458) [15:18:26] (03CR) 10Vgutierrez: "This change is ready for review." 
[software/acme-chief] - 10https://gerrit.wikimedia.org/r/605237 (https://phabricator.wikimedia.org/T255249) (owner: 10Vgutierrez) [15:19:59] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:20:23] (03CR) 10Jbond: "sorry didn't hit send on this, thanks for the review just waiting for the last few fixes to get merged before merging this" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/602693 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [15:21:16] 10Operations, 10ops-codfw, 10DC-Ops: (Need By:TBD) rack/setup/install alert2001 - https://phabricator.wikimedia.org/T255070 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['alert2001.codfw.wmnet'] ` Of which those **FAILED**: ` ['alert2001.codfw.wmnet'] ` [15:21:27] (03CR) 10Hnowlan: [C: 03+2] changeprop: Bump memory limits for pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/605213 (owner: 10Hnowlan) [15:21:36] (03CR) 10jerkins-bot: [V: 04-1] changeprop: Bump memory limits for pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/605213 (owner: 10Hnowlan) [15:21:59] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:22:09] 10Operations, 10ops-codfw, 10DC-Ops: (Need By:TBD) rack/setup/install alert2001 - https://phabricator.wikimedia.org/T255070 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` alert2001.wikimedia.org ` The log can be found in `/var/log/wmf-auto-reimage/... 
[15:22:14] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [15:22:15] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [15:22:15] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [15:22:16] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [15:22:16] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [15:22:16] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [15:22:16] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [15:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:17] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:22:18] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:31] (03PS2) 10Hnowlan: changeprop: Bump memory limits for pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/605213 [15:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:23] (03CR) 10Hnowlan: [C: 03+2] changeprop: Bump memory limits for pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/605213 (owner: 10Hnowlan) [15:23:49] (03Merged) 10jenkins-bot: changeprop: Bump memory limits for pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/605213 (owner: 10Hnowlan) [15:24:46] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) [15:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:49] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:24:49] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:24:49] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:10] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . [15:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:25] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_citoid_cluster_codfw,swagger_check_citoid_cluster_eqiad} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:26:33] PROBLEM - Host kubernetes1009 is DOWN: PING CRITICAL - Packet loss = 100% [15:26:33] PROBLEM - Host kubernetes1011 is DOWN: PING CRITICAL - Packet loss = 100% [15:27:14] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:37] 10Operations, 10ops-codfw, 10DC-Ops: Decomission oresrdb2002.codfw.wmnet - https://phabricator.wikimedia.org/T254240 (10Papaul) ` [edit interfaces interface-range vlan-private1-c-codfw] - member ge-5/0/23; [edit interfaces interface-range disabled] member ge-3/0/39 { ... } + member ge-5/0/23; [edi... [15:27:46] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:27:56] haha those kubernetes host down alerts set off my adrenaline from yesterday but I see what's going on [15:28:34] PROBLEM - Host kubernetes1008 is DOWN: PING CRITICAL - Packet loss = 100% [15:29:16] RECOVERY - Host kubernetes1011 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [15:29:16] RECOVERY - Host kubernetes1009 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [15:29:32] RECOVERY - Host kubernetes1008 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [15:29:46] 10Operations, 10ops-codfw, 10serviceops: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on cumin1001.eqiad.wmnet for hosts: ` ['kubernetes... [15:31:11] (03PS1) 10Jbond: alytics::refinery::job: build simple script inline in puppet [puppet] - 10https://gerrit.wikimedia.org/r/605261 (https://phabricator.wikimedia.org/T254480) [15:31:13] 10Operations, 10ops-codfw, 10DC-Ops: Decomission oresrdb2002.codfw.wmnet - https://phabricator.wikimedia.org/T254240 (10Papaul) [15:33:07] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1014.eqiad.wmnet'] ` Of which those **FAILED**: ` ['kubernetes1014.eqiad.wmnet'] ` [15:36:28] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/605261 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [15:36:59] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . 
[15:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:33] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [15:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:19] (03PS1) 10Jbond: puppetmaster::gitclone: build single line script inline [puppet] - 10https://gerrit.wikimedia.org/r/605262 (https://phabricator.wikimedia.org/T254480) [15:39:49] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/605262 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [15:40:08] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:19] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' . [15:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:19] (03CR) 10CDanis: [C: 03+1] puppetmaster::gitclone: build single line script inline (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605262 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [15:42:18] (03PS2) 10Jbond: alytics::refinery::job: build simple script inline in puppet [puppet] - 10https://gerrit.wikimedia.org/r/605261 (https://phabricator.wikimedia.org/T254480) [15:42:35] (03CR) 10CDanis: [C: 03+2] remove obsoleted/unused wmflib::service::lvs_icinga() [puppet] - 10https://gerrit.wikimedia.org/r/604149 (owner: 10CDanis) [15:43:52] (03PS2) 10Jbond: puppetmaster::gitclone: build single line script inline [puppet] - 10https://gerrit.wikimedia.org/r/605262 (https://phabricator.wikimedia.org/T254480) [15:43:54] (03CR) 10CDanis: [C: 03+1] swift: remove swift-container-sharder unit [puppet] - 10https://gerrit.wikimedia.org/r/604623 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [15:44:04] (03CR) 10Jbond: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/605261 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [15:44:16] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [15:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:19] (03CR) 10BryanDavis: [C: 03+1] cloud NFS: remove the nfsiostat diammond collector [puppet] - 10https://gerrit.wikimedia.org/r/604931 (https://phabricator.wikimedia.org/T210993) (owner: 10Bstorm) [15:44:21] (03CR) 10Jbond: "thanks for the quick review :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605262 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [15:44:29] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/605262 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [15:46:21] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/604381 (owner: 10Jbond) [15:46:56] (03CR) 10Jbond: [C: 03+2] puppetmaster::gitclone: build single line script inline [puppet] - 10https://gerrit.wikimedia.org/r/605262 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [15:47:59] (03CR) 10Bstorm: [C: 03+2] cloud NFS: remove the nfsiostat diammond collector [puppet] - 10https://gerrit.wikimedia.org/r/604931 (https://phabricator.wikimedia.org/T210993) (owner: 10Bstorm) [15:48:43] (03CR) 10Jbond: [C: 03+2] facter: cpu_details add flags, model, bugs and family [puppet] - 10https://gerrit.wikimedia.org/r/604381 (owner: 10Jbond) [15:49:11] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:12] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (10Bstorm) [16:01:29] (03PS1) 10Jbond: aptrepo: build single line shell script inline [puppet] - 
10https://gerrit.wikimedia.org/r/605267 (https://phabricator.wikimedia.org/T254480) [16:02:14] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/605267 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [16:07:25] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:11:44] (03PS1) 10Cwhite: hiera: disable hardware monitoring on analytics1049 and thumbor1004 [puppet] - 10https://gerrit.wikimedia.org/r/605270 [16:12:17] (03PS1) 10Jbond: profile::icinga: move single line scripts in line [puppet] - 10https://gerrit.wikimedia.org/r/605271 (https://phabricator.wikimedia.org/T254480) [16:12:31] 10Operations, 10ops-codfw, 10DC-Ops: (Need By:TBD) rack/setup/install alert2001 - https://phabricator.wikimedia.org/T255070 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['alert2001.wikimedia.org'] ` and were **ALL** successful. [16:12:43] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/605271 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [16:14:42] 10Operations, 10ops-codfw, 10DC-Ops: (Need By:TBD) rack/setup/install alert2001 - https://phabricator.wikimedia.org/T255070 (10Papaul) [16:15:13] 10Operations, 10ops-codfw, 10DC-Ops: (Need By:TBD) rack/setup/install alert2001 - https://phabricator.wikimedia.org/T255070 (10Papaul) 05Open→03Resolved @fgiunchedi server ready for service [16:15:43] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:16:57] (03CR) 1020after4: phabricator: add envoy TLS terminator for aphlict (DO NOT MERGE) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/603895 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [16:18:58] Amir1: o/ [16:19:03] are you around by any chance? [16:19:58] elukey: I'm always around [16:20:07] of course you are! :D [16:20:09] (03PS1) 10Jbond: varnish::logging: move default definitions inline [puppet] - 10https://gerrit.wikimedia.org/r/605272 (https://phabricator.wikimedia.org/T254480) [16:20:15] But sometimes you're square? ;-) [16:20:42] lol [16:20:44] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/605272 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [16:20:51] Amir1: I am investigating something weird, namely mw1384 having trouble with wikidata, logging errors afaics [16:21:08] see https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors [16:21:23] let me see [16:21:29] I tried to depool/pool it, and I see errors stopping/starting [16:22:08] oh, that's not good [16:22:10] I also tried to scap pull [16:22:39] so I wanted to know if it is something already seen or not, in case I'll just depool and open a task [16:23:22] I haven't seen it but let me investigate a little [16:23:31] if not, let's open a task [16:23:37] This sounds like fun [16:24:13] (03PS1) 10Jbond: phabricator: move template to file as no dynamic values [puppet] - 10https://gerrit.wikimedia.org/r/605274 (https://phabricator.wikimedia.org/T254480) [16:26:09] (03CR) 10Jbond: [C: 03+2] phabricator: move template to file as no dynamic values [puppet] - 10https://gerrit.wikimedia.org/r/605274 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [16:27:18] (03CR) 1020after4: [C: 03+1] "Why is the port 3120 in tls.yaml but 
it's 444 in backend.yaml? I don't see where 3120 comes from at all." [puppet] - 10https://gerrit.wikimedia.org/r/569104 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [16:29:19] elukey: since it's Friday, let's depool it and create a ticket [16:30:10] 10Operations, 10ops-codfw, 10serviceops: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes2013.codfw.wmnet', 'kubernetes2012.codfw.wmnet', 'kub... [16:30:49] we got report of all sorts of weird behaviour today [16:30:56] T255187 [16:30:56] T255187: Special:NewItem appears to have placeholder text heading "⧼Create a new Item⧽" - https://phabricator.wikimedia.org/T255187 [16:32:18] all right [16:32:37] https://usercontent.irccloud-cdn.com/file/8PRagDiV/image.png [16:32:39] !log depool again mw1348 - investigation will follow up in a task [16:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:00] !log (correct) depool again mw1384 - investigation will follow up in a task [16:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:48] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:33:49] Amir1: do you have time to open a task with the details? Otherwise I'll try to do it [16:34:05] I do it [16:34:10] thanks! [16:34:27] Thank you for reporting! 
[16:34:39] (03PS1) 10Jbond: rsync: move oneline script inline [puppet] - 10https://gerrit.wikimedia.org/r/605275 (https://phabricator.wikimedia.org/T254480) [16:35:12] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:35:22] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:37:53] (03PS1) 10Jbond: phabricator: move template to file as no dynamic values [puppet] - 10https://gerrit.wikimedia.org/r/605276 (https://phabricator.wikimedia.org/T254480) [16:38:18] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:39:38] Amir1: fyi i can't get onto mattermost any more to continue the conversation [16:40:15] 10Operations, 10Wikidata: mw1384 is misbehaving - https://phabricator.wikimedia.org/T255282 (10Ladsgroup) [16:40:29] :( [16:41:22] addshore: see the ticket, maybe caused by it? [16:41:36] it's mostly RDF though [16:41:42] (03CR) 10Jbond: [C: 03+2] phabricator: move template to file as no dynamic values [puppet] - 10https://gerrit.wikimedia.org/r/605276 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [16:41:46] (03PS1) 10Jbond: trafficserver::instance: move single line scripts inline [puppet] - 10https://gerrit.wikimedia.org/r/605279 (https://phabricator.wikimedia.org/T254480) [16:43:03] Amir1: yeah, i dont know what is going on there.... but it could all come down to the property info local cache? 
so maybe something is wrong with apc on that node? or something? [16:43:17] 10Operations, 10Wikidata: mw1384 is misbehaving - https://phabricator.wikimedia.org/T255282 (10Ladsgroup) [16:43:24] Amir1: could look and see what is in it in eval.php ? [16:43:42] yeah, you can login to that node directly [16:43:49] I sometimes do :P [16:44:04] * addshore is currently finishing off something else first (dinner) [16:44:49] 10Operations, 10Wikidata: mw1384 is misbehaving - https://phabricator.wikimedia.org/T255282 (10elukey) To add some details - I depooled/repooled doing `scap pull` and it didn't really work. I also depooled, waited hours, repooled and the issue gets back consistently. It may be a single host issue, but not real... [16:44:53] I doubt it, property info is really small [16:44:59] 10Operations, 10Wikidata, 10serviceops: mw1384 is misbehaving - https://phabricator.wikimedia.org/T255282 (10elukey) [16:45:00] (in APCu size) [16:45:30] and also, if it was a code-issue, it should show itself everywhere not just one node [16:45:35] yeah :/ [16:46:22] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:47:47] Amir1: I didn't try to restart php fpm though [16:48:08] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:48:51] hmm, we can try it [16:48:55] let's do it [16:49:54] !log restart php-fpm and pool mw1384 - T255282 [16:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:01] T255282: mw1384 is misbehaving - https://phabricator.wikimedia.org/T255282 [16:52:13] errors are not coming back [16:52:47] of course I forgot to dump the status of apc sigh [16:53:22] Amir1: I guess that APC was corrupted or something similar? [16:53:30] yeah, it can be [16:54:10] all right I am incline to close, if it rehappens I'll reopen [16:55:09] Yeh, I think apc being broken would cause this [16:55:20] 10Operations, 10Wikidata, 10serviceops: mw1384 is misbehaving - https://phabricator.wikimedia.org/T255282 (10elukey) 05Open→03Resolved a:03elukey Error seem to be gone after the php-fpm restart, but I forgot to dump the status of APC :( My bet is on APC corruption or something similar, will reopen in c... [16:55:33] I wonder if that error state / corruption can be detected in code? [16:56:07] 10Operations, 10Wikidata, 10serviceops: mw1384 is misbehaving - https://phabricator.wikimedia.org/T255282 (10Addshore) [16:56:10] there are alarms that SRE has IIRC, maybe this one was an outlier/corner-case [16:56:40] cool, well, im not going to think too much about it and instead go and eat an icecream :D [16:56:41] o/ [16:56:51] have a nice weekend folks! [16:58:29] Enjoy! 
[16:58:40] Thanks for fixing it elukey [17:13:03] (03CR) 10Bstorm: [C: 03+1] Disable tool name alias in lighttpd config with --canonical [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/603668 (https://phabricator.wikimedia.org/T254640) (owner: 10BryanDavis) [17:19:23] (03PS1) 10Awight: Remove unused field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605285 (https://phabricator.wikimedia.org/T255291) [17:23:05] 10Operations, 10ops-eqiad: Degraded RAID on restbase-dev1004 - https://phabricator.wikimedia.org/T253607 (10wiki_willy) [17:23:37] 10Operations, 10ops-eqiad: Degraded RAID on restbase-dev1004 - https://phabricator.wikimedia.org/T253607 (10wiki_willy) Thanks @Cmjohnson - T255293 created for ordering the new disk. Thanks, Willy [17:24:52] (03CR) 10Reedy: [C: 03+2] Remove unused field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605285 (https://phabricator.wikimedia.org/T255291) (owner: 10Awight) [17:25:37] (03Merged) 10jenkins-bot: Remove unused field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605285 (https://phabricator.wikimedia.org/T255291) (owner: 10Awight) [17:28:11] Amir1: elukey: opcache corruption causing 'impossible' error messages on one server is a not-uncommon thing, unfortunately [17:29:30] cdanis: thanks, is there a ticket for me to take a look? [17:30:27] Amir1: https://phabricator.wikimedia.org/T253673 I think there are some other tasks with particular examples referenced there [17:30:50] some past examples here: https://phabricator.wikimedia.org/T231089 https://phabricator.wikimedia.org/T232233 partial list [17:31:22] thanks! 
[17:31:33] Amir1: "php sucks" [17:31:58] Reedy: tell me something I don't know already :P [17:34:52] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [17:44:53] !log restarting logstash1011 elasticsearch instance [17:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:00] (03CR) 10Krinkle: private: Add documentation for PrivateSettings.php (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605005 (owner: 10Krinkle) [17:47:06] (03PS4) 10Krinkle: private: Add documentation for PrivateSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605005 [17:47:11] (03PS5) 10Krinkle: private: Add documentation for PrivateSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605005 [17:47:27] (03CR) 10Krinkle: [C: 03+2] "Won't do removals in this commit, but perhaps later :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605005 (owner: 10Krinkle) [17:48:23] (03Merged) 10jenkins-bot: private: Add documentation for PrivateSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605005 (owner: 10Krinkle) [18:03:48] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001:9501 job=burrow partition={2,3} site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logg [18:03:48] ic=All&var-consumer_group=All [18:04:04] RECOVERY - Rate of JVM GC Old generation-s runs - 
logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [18:04:15] (03CR) 10Volans: [V: 03+2 C: 03+2] Add support for buster in the build process [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/604715 (https://phabricator.wikimedia.org/T245114) (owner: 10Volans) [18:07:24] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [19:14:04] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 575 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:16:29] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Port exim statistics to Prometheus - https://phabricator.wikimedia.org/T179565 (10Cyrille37) Hi, Please, where is the source code of this exim4 to prometheus ? There are very few exim4 exporter findable on the net, perhaps yours is the greater... 
[19:22:36] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [19:31:28] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 575 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:56:40] (03PS1) 10Bstorm: cloud nfs: allow opt-in soft mounting wherever folks want to try it [puppet] - 10https://gerrit.wikimedia.org/r/605310 (https://phabricator.wikimedia.org/T127559) [20:16:36] PROBLEM - Ensure legal html en.m.wp on en.m.wikipedia.org is CRITICAL: a\shref=(https:)?\/\/foundation\.wikimedia\.org\/wiki\/Privacy_policy\sclass=extiw\stitle=wmf:Privacy\spolicyPrivacy/a html not found https://phabricator.wikimedia.org/project/members/28/ [20:26:21] SIGH [20:28:56] 10Operations, 10observability: "ensure legal html" footer monitoring turned CRIT - https://phabricator.wikimedia.org/T119456 (10CDanis) 05Resolved→03Open Happened again today, because the inner text of the mobile link was changed from "Privacy" to "Privacy policy". [20:29:40] (03PS1) 10CDanis: check_legal_html: update with trivial fix [puppet] - 10https://gerrit.wikimedia.org/r/605314 (https://phabricator.wikimedia.org/T119456) [20:35:33] (03CR) 10CDanis: [C: 03+2] check_legal_html: update with trivial fix [puppet] - 10https://gerrit.wikimedia.org/r/605314 (https://phabricator.wikimedia.org/T119456) (owner: 10CDanis) [20:39:17] RECOVERY - Ensure legal html en.m.wp on en.m.wikipedia.org is OK: all html is present. 
https://phabricator.wikimedia.org/project/members/28/ [20:40:13] 10Operations, 10observability, 10Patch-For-Review: "ensure legal html" footer monitoring turned CRIT - https://phabricator.wikimedia.org/T119456 (10CDanis) 05Open→03Resolved [20:40:46] (03PS7) 10Andrew Bogott: Initial module and profile for galera + mariadb [puppet] - 10https://gerrit.wikimedia.org/r/604856 (https://phabricator.wikimedia.org/T242455) [20:40:48] (03PS1) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [20:41:52] (03CR) 10BryanDavis: [C: 03+1] check_legal_html: update with trivial fix [puppet] - 10https://gerrit.wikimedia.org/r/605314 (https://phabricator.wikimedia.org/T119456) (owner: 10CDanis) [20:41:56] (03CR) 10jerkins-bot: [V: 04-1] Initial module and profile for galera + mariadb [puppet] - 10https://gerrit.wikimedia.org/r/604856 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [20:41:59] bd808: thanks! [20:42:08] (03CR) 10jerkins-bot: [V: 04-1] Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 (owner: 10Andrew Bogott) [20:43:16] cdanis: that check might be overly specific :) [20:43:28] bd808: I made the minimal patch because it needs so many things [20:44:35] cdanis: we should just replace that with an httpbb check, actually [20:45:06] * bd808 is reading the original task to try to understand the intent [20:45:10] huh that's weird it's getting pretty late on a Friday afternoon https://media.giphy.com/media/4pMX5rJ4PYAEM/giphy.gif [20:45:43] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:46:47] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within 
thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:01:55] (03PS8) 10Andrew Bogott: Initial module and profile for galera + mariadb [puppet] - 10https://gerrit.wikimedia.org/r/604856 (https://phabricator.wikimedia.org/T242455) [21:01:57] (03PS2) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [21:03:09] (03CR) 10jerkins-bot: [V: 04-1] Initial module and profile for galera + mariadb [puppet] - 10https://gerrit.wikimedia.org/r/604856 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [21:03:14] (03CR) 10jerkins-bot: [V: 04-1] Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 (owner: 10Andrew Bogott) [21:05:19] (03PS3) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [21:06:28] (03CR) 10jerkins-bot: [V: 04-1] Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 (owner: 10Andrew Bogott) [21:09:31] (03PS4) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [21:10:11] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/604856 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [21:10:53] (03CR) 10jerkins-bot: [V: 04-1] Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 (owner: 10Andrew Bogott) [21:12:25] 10Operations, 10Release-Engineering-Team-TODO, 10Scap, 10Release-Engineering-Team (Deployment services), and 2 others: scap's logstash_checker.py is blissfully unaware of any logstash indexing latency - https://phabricator.wikimedia.org/T255197 (10greg) [21:43:53] (03CR) 10Bstorm: cloud nfs: allow opt-in soft mounting wherever folks want to try it (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/605310 (https://phabricator.wikimedia.org/T127559) (owner: 10Bstorm) [22:03:51] (03PS9) 10Andrew Bogott: Initial module and profile for galera + mariadb [puppet] - 10https://gerrit.wikimedia.org/r/604856 (https://phabricator.wikimedia.org/T242455) [22:03:53] (03PS5) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [22:05:06] (03CR) 10jerkins-bot: [V: 04-1] Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 (owner: 10Andrew Bogott) [22:07:23] (03PS6) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [22:08:32] (03CR) 10jerkins-bot: [V: 04-1] Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 (owner: 10Andrew Bogott) [22:10:01] (03PS7) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [22:12:38] (03PS8) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [22:15:41] (03CR) 10Andrew Bogott: "Untested but interested in a second opinion about adding this additional icinga plugin (and if I'm doing it right)" [puppet] - 10https://gerrit.wikimedia.org/r/605315 (owner: 10Andrew Bogott) [22:40:14] 10Operations, 10Commons, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, and 5 others: GenerateFancyCaptchas.php crashes with "FormatJson.php: File not found in" after 1000 iterations - https://phabricator.wikimedia.org/T230245 (10Krinkle) What we know so far... * The fatal error happens consistently w... [22:40:15] ACKNOWLEDGEMENT - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP CDanis Zayo outage TTN-0004162940 - The acknowledgement expires at: 2020-06-13 02:39:49. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:40:15] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Zayo outage TTN-0004162940 - The acknowledgement expires at: 2020-06-13 02:39:49. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:45:47] (03PS9) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [22:45:48] (03PS1) 10Andrew Bogott: Galera: move behind haproxy [puppet] - 10https://gerrit.wikimedia.org/r/605336 (https://phabricator.wikimedia.org/T242455) [22:53:29] (03PS10) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [22:58:25] 10Operations, 10Commons, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, and 5 others: GenerateFancyCaptchas.php crashes with "FormatJson.php: File not found in" after 1000 iterations - https://phabricator.wikimedia.org/T230245 (10Platonides) I would try * throwing a `clearstatcache()` somewhere, in cas... 
[23:21:33] 10Operations, 10Research: Admin password reset request for a mailman list: research-wmf - https://phabricator.wikimedia.org/T255326 (10leila) [23:21:49] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001:9501 job=burrow partition={4,5} site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometh [23:21:49] er=logging-eqiad&var-topic=All&var-consumer_group=All [23:22:23] 10Operations, 10Cloud-VPS, 10DNS, 10Maps, and 2 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256 (10bd808) >>! In T161256#5070781, @TheDJ wrote: > FYI, I have configured [abc].tiles.wmflabs.org webhosts to redirect to htt... [23:27:19] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [23:34:13] (03PS1) 10Cwhite: service::docker: enhance volume support [puppet] - 10https://gerrit.wikimedia.org/r/605343 (https://phabricator.wikimedia.org/T222826) [23:48:25] (03PS1) 10BryanDavis: toolforge: remove legacy killgridjobs.sh script [puppet] - 10https://gerrit.wikimedia.org/r/605345 (https://phabricator.wikimedia.org/T157792) [23:49:54] (03CR) 10BryanDavis: "This is bastion only, so manual cleanup on the 3 nodes seemed easier than ensure absent and then cleaning that up later." [puppet] - 10https://gerrit.wikimedia.org/r/605345 (https://phabricator.wikimedia.org/T157792) (owner: 10BryanDavis)