[00:00:04] twentyafterfour: How many deployers does it take to do Phabricator update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200924T0000).
[00:04:27] PROBLEM - Disk space on an-master1002 is CRITICAL: DISK CRITICAL - free space: /srv 4759 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-master1002&var-datasource=eqiad+prometheus/ops
[00:07:41] Operations, MediaWiki-API, Traffic, Patch-For-Review: Varnish does not cache Action API responses when logged in - https://phabricator.wikimedia.org/T155314 (Tgr) Open→Declined Given we have a REST API now, which should probably be the preferred way to implement cached endpoints, and that...
[00:15:38] (PS1) Dzahn: add dc-ops admin group to new install servers, ensure services stopped [puppet] - https://gerrit.wikimedia.org/r/629493 (https://phabricator.wikimedia.org/T252526)
[00:19:46] (CR) Dzahn: [C: +2] CI profile: move ruamel requirement to publisher [puppet] - https://gerrit.wikimedia.org/r/629449 (https://phabricator.wikimedia.org/T255835) (owner: Jeena Huneidi)
[00:21:03] (CR) Dzahn: "Jeena: This will not remove the package on hosts using the ci::pipeline::builder class though (unless you ensure => absent them in puppet " [puppet] - https://gerrit.wikimedia.org/r/629449 (https://phabricator.wikimedia.org/T255835) (owner: Jeena Huneidi)
[00:25:17] (CR) Dzahn: diffscan: hiera->lookup, add data types (1 comment) [puppet] - https://gerrit.wikimedia.org/r/629427 (owner: Dzahn)
[00:25:30] (PS2) Dzahn: diffscan: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/629427
[00:26:00] (CR) Dzahn: [C: +2] add dc-ops admin group to new install servers, ensure services stopped [puppet] - https://gerrit.wikimedia.org/r/629493 (https://phabricator.wikimedia.org/T252526) (owner: Dzahn)
[00:28:43] (PS1) Dzahn: site: add
installserver::light role to new install servers [puppet] - https://gerrit.wikimedia.org/r/629495 (https://phabricator.wikimedia.org/T252526)
[00:35:52] (PS1) Dzahn: bastionhost::pop: remove tftp from bastions [puppet] - https://gerrit.wikimedia.org/r/629496
[00:37:16] (CR) Dzahn: [C: -2] "WIP" [puppet] - https://gerrit.wikimedia.org/r/629496 (owner: Dzahn)
[00:37:39] off
[00:40:17] longma: ran puppet on contint*. python3-ruamel is installed. cya
[00:40:59] thanks mutante!
[00:56:07] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f5f8cb2f4e0: Failed to establish a new connection: [Errno 111] Connection
[00:56:07] ://wikitech.wikimedia.org/wiki/Search%23Administration
[00:56:47] PROBLEM - Check systemd state on logstash1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:14:21] ^ Took a look at the above, shard recovery had hit the limit
[01:14:55] !log (before) `{"cluster_name":"production-elk7-codfw","status":"yellow","timed_out":false,"number_of_nodes":12,"number_of_data_nodes":7,"active_primary_shards":459,"active_shards":866,"relocating_shards":4,"initializing_shards":0,"unassigned_shards":2,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0
[01:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:15:20] 6:14 PM !log (before) `{"cluster_name":"production-elk7-codfw","status":"yellow","timed_out":false,"number_of_nodes":12,"number_of_data_nodes":7,"active_primary_shards":459,"active_shards":866,"relocating_shards":4,"initializing_shards":0,"unassigned_shards":2,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0`
[01:15:36] lol, one more time
[01:15:42] !log (before) `{"cluster_name":"production-elk7-codfw","status":"yellow","timed_out":false,"number_of_nodes":12,"number_of_data_nodes":7,"active_primary_shards":459,"active_shards":866,"relocating_shards":4,"initializing_shards":0,"unassigned_shards":2,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0`
[01:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:16:07] !log Ran `curl -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true'`, cluster status is green again
[01:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:16:42] !log (after)
`{"cluster_name":"production-elk7-codfw","status":"green","timed_out":false,"number_of_nodes":12,"number_of_data_nodes":7,"active_primary_shards":459,"active_shards":868,"relocating_shards":4,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0`
[01:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:17] The above was me on `codfw`, the critical is for `logstash1009` so I'll need to address that too
[01:19:58] !log Getting `connection refused` when trying to `curl -X GET 'http://localhost:9200/_cluster/health'` on `logstash1009`
[01:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:03] Somehow the elasticsearch service got kill -9'd? https://www.irccloud.com/pastebin/2r9Q0VTv/
[01:21:50] !log Observed that `elasticsearch_5@production-logstash-eqiad.service` is in a `failed` state since `Thu 2020-09-24 00:53:53 UTC`; appears the process received a SIGKILL - not sure why
[01:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:34] ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f10833b64e0: Failed to establish a new connection: [Errno 111] Co
[01:22:34] )) Ryan Kemper looking into it, somehow elasticsearch_5@production-logstash-eqiad.service is in a failed state, appears to have been sigkilled?
https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:23:09] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: cluster_name: production-logstash-eqiad, active_shards: 916, active_shards_percent_as_number: 100.0, status: green, unassigned_shards: 0, number_of_nodes: 6, number_of_in_flight_fetch: 0, active_primary_shards: 483, timed_out: False, number_of_data_nodes: 3, delayed_unassigned_shards: 0, relocating_
[01:23:09] _of_pending_tasks: 0, task_max_waiting_in_queue_millis: 0, initializing_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:23:51] RECOVERY - Check systemd state on logstash1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:24:04] The service came back online by itself, (I was about to restart it but hadn't yet). Interesting
[01:25:20] !log Root cause of sigkill of `elasticsearch_5@production-logstash-eqiad.service` appears to be OOMKill of the java process: `Killed process 1775 (java) total-vm:8016136kB, anon-rss:4888232kB, file-rss:0kB, shmem-rss:0kB`. Service appears to have restarted itself and is healthy again
[01:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:49:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:54:49] PROBLEM - Check systemd state on elastic2037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:56:21] RECOVERY - Check systemd state on elastic2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:02:57] PROBLEM - Host elastic2037 is DOWN: PING CRITICAL - Packet loss = 100%
[05:22:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2109 for MCR schema change', diff saved to https://phabricator.wikimedia.org/P12777 and previous config saved to /var/cache/conftool/dbconfig/20200924-052207-marostegui.json
[05:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:31] (PS1) Marostegui: instances.yaml: Remove es2012 and es2018 from dbctl [puppet] - https://gerrit.wikimedia.org/r/629502 (https://phabricator.wikimedia.org/T263615)
[05:28:39] (CR) Marostegui: [C: +2] instances.yaml: Remove es2012 and es2018 from dbctl [puppet] - https://gerrit.wikimedia.org/r/629502 (https://phabricator.wikimedia.org/T263615) (owner: Marostegui)
[05:30:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove es2012 and es2018 from dbctl - T263615 T263613', diff saved to https://phabricator.wikimedia.org/P12778 and previous config saved to /var/cache/conftool/dbconfig/20200924-053001-marostegui.json
[05:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:30:08] T263615: decommission es2018.codfw.wmnet - https://phabricator.wikimedia.org/T263615
[05:30:08] T263613: decommission es2012.codfw.wmnet - https://phabricator.wikimedia.org/T263613
[05:33:45] (PS1) Marostegui: mariadb: Remove es2012, es2018 from puppet [puppet] - https://gerrit.wikimedia.org/r/629503 (https://phabricator.wikimedia.org/T263615)
[05:37:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:08] !log marostegui@cumin1001 END (FAIL) - Cookbook
sre.hosts.decommission (exit_code=1)
[05:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:07] (PS1) Marostegui: dns: Remove es2012 DNS entries [dns] - https://gerrit.wikimedia.org/r/629504 (https://phabricator.wikimedia.org/T263613)
[05:45:30] (PS2) KartikMistry: ContentTranslation: Do not use wikishared DB for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/629371 (https://phabricator.wikimedia.org/T263417)
[05:50:54] (PS3) KartikMistry: ContentTranslation: Do not use wikishared DB for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/629371 (https://phabricator.wikimedia.org/T263417)
[05:57:32] !log Remove es2012 from tendril and zarcillo T263613
[05:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:39] T263613: decommission es2012.codfw.wmnet - https://phabricator.wikimedia.org/T263613
[06:11:00] Operations, vm-requests: eqiad: New Ganeti instance for Hue (an-tool1009) - https://phabricator.wikimedia.org/T258771 (elukey) Open→Resolved Yep thanks!
(Moritz took care of it)
[06:18:41] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:19:16] (CR) Volans: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/629504 (https://phabricator.wikimedia.org/T263613) (owner: Marostegui)
[06:20:08] (CR) Marostegui: [C: +2] dns: Remove es2012 DNS entries [dns] - https://gerrit.wikimedia.org/r/629504 (https://phabricator.wikimedia.org/T263613) (owner: Marostegui)
[06:21:35] Operations, ops-codfw, DC-Ops, decommission-hardware: decommission es2012.codfw.wmnet - https://phabricator.wikimedia.org/T263613 (Marostegui)
[06:21:44] Operations, ops-codfw, DC-Ops, decommission-hardware: decommission es2012.codfw.wmnet - https://phabricator.wikimedia.org/T263613 (Marostegui) Host ready for #dc-ops
[06:22:10] !log powercycle elastic2037 (host stuck, no mgmt serial console working, DIMM errors in racadm getsel)
[06:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:00] (PS1) Marostegui: dns: Remove es2014 from DNS [dns] - https://gerrit.wikimedia.org/r/629529 (https://phabricator.wikimedia.org/T261717)
[06:25:03] RECOVERY - Host elastic2037 is UP: PING OK - Packet loss = 0%, RTA = 30.19 ms
[06:25:17] Operations, ops-codfw: elastic2037 DIMM errors logged in racadm getsel - https://phabricator.wikimedia.org/T263714 (elukey)
[06:40:08] (CR) Volans: [C: +2] netbox: convert Icinga check in timer [puppet] - https://gerrit.wikimedia.org/r/629440 (https://phabricator.wikimedia.org/T258729) (owner: Volans)
[06:40:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Place db2073 into vslow, not api in s4', diff saved to https://phabricator.wikimedia.org/P12780 and previous config saved to /var/cache/conftool/dbconfig/20200924-064018-marostegui.json
[06:40:22] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:18] (CR) Volans: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/629529 (https://phabricator.wikimedia.org/T261717) (owner: Marostegui)
[06:41:29] (CR) Marostegui: [C: +2] dns: Remove es2014 from DNS [dns] - https://gerrit.wikimedia.org/r/629529 (https://phabricator.wikimedia.org/T261717) (owner: Marostegui)
[06:43:46] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/629427 (owner: Dzahn)
[06:45:15] (CR) Giuseppe Lavagetto: [C: +2] services: retire the ORES http endpoint (1/2) [puppet] - https://gerrit.wikimedia.org/r/628802 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[06:47:46] (CR) Muehlenhoff: "The patch is correct, but I think we can remove this script entirely. In the past there were occasional audits between what was in LDAP an" [puppet] - https://gerrit.wikimedia.org/r/624112 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[06:48:23] (PS1) Volans: netbox: improve check_json_file [puppet] - https://gerrit.wikimedia.org/r/629606 (https://phabricator.wikimedia.org/T258729)
[06:49:19] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/624733 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[06:50:35] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:51:59] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:54:27] RECOVERY - Disk space on an-master1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-master1002&var-datasource=eqiad+prometheus/ops
[06:54:38] freed some space --^
[06:55:23] (CR) Volans: [C: +2] "trivial, self merging to cleanup and fix the small case of the missing stale file" [puppet] - https://gerrit.wikimedia.org/r/629606 (https://phabricator.wikimedia.org/T258729) (owner: Volans)
[07:02:26] (PS1) Jcrespo: Release 0.3 version [software/wmfbackups] - https://gerrit.wikimedia.org/r/629607
[07:03:03] (CR) Jcrespo: [C: +2] Release 0.3 version [software/wmfbackups] - https://gerrit.wikimedia.org/r/629607 (owner: Jcrespo)
[07:09:55] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1002 job=burrow partition={0,1,2,3,4,5} prometheus=ops site=eqiad topic={rsyslog-notice,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h
[07:09:55] ar-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[07:10:11] (PS1) Muehlenhoff: Remove access for jkumalah [puppet] - https://gerrit.wikimedia.org/r/629608
[07:12:48] (CR) Muehlenhoff: [C: +2] Remove access for jkumalah [puppet] - https://gerrit.wikimedia.org/r/629608 (owner: Muehlenhoff)
[07:13:18] Operations, observability: Puppet has failure on each run on alert1001 - https://phabricator.wikimedia.org/T263716 (Volans) p:Triage→High
[07:21:04] (Abandoned) Elukey:
sre.hadoop.init-hadoop-workers.py: fix disk labels window [cookbooks] - https://gerrit.wikimedia.org/r/629413 (owner: Elukey)
[07:21:18] (Abandoned) Elukey: sre.hadoop.init-hadoop-workers.py: add journalnode partition [cookbooks] - https://gerrit.wikimedia.org/r/629435 (https://phabricator.wikimedia.org/T262189) (owner: Elukey)
[07:21:33] (PS4) Giuseppe Lavagetto: services: retire the ORES http endpoint (1/2) [puppet] - https://gerrit.wikimedia.org/r/628802 (https://phabricator.wikimedia.org/T244843)
[07:21:35] (PS4) Giuseppe Lavagetto: services: retire the ORES http endpoint (2/2) [puppet] - https://gerrit.wikimedia.org/r/628803 (https://phabricator.wikimedia.org/T244843)
[07:25:11] !log push pfw policies - T263674
[07:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:43] (PS1) Jcrespo: Reorganize remote backups (snapshots) for speedup [puppet] - https://gerrit.wikimedia.org/r/629610 (https://phabricator.wikimedia.org/T257551)
[07:31:12] (PS2) Jcrespo: mariadb-backups: Reorganize remote backups (snapshots) for speedup [puppet] - https://gerrit.wikimedia.org/r/629610 (https://phabricator.wikimedia.org/T257551)
[07:31:47] PROBLEM - Check systemd state on ms-be1056 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:33:14] (PS23) Elukey: profile::hadoop::common: add datanode mountpoints override [puppet] - https://gerrit.wikimedia.org/r/629165
[07:37:03] Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Marostegui) Looks like it arrived \o/: ` Delivered Wednesday 9/23/2020 at 9:57 am `
[07:37:15] Operations, observability: Puppet has failure on each run on alert1001 - https://phabricator.wikimedia.org/T263716 (fgiunchedi) Open→Resolved a:fgiunchedi The host was missing the latest version of `prometheus-icinga-exporter` package, which ships the missing unit. I've upgraded the package a...
[07:39:22] (CR) JMeybohm: [C: +1] service_proxy: enable ipv6 on envoy config [puppet] - https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568) (owner: Effie Mouzeli)
[07:43:37] (CR) JMeybohm: [C: +1] services: retire the ORES http endpoint (1/2) [puppet] - https://gerrit.wikimedia.org/r/628802 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[07:43:41] (CR) JMeybohm: [C: +1] services: retire the ORES http endpoint (2/2) [puppet] - https://gerrit.wikimedia.org/r/628803 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[07:46:52] (CR) Gilles: [C: +1] webperf: new python-ua-parser navtiming dependency (1 comment) [puppet] - https://gerrit.wikimedia.org/r/629436 (https://phabricator.wikimedia.org/T260580) (owner: Dave Pifke)
[07:48:10] (PS1) Volans: netbox: move state file to /var/run [puppet] - https://gerrit.wikimedia.org/r/629613 (https://phabricator.wikimedia.org/T258729)
[07:49:08] !log roll restart logstash codfw, gc death
[07:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:14] !log klausman@cumin1001 START - Cookbook
sre.hosts.downtime
[07:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:45] (CR) Volans: [C: +2] netbox: move state file to /var/run [puppet] - https://gerrit.wikimedia.org/r/629613 (https://phabricator.wikimedia.org/T258729) (owner: Volans)
[07:50:33] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1056 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:51:12] Operations, Scap, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Scap configuration for WDQS should get server groups from a known source or truth - https://phabricator.wikimedia.org/T252124 (Zbyszko) a:Zbyszko
[07:52:15] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:04] (CR) Ema: [C: +1] "We might want to omit http_status_family, but feel free to ignore if you think it could be useful to have dashboards using it!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/629430 (https://phabricator.wikimedia.org/T263536) (owner: Filippo Giunchedi)
[07:53:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2109 (re)pooling @ 33%: Slowly repool db2109 ', diff saved to https://phabricator.wikimedia.org/P12781 and previous config saved to /var/cache/conftool/dbconfig/20200924-075312-root.json
[07:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:21] (CR) Ayounsi: [C: +2] Add vrrp_master_pinning in eqiad [homer/public] - https://gerrit.wikimedia.org/r/629364 (https://phabricator.wikimedia.org/T263212) (owner: Ayounsi)
[07:56:27] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:56:48] (Merged) jenkins-bot: Add vrrp_master_pinning in eqiad [homer/public] - https://gerrit.wikimedia.org/r/629364 (https://phabricator.wikimedia.org/T263212) (owner: Ayounsi)
[07:57:31] !log configure vrrp_master_pinning in eqiad - T263212
[07:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:36] T263212: Consider balancing VRRP primaries to cr1/cr2 - https://phabricator.wikimedia.org/T263212
[07:57:51] !log Remove es2018 from tendril and zarcillo T263613
[07:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:56] T263613: decommission es2012.codfw.wmnet - https://phabricator.wikimedia.org/T263613
[07:58:01] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:58:03] !log volans@cumin1001 START - Cookbook sre.hosts.decommission
[07:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:27] RECOVERY - Check systemd state on ms-be1056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:55] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[08:02:43] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:04:17] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:06:09] (PS3) Filippo Giunchedi: profile: add alertmanager::alerts [puppet] - https://gerrit.wikimedia.org/r/629153 (https://phabricator.wikimedia.org/T258948)
[08:06:45] (PS1) Volans: dns: fix check on argument [software/netbox-extras] - https://gerrit.wikimedia.org/r/629614 (https://phabricator.wikimedia.org/T258729)
[08:07:11] (CR) jerkins-bot: [V: -1] profile: add alertmanager::alerts [puppet] - https://gerrit.wikimedia.org/r/629153 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi)
[08:07:23] * volans looking at netbox1001 systemd
[08:08:12] (CR) Volans: [C: +2] dns: fix check on argument [software/netbox-extras] - https://gerrit.wikimedia.org/r/629614 (https://phabricator.wikimedia.org/T258729) (owner: Volans)
[08:08:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2109 (re)pooling @ 66%: Slowly repool db2109 ', diff saved to https://phabricator.wikimedia.org/P12782 and previous config saved to /var/cache/conftool/dbconfig/20200924-080816-root.json
[08:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:42] (PS1) Ayounsi: Configure vrrp_master_pinning in codfw [homer/public] - https://gerrit.wikimedia.org/r/629615 (https://phabricator.wikimedia.org/T263212)
[08:08:50] (CR) Marostegui: [C: +2] mariadb: Remove es2012, es2018 from puppet [puppet] - https://gerrit.wikimedia.org/r/629503 (https://phabricator.wikimedia.org/T263615) (owner: Marostegui)
[08:08:52] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[08:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:24] (CR) Ayounsi: [C: +2] Configure vrrp_master_pinning in codfw
[homer/public] - https://gerrit.wikimedia.org/r/629615 (https://phabricator.wikimedia.org/T263212) (owner: Ayounsi)
[08:09:25] !log volans@cumin1001 START - Cookbook sre.hosts.decommission
[08:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:47] (Merged) jenkins-bot: Configure vrrp_master_pinning in codfw [homer/public] - https://gerrit.wikimedia.org/r/629615 (https://phabricator.wikimedia.org/T263212) (owner: Ayounsi)
[08:10:08] (PS4) Filippo Giunchedi: profile: add alertmanager::alerts [puppet] - https://gerrit.wikimedia.org/r/629153 (https://phabricator.wikimedia.org/T258948)
[08:10:30] !log installing mariadb-10.1/mariadb-10.3 updates (packaged version from Debian, not the wmf-mariadb variants we used for mysqld)
[08:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:54] Operations, ops-codfw, DBA, decommission-hardware: decommission es2018.codfw.wmnet - https://phabricator.wikimedia.org/T263615 (Marostegui) a:Marostegui→Papaul Ready for #dc-ops
[08:12:19] Operations, ops-codfw, decommission-hardware: decommission es2018.codfw.wmnet - https://phabricator.wikimedia.org/T263615 (Marostegui)
[08:12:31] (PS1) KartikMistry: Enable ContentTranslation in Bashkir Wikipedia as a default tool [mediawiki-config] - https://gerrit.wikimedia.org/r/629616 (https://phabricator.wikimedia.org/T258504)
[08:14:38] (CR) Filippo Giunchedi: "> Patch Set 1: Code-Review+1" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/629430 (https://phabricator.wikimedia.org/T263536) (owner: Filippo Giunchedi)
[08:15:01] !log configure vrrp_master_pinning in codfw - T263212
[08:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:06] T263212: Consider balancing VRRP primaries to cr1/cr2 - https://phabricator.wikimedia.org/T263212
[08:15:24] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[08:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:58] Operations, ops-codfw, decommission-hardware: decommission es2018.codfw.wmnet - https://phabricator.wikimedia.org/T263615 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by volans@cumin1001 for hosts: `es2018.codfw.wmnet` - es2018.codfw.wmnet (**FAIL**) - Downtimed host on Icinga -...
[08:17:53] !log volans@cumin1001 START - Cookbook sre.dns.netbox
[08:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:22] (PS1) Marostegui: dns: Remove es2018 [dns] - https://gerrit.wikimedia.org/r/629618 (https://phabricator.wikimedia.org/T263615)
[08:19:36] (CR) Arturo Borrero Gonzalez: "overall looks good to me. Some comments inline." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/629472 (https://phabricator.wikimedia.org/T263680) (owner: Andrew Bogott)
[08:19:45] (CR) Marostegui: [C: +2] dns: Remove es2018 [dns] - https://gerrit.wikimedia.org/r/629618 (https://phabricator.wikimedia.org/T263615) (owner: Marostegui)
[08:20:01] PROBLEM - puppet last run on mw2241 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:07] PROBLEM - puppet last run on mw1393 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:07] PROBLEM - puppet last run on mw1364 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:07] PROBLEM - puppet last run on wtp1031 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:08] PROBLEM - puppet last run on mw1318 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:08] PROBLEM - puppet last run on snapshot1007 is CRITICAL:
CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:08] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:08] PROBLEM - puppet last run on mw2293 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:08] (CR) Arturo Borrero Gonzalez: [C: +1] Remove stretch-backports from bootstrapvz config [puppet] - https://gerrit.wikimedia.org/r/610121 (https://phabricator.wikimedia.org/T256881) (owner: Muehlenhoff)
[08:20:09] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:09] PROBLEM - puppet last run on mw2294 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:11] PROBLEM - puppet last run on restbase2023 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:15] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:15] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:15] PROBLEM - puppet last run on mw2287 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:15] PROBLEM - puppet last run on mw2262 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:15] PROBLEM - puppet last run on mw2257 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:17] PROBLEM - puppet last run on mw1313 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:23] PROBLEM - puppet last run on mw1400 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:23] PROBLEM - puppet last run on mw1402 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:23] PROBLEM - puppet last run on restbase1026 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:23] PROBLEM - puppet last run on mw1381 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:23] PROBLEM - puppet last run on wtp1041 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:24] PROBLEM - puppet last run on ores1007 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:24] PROBLEM - puppet last run on mw1339 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:25] PROBLEM - puppet last run on wtp1039 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:25] PROBLEM - puppet last run on wtp1033 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:26] PROBLEM - puppet last run on scandium is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:34] uh?
[08:20:35] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:39] <_joe_> that's me, sorry [08:20:51] <_joe_> I actually reenabled puppet [08:20:55] PROBLEM - puppet last run on mw2331 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:20:55] PROBLEM - puppet last run on mw2362 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:20:55] PROBLEM - puppet last run on parse2016 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:20:55] PROBLEM - puppet last run on wtp2016 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:20:59] PROBLEM - puppet last run on labweb1001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:20:59] PROBLEM - puppet last run on mw1371 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:20:59] PROBLEM - puppet last run on mw1380 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:20:59] PROBLEM - puppet last run on mw1388 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:20:59] PROBLEM - puppet last run on mw1265 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:00] PROBLEM - puppet last run on mw1409 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:00] PROBLEM - puppet last run on mw1288 is CRITICAL: CRITICAL: Puppet last ran 1 
day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:01] PROBLEM - puppet last run on mw1378 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:01] PROBLEM - puppet last run on mw1342 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:02] PROBLEM - puppet last run on mw1337 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:02] PROBLEM - puppet last run on restbase1023 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:03] PROBLEM - puppet last run on wtp1045 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:03] PROBLEM - puppet last run on wtp2009 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:05] PROBLEM - puppet last run on wtp1047 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:05] PROBLEM - puppet last run on wtp1027 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:05] PROBLEM - puppet last run on mw2303 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:05] PROBLEM - puppet last run on mw2295 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:06] PROBLEM - puppet last run on mw2270 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:06] PROBLEM - puppet last run on mw2238 is CRITICAL: CRITICAL: Puppet last ran 1 day ago 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:07] PROBLEM - puppet last run on parse2018 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:07] PROBLEM - puppet last run on ores2005 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:08] PROBLEM - puppet last run on mw1350 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:08] PROBLEM - puppet last run on mw1395 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:09] PROBLEM - puppet last run on mw1399 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:09] PROBLEM - puppet last run on restbase1024 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:11] PROBLEM - puppet last run on mw1268 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:11] PROBLEM - puppet last run on mw1387 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:11] PROBLEM - puppet last run on mw1261 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:11] PROBLEM - puppet last run on mw1343 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:12] PROBLEM - puppet last run on restbase1017 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:12] PROBLEM - puppet last run on wtp1036 is CRITICAL: CRITICAL: Puppet last ran 1 day ago 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:15] PROBLEM - puppet last run on mw2307 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:15] PROBLEM - puppet last run on mw2373 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:15] PROBLEM - puppet last run on parse2020 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:15] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:17] PROBLEM - puppet last run on wtp2001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:17] PROBLEM - puppet last run on mw2271 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:19] PROBLEM - puppet last run on restbase2014 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:19] PROBLEM - puppet last run on restbase2011 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:23] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:23] PROBLEM - puppet last run on mw1283 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:23] PROBLEM - puppet last run on mw1310 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:23] PROBLEM - puppet last run on mw1303 is CRITICAL: CRITICAL: Puppet last ran 1 day ago 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:23] PROBLEM - puppet last run on mw1316 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:24] PROBLEM - puppet last run on mw1290 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:25] PROBLEM - puppet last run on restbase-dev1004 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:25] PROBLEM - puppet last run on wtp1043 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:25] PROBLEM - puppet last run on restbase2018 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:26] PROBLEM - puppet last run on wtp2012 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:27] <_joe_> sigh [08:21:27] PROBLEM - puppet last run on mw1352 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:27] PROBLEM - puppet last run on mw1361 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:27] bye bye icinga-wm [08:21:28] 10Operations, 10SRE-swift-storage, 10Patch-For-Review, 10User-fgiunchedi: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 (10fgiunchedi) Status update: the rebalancing is going well and the host is behaving as expected as far as I can tell. With the current capacity we'l... 
[08:21:31] PROBLEM - puppet last run on mw1353 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:31] PROBLEM - puppet last run on mw1359 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:31] PROBLEM - puppet last run on mw1317 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:31] PROBLEM - puppet last run on mw1367 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:31] PROBLEM - puppet last run on mw1373 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:32] PROBLEM - puppet last run on mw1345 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:32] PROBLEM - puppet last run on mw1322 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:33] PROBLEM - puppet last run on mw1374 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:33] PROBLEM - puppet last run on wtp1026 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:34] PROBLEM - puppet last run on mw2244 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:37] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:37] PROBLEM - puppet last run on mw1334 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:37] PROBLEM 
- puppet last run on mw1328 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:37] PROBLEM - puppet last run on mw2273 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:39] PROBLEM - puppet last run on mw2376 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:40] <_joe_> this check is just so bad :P [08:21:41] PROBLEM - puppet last run on mw1407 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:43] PROBLEM - puppet last run on mw1329 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:43] PROBLEM - puppet last run on mw1335 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:43] PROBLEM - puppet last run on mw1338 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:43] I blame kormat [08:21:45] PROBLEM - puppet last run on mw2319 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:45] PROBLEM - puppet last run on mw2314 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:45] PROBLEM - puppet last run on mw2263 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:45] PROBLEM - puppet last run on ores1006 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:45] PROBLEM - puppet last run on parse2009 is CRITICAL: CRITICAL: Puppet last ran 1 day ago 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:45] PROBLEM - puppet last run on parse2005 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:47] PROBLEM - puppet last run on wtp2019 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:47] PROBLEM - puppet last run on wtp2014 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:47] PROBLEM - puppet last run on mw1299 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:47] PROBLEM - puppet last run on mw2337 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:48] PROBLEM - puppet last run on mw2264 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:49] PROBLEM - puppet last run on mw1390 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:49] PROBLEM - puppet last run on mw1404 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:49] PROBLEM - puppet last run on ores2004 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:50] PROBLEM - puppet last run on mw1354 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:50] PROBLEM - puppet last run on mw1362 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:51] PROBLEM - puppet last run on mw1383 is CRITICAL: CRITICAL: Puppet last ran 1 day ago 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:51] PROBLEM - puppet last run on mw1365 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:22:05] so the check is green when puppet doesn't run and starts screaming when it does? [08:22:16] <_joe_> ema: yes, because it's disabled [08:22:19] excellent [08:22:39] <_joe_> now it's enabled and not running since yesterday afternoon when I started my string of meetings [08:23:11] PROBLEM - puppet last run on mw1267 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:11] PROBLEM - puppet last run on mw1272 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:11] PROBLEM - puppet last run on mw1274 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:11] PROBLEM - puppet last run on restbase1021 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:11] PROBLEM - puppet last run on wtp1028 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:12] PROBLEM - puppet last run on wtp1044 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:12] PROBLEM - puppet last run on mw2312 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:13] PROBLEM - puppet last run on mw1382 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:13] PROBLEM - puppet last run on mw1358 is CRITICAL: CRITICAL: Puppet last ran 1 day ago 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:13] PROBLEM - puppet last run on restbase1019 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:15] PROBLEM - puppet last run on cloudweb2001-dev is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:15] yeah definitely a bug going critical after being reenabled, sigh [08:23:17] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:19] PROBLEM - puppet last run on mw2366 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:19] PROBLEM - puppet last run on mw1396 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:19] PROBLEM - puppet last run on mw1263 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:19] PROBLEM - puppet last run on mw2296 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:19] PROBLEM - puppet last run on mw2215 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:19] PROBLEM - puppet last run on mw2253 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:19] PROBLEM - puppet last run on mw2279 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2109 (re)pooling @ 100%: Slowly repool db2109 ', diff saved to https://phabricator.wikimedia.org/P12783 and previous config 
saved to /var/cache/conftool/dbconfig/20200924-082319-root.json [08:23:20] PROBLEM - puppet last run on mw2259 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:21] PROBLEM - puppet last run on mw2225 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:21] PROBLEM - puppet last run on mw2243 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:21] PROBLEM - puppet last run on mw2274 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:22] PROBLEM - puppet last run on mw2275 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:22] PROBLEM - puppet last run on mw2232 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:25] PROBLEM - puppet last run on restbase1018 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:27] PROBLEM - puppet last run on ores1005 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:27] PROBLEM - puppet last run on parse2004 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:27] PROBLEM - puppet last run on parse2006 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:27] PROBLEM - puppet last run on parse2008 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:27] 
PROBLEM - puppet last run on parse2019 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:29] PROBLEM - puppet last run on ores2003 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:29] PROBLEM - puppet last run on restbase2022 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:33] PROBLEM - puppet last run on mw1302 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:35] PROBLEM - puppet last run on mw2252 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:35] PROBLEM - puppet last run on mw2304 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:35] PROBLEM - puppet last run on mw2339 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:35] PROBLEM - puppet last run on mw2297 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:35] PROBLEM - puppet last run on mw2363 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:35] PROBLEM - puppet last run on mw2356 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:36] PROBLEM - puppet last run on mw2370 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:36] PROBLEM - puppet last run on mw2325 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:36] PROBLEM 
- puppet last run on mw2283 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:37] PROBLEM - puppet last run on wtp2003 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:37] PROBLEM - puppet last run on wtp2006 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:38] PROBLEM - puppet last run on mw2237 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:38] PROBLEM - puppet last run on restbase2017 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:45] PROBLEM - puppet last run on mw2224 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:45] PROBLEM - puppet last run on mw2321 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:45] PROBLEM - puppet last run on mw2311 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:45] PROBLEM - puppet last run on mw2290 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:45] PROBLEM - puppet last run on mwdebug2002 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:45] <_joe_> gosh [08:23:49] PROBLEM - puppet last run on mw1379 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:51] PROBLEM - puppet last run on mw1264 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun 
[08:23:51] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:51] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:51] PROBLEM - puppet last run on mw1330 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:51] PROBLEM - puppet last run on wtp1034 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:51] PROBLEM - puppet last run on restbase2020 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:23:59] PROBLEM - puppet last run on mw1276 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:02] <_joe_> it's all mw and restbase servers :/ [08:24:12] and Parsoid :-) [08:24:15] PROBLEM - puppet last run on mw1391 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:15] PROBLEM - puppet last run on mw1398 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:15] PROBLEM - puppet last run on mw1286 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:15] PROBLEM - puppet last run on mw1271 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:15] PROBLEM - puppet last run on mw1273 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:21] <_joe_> those are mw servers now :P [08:24:25] PROBLEM - puppet last run on mw1355 
is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:25] PROBLEM - puppet last run on mw1363 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:25] PROBLEM - puppet last run on mw1370 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:25] PROBLEM - puppet last run on mw1384 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:25] PROBLEM - puppet last run on mw1377 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:25] PROBLEM - puppet last run on mw2280 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:25] PROBLEM - puppet last run on mw2254 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:26] PROBLEM - puppet last run on mw2313 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:26] PROBLEM - puppet last run on mw2292 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:27] PROBLEM - puppet last run on mw2320 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:27] PROBLEM - puppet last run on mw2327 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:28] PROBLEM - puppet last run on mw2281 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:29] PROBLEM - puppet last run on mw2306 is CRITICAL: 
CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:29] PROBLEM - puppet last run on mw2260 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:29] PROBLEM - puppet last run on mw2285 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:30] PROBLEM - puppet last run on restbase2015 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:30] PROBLEM - puppet last run on mw1385 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:31] PROBLEM - puppet last run on mw1336 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:31] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:32] PROBLEM - puppet last run on mw1301 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:32] PROBLEM - puppet last run on mw1333 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:33] PROBLEM - puppet last run on mw1314 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:33] PROBLEM - puppet last run on ores1008 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:34] PROBLEM - puppet last run on mwdebug1002 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:34] PROBLEM - puppet last run on ores1001 is CRITICAL: 
CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:35] PROBLEM - puppet last run on wtp1029 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:35] PROBLEM - puppet last run on wtp1032 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:36] PROBLEM - puppet last run on mw2330 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:36] PROBLEM - puppet last run on mw2316 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:37] PROBLEM - puppet last run on parse2001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:37] PROBLEM - puppet last run on mw1405 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:38] PROBLEM - puppet last run on mw1412 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:38] PROBLEM - puppet last run on ores1009 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:39] PROBLEM - puppet last run on mw1266 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:39] PROBLEM - puppet last run on mw2334 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:40] PROBLEM - puppet last run on mw2338 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:40] PROBLEM - puppet last run on mw2358 is CRITICAL: CRITICAL: 
Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:41] PROBLEM - puppet last run on mw2351 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:41] PROBLEM - puppet last run on mw2372 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:42] PROBLEM - puppet last run on mw2229 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:42] PROBLEM - puppet last run on mw2371 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2127 for MCR schema change', diff saved to https://phabricator.wikimedia.org/P12784 and previous config saved to /var/cache/conftool/dbconfig/20200924-082443-marostegui.json [08:24:43] PROBLEM - puppet last run on parse2012 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:44] PROBLEM - puppet last run on parse2003 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:44] PROBLEM - puppet last run on mw2282 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:44] PROBLEM - puppet last run on wtp2018 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:45] PROBLEM - puppet last run on ores2006 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:45] PROBLEM - puppet last run on wtp2004 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun 
[08:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:46] PROBLEM - puppet last run on ores2007 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:46] PROBLEM - puppet last run on wtp2010 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:47] PROBLEM - puppet last run on snapshot1008 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:47] PROBLEM - puppet last run on mw2220 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:48] PROBLEM - puppet last run on mw2227 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:53] PROBLEM - puppet last run on snapshot1009 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:53] PROBLEM - puppet last run on restbase1022 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:53] PROBLEM - puppet last run on restbase-dev1005 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:53] PROBLEM - puppet last run on wtp1037 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:55] PROBLEM - puppet last run on snapshot1010 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:57] PROBLEM - puppet last run on mw1349 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:25:01] PROBLEM - puppet last run on mw2359 is CRITICAL: 
CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:25:01] PROBLEM - puppet last run on restbase2016 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:25:01] PROBLEM - puppet last run on mw2246 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:25:01] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:25:01] RECOVERY - puppet last run on mw2241 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:25:03] PROBLEM - puppet last run on restbase2009 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:25:09] PROBLEM - puppet last run on mw1308 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:25:09] RECOVERY - puppet last run on mw1318 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:25:09] PROBLEM - puppet last run on mw1294 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:25:09] PROBLEM - puppet last run on mw1344 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:25:09] PROBLEM - puppet last run on mw2335 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:25:11] PROBLEM - puppet last run on mw2230 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun 
[08:25:13] PROBLEM - puppet last run on mw2368 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:25:17] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:25:17] RECOVERY - puppet last run on mw2257 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:25:55] RECOVERY - puppet last run on wtp2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:25:59] RECOVERY - puppet last run on mw1288 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:25:59] RECOVERY - puppet last run on restbase1023 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:26:01] RECOVERY - puppet last run on wtp2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:26:05] RECOVERY - puppet last run on mw2295 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:26:11] RECOVERY - puppet last run on mw1387 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:26:17] Operations, Traffic: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557 (ema) [08:26:19] RECOVERY - puppet last run on wtp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 
failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:26:25] RECOVERY - puppet last run on mw1303 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:26:25] RECOVERY - puppet last run on mw1283 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:26:27] RECOVERY - puppet last run on restbase2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:26:33] RECOVERY - puppet last run on mw1359 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:26:49] RECOVERY - puppet last run on mw1390 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:26:49] RECOVERY - puppet last run on mw2337 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:26:49] RECOVERY - puppet last run on ores2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:26:51] RECOVERY - puppet last run on mw1354 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:26:51] RECOVERY - puppet last run on mw1365 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:26:51] RECOVERY - puppet last run on mw1340 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:26:55] RECOVERY - puppet last run on mw2222 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:26:57] RECOVERY - puppet last run on mw2258 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:26:57] RECOVERY - puppet last run on mw2217 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:27:01] RECOVERY - puppet last run on restbase2013 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:27:09] RECOVERY - puppet last run on mw1392 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:27:13] RECOVERY - puppet last run on wtp2002 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:27:15] RECOVERY - puppet last run on parse2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:27:15] RECOVERY - puppet last run on mw2247 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:27:25] PROBLEM - puppet last run on parse2011 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:27:29] PROBLEM - puppet last run on wtp1025 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:27:39] RECOVERY 
- puppet last run on mw1356 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:27:41] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:27:45] RECOVERY - puppet last run on mw1368 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:27:47] RECOVERY - puppet last run on mw2361 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:27:53] RECOVERY - puppet last run on deploy1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:27:55] RECOVERY - puppet last run on mw2317 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:27:55] RECOVERY - puppet last run on mw2221 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:27:55] RECOVERY - puppet last run on mw2324 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:28:05] RECOVERY - puppet last run on mw2355 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:28:17] RECOVERY - puppet last run on mw1358 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:28:21] RECOVERY - puppet last run on mw1319 is OK: 
OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:28:23] RECOVERY - puppet last run on mw2296 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:28:37] RECOVERY - puppet last run on mw2363 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:28:37] RECOVERY - puppet last run on mw2356 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:28:49] RECOVERY - puppet last run on mwdebug2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:28:53] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:28:53] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:07] (PS1) Muehlenhoff: debdeploy: Skip restart detection if no library base names are specified [debs/debdeploy] - https://gerrit.wikimedia.org/r/629620 [08:29:25] PROBLEM - puppet last run on mw1309 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:29] RECOVERY - puppet last run on mw1370 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:29] RECOVERY - puppet last run on mw1384 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:29] RECOVERY - puppet last run on mw1363 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:29] PROBLEM - puppet last run on mw2269 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:29] PROBLEM - puppet last run on mw2291 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:29] PROBLEM - puppet last run on mw2278 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:29] RECOVERY - puppet last run on mw2280 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:30] RECOVERY - puppet last run on mw2285 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:30] RECOVERY - puppet last run on mw2281 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:33] RECOVERY - puppet last run on mw1385 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:33] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:33] RECOVERY - puppet last run on mw1314 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:36] Operations, Traffic: Upgrade production cache 
nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557 (ema) cp4027 has been running fine since yesterday with Varnish 6.0.6. Performance-wise there's no impact either, I've added a panel with p75 response time comparison to [[https://grafana.wikimedi... [08:29:37] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:41] RECOVERY - puppet last run on mw2358 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:41] RECOVERY - puppet last run on parse2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:41] RECOVERY - puppet last run on mw2229 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:41] RECOVERY - puppet last run on ores2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:41] RECOVERY - puppet last run on wtp2010 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:59] RECOVERY - puppet last run on snapshot1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:30:03] RECOVERY - puppet last run on mw1349 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:30:07] RECOVERY - puppet last run on restbase2016 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:30:07] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:30:15] RECOVERY - puppet last run on mw1364 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:30:15] RECOVERY - puppet last run on mw1308 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:30:23] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:30:23] RECOVERY - puppet last run on mw2287 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:30:29] RECOVERY - puppet last run on ores1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:30:29] RECOVERY - puppet last run on scandium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:30:40] (Abandoned) Ema: varnishkafka 1.0.15 [software/varnish/varnishkafka] (debian) - https://gerrit.wikimedia.org/r/626177 (https://phabricator.wikimedia.org/T261632) (owner: Vgutierrez) [08:31:01] RECOVERY - puppet last run on mw1265 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:31:08] (CR) Arturo Borrero Gonzalez: [C: +1] remove shinken module, profile, role [puppet] - https://gerrit.wikimedia.org/r/629464 
(https://phabricator.wikimedia.org/T236547) (owner: Dzahn) [08:31:15] RECOVERY - puppet last run on restbase1017 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:31:19] RECOVERY - puppet last run on mw2373 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:31:29] RECOVERY - puppet last run on mw1316 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:31:29] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:31:29] RECOVERY - puppet last run on wtp2012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:31:39] RECOVERY - puppet last run on mw1373 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:31:39] PROBLEM - puppet last run on mw2218 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:31:39] RECOVERY - puppet last run on wtp1026 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:31:39] RECOVERY - puppet last run on mw1322 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:31:43] Operations, Puppet, observability: Notification spam from "last puppet run" upon re-enabling puppet - https://phabricator.wikimedia.org/T263720 (fgiunchedi) [08:31:45] 
Operations, Puppet, observability: Notification spam from "last puppet run" upon re-enabling puppet - https://phabricator.wikimedia.org/T263720 (fgiunchedi) [08:31:47] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:31:55] RECOVERY - puppet last run on mw2319 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:31:59] RECOVERY - puppet last run on mw1404 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:32:01] RECOVERY - puppet last run on restbase1025 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:32:01] RECOVERY - puppet last run on mw1311 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:32:05] RECOVERY - puppet last run on mw1326 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:32:05] RECOVERY - puppet last run on mw2357 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:32:05] RECOVERY - puppet last run on restbase2019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:32:05] RECOVERY - puppet last run on mw2288 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:32:05] RECOVERY - puppet last run on mw1346 is OK: OK: 
Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:32:19] RECOVERY - puppet last run on mw2277 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:32:25] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:32:29] RECOVERY - puppet last run on mw1403 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:32:29] RECOVERY - puppet last run on mw2240 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:32:29] RECOVERY - puppet last run on mw2255 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:32:35] RECOVERY - puppet last run on parse2011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:32:41] RECOVERY - puppet last run on labweb1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:32:55] RECOVERY - puppet last run on mw1413 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:33:03] RECOVERY - puppet last run on mw1287 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:33:05] RECOVERY - puppet last run on wtp2007 is OK: OK: Puppet is currently enabled, 
last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:33:07] RECOVERY - puppet last run on deploy2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:33:13] RECOVERY - puppet last run on mw2223 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:33:21] RECOVERY - puppet last run on wtp1044 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:33:25] RECOVERY - puppet last run on restbase1019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:33:31] RECOVERY - puppet last run on mw1263 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:33:31] RECOVERY - puppet last run on mw2225 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:33:31] RECOVERY - puppet last run on mw2253 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:33:33] RECOVERY - puppet last run on mw2274 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:33:37] RECOVERY - puppet last run on parse2006 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:33:47] RECOVERY - puppet last run on mw2325 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 
0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:33:47] RECOVERY - puppet last run on mw2370 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:33:47] RECOVERY - puppet last run on mw2252 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:33:47] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1056 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:33:49] RECOVERY - puppet last run on wtp2006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:33:51] PROBLEM - puppet last run on mw2276 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:33:55] PROBLEM - puppet last run on mw2374 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:33:57] RECOVERY - puppet last run on mw2290 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:34:01] RECOVERY - puppet last run on mw1379 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:34:37] RECOVERY - puppet last run on mw1309 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:34:38] (PS1) Marostegui: filtered_tables.txt: Update columns after MCR changes [puppet] - https://gerrit.wikimedia.org/r/629621 (https://phabricator.wikimedia.org/T238966) [08:34:39] RECOVERY - puppet 
last run on mw1377 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:34:41] RECOVERY - puppet last run on mw2327 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:34:45] RECOVERY - puppet last run on ores1008 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:34:45] RECOVERY - puppet last run on mw2330 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:34:51] RECOVERY - puppet last run on mw1412 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:34:53] RECOVERY - puppet last run on mw2372 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:34:53] RECOVERY - puppet last run on wtp2004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:34:53] RECOVERY - puppet last run on ores2006 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:35:01] RECOVERY - puppet last run on mw2220 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:35:19] RECOVERY - puppet last run on mw2246 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:35:24] (CR) jerkins-bot: [V: -1] filtered_tables.txt: 
Update columns after MCR changes [puppet] - https://gerrit.wikimedia.org/r/629621 (https://phabricator.wikimedia.org/T238966) (owner: Marostegui) [08:35:25] RECOVERY - puppet last run on wtp1031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:35:25] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:35:33] RECOVERY - puppet last run on mw2262 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:35:39] RECOVERY - puppet last run on mw1402 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:35:41] RECOVERY - puppet last run on mw1381 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:35:41] RECOVERY - puppet last run on mw1339 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:36:05] PROBLEM - puppet last run on parse2015 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:36:09] RECOVERY - puppet last run on mw2362 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:36:13] RECOVERY - puppet last run on mw1380 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:36:13] RECOVERY - puppet last run on mw1371 is OK: OK: Puppet is currently enabled, last run 2 minutes ago 
with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:36:19] RECOVERY - puppet last run on mw1350 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:36:19] RECOVERY - puppet last run on parse2018 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:36:19] RECOVERY - puppet last run on mw2303 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:36:21] RECOVERY - puppet last run on restbase1024 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:36:35] RECOVERY - puppet last run on mw2271 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:36:43] RECOVERY - puppet last run on restbase-dev1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:36:43] RECOVERY - puppet last run on mw1310 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:36:45] RECOVERY - puppet last run on mw1352 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:36:53] RECOVERY - puppet last run on mw1353 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:36:53] RECOVERY - puppet last run on mw1374 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:36:53] RECOVERY - puppet last run on mw1345 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:36:53] RECOVERY - puppet last run on mw2218 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:36:53] RECOVERY - puppet last run on mw2244 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:37:03] (PS2) Marostegui: filtered_tables.txt: Update columns after MCR changes [puppet] - https://gerrit.wikimedia.org/r/629621 (https://phabricator.wikimedia.org/T238966) [08:37:09] RECOVERY - puppet last run on parse2005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:37:13] RECOVERY - puppet last run on wtp1046 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:37:15] RECOVERY - puppet last run on mw1351 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:37:15] RECOVERY - puppet last run on mw1369 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:37:15] RECOVERY - puppet last run on restbase-dev1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:37:15] RECOVERY - puppet last run on mw1293 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:37:15] RECOVERY - puppet last run on mw2350 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:37:19] RECOVERY - puppet last run on mw1411 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:37:19] RECOVERY - puppet last run on mw1284 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:37:21] RECOVERY - puppet last run on ores2008 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:37:23] Operations, Puppet, observability: Notification spam from "last puppet run" upon re-enabling puppet - https://phabricator.wikimedia.org/T263720 (Joe) The most obvious suggestion I can give is we should have a shorter grace period for puppet being disabled than the one we have for individual alerts ab... 
[08:37:41] RECOVERY - puppet last run on wtp1042 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:37:41] RECOVERY - puppet last run on parse2013 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:37:43] RECOVERY - puppet last run on mw2251 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:37:45] RECOVERY - puppet last run on mw2236 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:37:45] RECOVERY - puppet last run on mw2233 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:37:45] RECOVERY - puppet last run on mw2250 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:37:53] RECOVERY - puppet last run on mw1357 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:37:55] RECOVERY - puppet last run on mw2245 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:38:15] RECOVERY - puppet last run on mw2286 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:38:15] PROBLEM - puppet last run on mw1315 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:38:16] !log kormat@cumin1001 START - Cookbook 
sre.hosts.downtime [08:38:17] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:23] RECOVERY - puppet last run on wtp1035 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:25] RECOVERY - puppet last run on wtp2013 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:38:29] RECOVERY - puppet last run on mw1279 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:38:37] RECOVERY - puppet last run on mw1267 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:38:37] RECOVERY - puppet last run on mw1274 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:38:47] RECOVERY - puppet last run on mw2366 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:38:47] RECOVERY - puppet last run on mw2279 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:38:47] RECOVERY - puppet last run on mw2259 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:38:55] RECOVERY - puppet last run on restbase1018 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:38:55] Puppet: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (Volans) Totally agree that postgres issues should show on both eqiad and codfw so most likely a red herring, nevertheless the sizes seemed a bit too large for the data. But I agree to not go down that path if not n... [08:38:57] RECOVERY - puppet last run on ores2003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:39:05] RECOVERY - puppet last run on mw2283 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:39:13] RECOVERY - puppet last run on mw2374 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:39:21] RECOVERY - puppet last run on restbase2020 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:39:25] (PS1) Muehlenhoff: Add library hint for mariadb-10.1 [puppet] - https://gerrit.wikimedia.org/r/629622 [08:39:29] RECOVERY - puppet last run on mw1276 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:39:54] (PS1) Ema: varnish: upgrade upload labs node to v6 [puppet] - https://gerrit.wikimedia.org/r/629623 (https://phabricator.wikimedia.org/T263557) [08:39:55] RECOVERY - puppet last run on mw2313 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:39:55] RECOVERY - puppet last run on mw2320 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:39:55] RECOVERY - puppet last run on restbase2015 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:40:01] RECOVERY - puppet last run on wtp1032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:40:01] RECOVERY - puppet last run on wtp1029 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:40:01] RECOVERY - puppet last run on mw2316 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:40:07] RECOVERY - puppet last run on mw2334 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:40:21] PROBLEM - puppet last run on mw1281 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:40:25] PROBLEM - puppet last run on mw1401 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:40:29] RECOVERY - puppet last run on restbase1022 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:40:40] Operations, Gerrit, LDAP-Access-Requests: Add hashar to `archiva-deployers` LDAP group - https://phabricator.wikimedia.org/T263721 (hashar) [08:40:45] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:40:45] RECOVERY - puppet last run on mw2293 
is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:40:47] RECOVERY - puppet last run on mw2294 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:40:47] RECOVERY - puppet last run on mw2230 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:40:55] RECOVERY - puppet last run on mw1313 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:40:59] RECOVERY - puppet last run on mw1400 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:40:59] RECOVERY - puppet last run on restbase1026 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:40:59] RECOVERY - puppet last run on wtp1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:41:03] (CR) Ema: [C: +2] varnish: upgrade upload labs node to v6 [puppet] - https://gerrit.wikimedia.org/r/629623 (https://phabricator.wikimedia.org/T263557) (owner: Ema) [08:41:07] (CR) Muehlenhoff: [C: +2] Add library hint for mariadb-10.1 [puppet] - https://gerrit.wikimedia.org/r/629622 (owner: Muehlenhoff) [08:41:23] RECOVERY - puppet last run on parse2015 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:41:25] moritzm: yes please :) [08:41:25] RECOVERY - puppet last run on mw2331 is OK: OK: Puppet is currently enabled, last run 5 
minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:41:26] ema: shall I merge along? [08:41:29] RECOVERY - puppet last run on mw1388 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:41:29] RECOVERY - puppet last run on mw1378 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:41:33] done [08:41:34] Operations, Analytics, LDAP-Access-Requests: Grant access to archiva-deployers for mstyles - https://phabricator.wikimedia.org/T242624 (hashar) [08:41:35] RECOVERY - puppet last run on mw1399 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:41:37] RECOVERY - puppet last run on ores2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:41:37] RECOVERY - puppet last run on mw2238 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:41:42] Operations, Analytics, LDAP-Access-Requests: Grant access to archiva-deployers for zpapierski - https://phabricator.wikimedia.org/T242622 (hashar) [08:41:45] thanks [08:41:47] RECOVERY - puppet last run on parse2020 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:42:15] Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Marostegui) @Papaul can you coordinate with @Kormat for this? I will be off from today's evening till Monday, so if you need something from us... 
[08:42:27] RECOVERY - puppet last run on mw2314 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:42:31] RECOVERY - puppet last run on ores1006 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:42:31] RECOVERY - puppet last run on wtp2014 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:42:31] RECOVERY - puppet last run on wtp2019 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:42:32] (PS2) KartikMistry: Enable ContentTranslation in Bashkir, Urdu and Welsh WPs as a default tool [mediawiki-config] - https://gerrit.wikimedia.org/r/629616 (https://phabricator.wikimedia.org/T258504) [08:42:35] Operations, Gerrit, LDAP-Access-Requests: Add hashar to `archiva-deployers` LDAP group - https://phabricator.wikimedia.org/T263721 (hashar) [08:42:37] RECOVERY - puppet last run on mw1277 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:42:37] RECOVERY - puppet last run on mw1285 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:42:37] RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:42:43] RECOVERY - puppet last run on mw1389 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:42:43] RECOVERY - puppet last run on restbase2012 
is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:42:51] RECOVERY - puppet last run on mw1406 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:42:55] RECOVERY - puppet last run on mw2375 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:42:55] RECOVERY - puppet last run on mw2235 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:42:55] RECOVERY - puppet last run on mw2228 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:43:07] RECOVERY - puppet last run on wtp2011 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:43:09] RECOVERY - puppet last run on wtp1030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:43:09] RECOVERY - puppet last run on mw2226 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:43:09] RECOVERY - puppet last run on mw2249 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:43:19] RECOVERY - puppet last run on mw1410 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:43:19] RECOVERY - puppet last run on mw1386 is OK: OK: Puppet is currently 
enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:43:19] RECOVERY - puppet last run on mw1366 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:43:41] RECOVERY - puppet last run on mw1315 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:43:43] RECOVERY - puppet last run on ores2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:43:49] RECOVERY - puppet last run on mw2352 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:43:49] RECOVERY - puppet last run on restbase1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:43:49] RECOVERY - puppet last run on wtp2020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:44:01] RECOVERY - puppet last run on restbase1021 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:44:03] RECOVERY - puppet last run on wtp1028 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:44:05] RECOVERY - puppet last run on mw2312 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:44:09] RECOVERY - puppet last run on cloudweb2001-dev is OK: OK: Puppet is currently enabled, last 
run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:44:11] RECOVERY - puppet last run on mw2215 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:44:15] RECOVERY - puppet last run on mw2275 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:44:19] RECOVERY - puppet last run on parse2019 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:44:19] RECOVERY - puppet last run on parse2004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:44:19] RECOVERY - puppet last run on parse2008 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:44:27] RECOVERY - puppet last run on mw2339 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:44:27] RECOVERY - puppet last run on mw2297 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:44:31] RECOVERY - puppet last run on restbase2017 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:44:41] RECOVERY - puppet last run on mw2311 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:44:47] RECOVERY - puppet last run on mw1330 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 
failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:44:53] PROBLEM - puppet last run on mw2301 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:45:09] RECOVERY - puppet last run on mw1391 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:45:09] RECOVERY - puppet last run on mw1271 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:45:09] RECOVERY - puppet last run on mw1286 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:45:09] RECOVERY - puppet last run on mw1273 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:45:18] * apergos watches bemusedly [08:45:21] RECOVERY - puppet last run on mw2291 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:45:21] RECOVERY - puppet last run on mw2306 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:45:21] RECOVERY - puppet last run on mw2292 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:45:21] RECOVERY - puppet last run on mw2269 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:45:21] RECOVERY - puppet last run on mw2254 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:45:21] RECOVERY - puppet last run on mw2278 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:45:27] RECOVERY - puppet last run on mwdebug1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:45:27] RECOVERY - puppet last run on ores1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:45:27] RECOVERY - puppet last run on mw1301 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:45:27] RECOVERY - puppet last run on mw1333 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:45:27] RECOVERY - puppet last run on parse2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:45:31] RECOVERY - puppet last run on mw1405 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:45:35] RECOVERY - puppet last run on mw2351 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:45:35] RECOVERY - puppet last run on mw2282 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:45:43] RECOVERY - puppet last run on mw2227 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:45:43] (CR) Filippo Giunchedi: [C: +1] "Thank you so much Daniel for cleaning this up!" [puppet] - https://gerrit.wikimedia.org/r/629464 (https://phabricator.wikimedia.org/T236547) (owner: Dzahn) [08:46:09] RECOVERY - puppet last run on mw1344 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:46:11] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:46:11] RECOVERY - puppet last run on mw2335 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:46:13] RECOVERY - puppet last run on snapshot1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:46:15] RECOVERY - puppet last run on mw2368 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:46:15] RECOVERY - puppet last run on restbase2023 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:46:25] RECOVERY - puppet last run on wtp1033 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:46:55] RECOVERY - puppet last run on mw1337 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:47:01] RECOVERY - puppet last run on mw1395 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:47:01] RECOVERY - puppet last run on mw2270 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:47:03] PROBLEM - puppet last run on mwdebug1001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:47:07] RECOVERY - puppet last run on mw1268 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:47:11] RECOVERY - puppet last run on wtp1036 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:47:13] RECOVERY - puppet last run on mw2307 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:47:17] RECOVERY - puppet last run on restbase2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:47:25] RECOVERY - puppet last run on mw1290 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:47:29] RECOVERY - puppet last run on mw1361 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:47:45] RECOVERY - puppet last run on mw1328 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:47:45] RECOVERY - puppet last run on mw2273 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:47:47] RECOVERY - puppet last run on mw2376 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:47:51] RECOVERY - puppet last run on mw1407 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:47:53] RECOVERY - puppet last run on mw1338 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:47:53] RECOVERY - puppet last run on mw1329 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:47:53] RECOVERY - puppet last run on mw1335 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:47:55] RECOVERY - puppet last run on mw2263 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:47:57] RECOVERY - puppet last run on parse2009 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:48:03] RECOVERY - puppet last run on mw1299 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:48:03] RECOVERY - puppet last run on mw2264 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:48:05] RECOVERY - puppet last run on mw1376 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:48:05] RECOVERY - puppet last run on mw1324 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:48:05] RECOVERY - puppet last run on mw1304 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:48:11] RECOVERY - puppet last run on mw1320 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:48:13] RECOVERY - puppet last run on mw2268 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:48:25] RECOVERY - puppet last run on mw1408 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:48:25] RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:48:25] RECOVERY - puppet last run on mw2272 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:48:25] RECOVERY - puppet last run on mw2261 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:48:29] <_joe_> sorry for the utter spam :/ [08:48:35] RECOVERY - puppet last run on restbase2021 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:48:37] RECOVERY - puppet last run on mw1269 is OK: OK: Puppet is currently enabled, last run 5 minutes ago 
with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:48:37] RECOVERY - puppet last run on mw1298 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:48:37] RECOVERY - puppet last run on mw2216 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:48:41] RECOVERY - puppet last run on parse2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:48:41] RECOVERY - puppet last run on mw2219 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:48:53] RECOVERY - puppet last run on wtp1025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:48:53] RECOVERY - puppet last run on mw1306 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:48:59] RECOVERY - puppet last run on wtp1048 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:49:15] RECOVERY - puppet last run on mw2323 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:49:21] RECOVERY - puppet last run on mw2302 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:49:23] RECOVERY - puppet last run on mw2305 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:49:25] RECOVERY - puppet last run on mw2284 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:49:35] RECOVERY - puppet last run on mw1272 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:49:41] RECOVERY - puppet last run on mw1382 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:49:45] RECOVERY - puppet last run on mw1396 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:49:45] RECOVERY - puppet last run on mw2243 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:49:47] RECOVERY - puppet last run on mw2232 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:49:51] RECOVERY - puppet last run on ores1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:49:55] RECOVERY - puppet last run on restbase2022 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:49:57] RECOVERY - puppet last run on mw1302 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:49:59] RECOVERY - puppet last run on mw2304 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:50:01] RECOVERY - puppet last run on wtp2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:50:36] RECOVERY - puppet last run on mw1355 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:50:38] RECOVERY - puppet last run on mw2260 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:50:42] RECOVERY - puppet last run on mw1336 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:50:46] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:50:48] RECOVERY - puppet last run on ores1009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:50:52] RECOVERY - puppet last run on mw2338 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:50:52] RECOVERY - puppet last run on mw2371 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:50:52] RECOVERY - puppet last run on parse2012 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:50:52] RECOVERY - puppet last run on wtp2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:50:58] RECOVERY - puppet last run on snapshot1008 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:51:02] RECOVERY - puppet last run on mw1281 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:51:06] RECOVERY - puppet last run on mw1401 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:51:10] RECOVERY - puppet last run on restbase-dev1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:51:10] RECOVERY - puppet last run on wtp1037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:51:14] RECOVERY - puppet last run on snapshot1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:51:20] RECOVERY - puppet last run on mw2359 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:51:20] RECOVERY - puppet last run on restbase2009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:51:20] RECOVERY - puppet last run on mw1393 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:51:24] RECOVERY - puppet last run on wtp1041 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:51:26] (03CR) 10Kormat: [C: 03+1] filtered_tables.txt: Update columns after MCR changes [puppet] - 10https://gerrit.wikimedia.org/r/629621 (https://phabricator.wikimedia.org/T238966) (owner: 10Marostegui) [08:51:48] RECOVERY - puppet last run on parse2016 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:51:54] RECOVERY - puppet last run on labweb1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:51:54] RECOVERY - puppet last run on mw1409 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:51:54] RECOVERY - puppet last run on mw1342 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:51:54] RECOVERY - puppet last run on wtp1045 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:52:02] RECOVERY - puppet last run on wtp1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:52:02] RECOVERY - puppet last run on wtp1047 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:52:02] RECOVERY - puppet last run on mwdebug1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:52:08] RECOVERY - puppet last run on mw1343 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:52:08] RECOVERY - puppet last run on mw1261 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:52:12] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:52:16] RECOVERY - puppet last run on restbase2014 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:52:24] RECOVERY - puppet last run on wtp1043 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:52:36] RECOVERY - puppet last run on mw1367 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:52:38] RECOVERY - puppet last run on mw1317 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:52:46] RECOVERY - puppet last run on mw1334 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:53:06] RECOVERY - puppet last run on mw1383 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:53:06] RECOVERY - puppet last run on mw1362 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:53:08] RECOVERY - puppet last run on mw1270 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:53:08] RECOVERY - puppet last run on mw1347 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:53:16] RECOVERY - puppet last run on mwdebug2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:53:16] RECOVERY - puppet last run on mw2239 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:53:38] RECOVERY - puppet last run on mw2300 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:53:40] RECOVERY - puppet last run on mw2298 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:54:30] RECOVERY - puppet last run on mw1372 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:55:22] (03PS2) 10JMeybohm: services_proxy: switch zotero to the TLS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/629337 (https://phabricator.wikimedia.org/T255869) [08:55:24] (03PS2) 10JMeybohm: lvs: Remove zotero non-TLS endpoint 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/629338 (https://phabricator.wikimedia.org/T255869) [08:55:26] (03PS2) 10JMeybohm: lvs: Remove zotero non-TLS endpoint 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/629339 (https://phabricator.wikimedia.org/T255869) [08:55:48] RECOVERY - puppet last run on mw1266 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:56:52] RECOVERY - puppet last run on 
mw1398 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:59:31] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [08:59:31] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [08:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:14] RECOVERY - puppet last run on mw2301 is OK: OK: Puppet is currently enabled, last run 13 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:02:30] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Reorganize remote backups (snapshots) for speedup [puppet] - 10https://gerrit.wikimedia.org/r/629610 (https://phabricator.wikimedia.org/T257551) (owner: 10Jcrespo) [09:03:22] RECOVERY - puppet last run on mw1264 is OK: OK: Puppet is currently enabled, last run 13 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:04:59] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [09:04:59] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [09:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:11] uhh [09:05:28] 10Operations, 10Traffic: restrict upload cache access for private wikis - https://phabricator.wikimedia.org/T129839 (10fgiunchedi) Yes this is still valid IMHO despite the lack of activity. Specifically as a defense in depth measure, swift ACLs being the primary line of defense. 
[09:05:38] RECOVERY - puppet last run on mw2321 is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:06:02] hnowlan: you sound like someone confident in what they have just done ;) [09:06:25] I'm confident in at least 50% of it [09:06:43] :D [09:07:17] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) [09:08:24] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:10:02] RECOVERY - puppet last run on mw2224 is OK: OK: Puppet is currently enabled, last run 23 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:11:00] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:11:34] PROBLEM - Host elastic2037 is DOWN: PING CRITICAL - Packet loss = 100% [09:12:17] (03PS1) 10Ema: cp4021: upgrade to Varnish 6 (cache_upload) [puppet] - 10https://gerrit.wikimedia.org/r/629634 (https://phabricator.wikimedia.org/T263557) [09:13:10] (03CR) 10Ema: [C: 03+2] cp4021: upgrade to Varnish 6 (cache_upload) [puppet] - 10https://gerrit.wikimedia.org/r/629634 (https://phabricator.wikimedia.org/T263557) (owner: 10Ema) [09:14:22] !log cp4021: depool and upgrade varnish to 6.0.6-1wm1 T263557 [09:14:24] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:14:25] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:26] RECOVERY - puppet last run on mw2276 is OK: OK: Puppet is currently enabled, last run 26 minutes ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:14:27] T263557: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557 [09:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:46] !log kormat@cumin1001 dbctl commit (dc=all): 'db2138:3312 depooling: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12785 and previous config saved to /var/cache/conftool/dbconfig/20200924-091445-kormat.json [09:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:51] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [09:16:34] RECOVERY - puppet last run on mw2237 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:19:41] !log cp4021: redepool with varnish to 6.0.6-1wm1 T263557 [09:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:46] T263557: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557 [09:20:10] that wasn't the best !log message of all times [09:20:31] top 5 though, i'm sure [09:20:42] !log cp4021: repool with varnish 6.0.6-1wm1 T263557 [09:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:05] kormat: <3 [09:25:47] RECOVERY - puppet last run on wtp1034 is OK: OK: Puppet is currently enabled, last run 7 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:25:54] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [09:25:54] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . 
[09:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:29] (03CR) 10Jbond: "lgtm see comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/628972 (owner: 10Dzahn) [09:35:15] !log kormat@cumin1001 dbctl commit (dc=all): 'db2138:3312 (re)pooling @ 25%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12786 and previous config saved to /var/cache/conftool/dbconfig/20200924-093514-kormat.json [09:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:20] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [09:36:51] (03CR) 10Jbond: [C: 03+1] diffscan: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/629427 (owner: 10Dzahn) [09:36:59] (03CR) 10Jbond: [C: 03+1] "will merge" [puppet] - 10https://gerrit.wikimedia.org/r/629427 (owner: 10Dzahn) [09:37:01] (03CR) 10Jbond: [C: 03+2] diffscan: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/629427 (owner: 10Dzahn) [09:38:28] 10Operations, 10serviceops, 10Platform Team Initiatives (API Gateway): Separate mediawiki latency metrics by endpoint - https://phabricator.wikimedia.org/T263727 (10Joe) [09:39:10] (03CR) 10Jbond: [C: 03+1] "lgtm" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/629620 (owner: 10Muehlenhoff) [09:41:50] (03CR) 10JMeybohm: [C: 03+2] service: add TLS endpoint for mathoid 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/629325 (https://phabricator.wikimedia.org/T255875) (owner: 10JMeybohm) [09:42:58] 10Operations, 10observability, 10serviceops, 10Platform Team Initiatives (API Gateway): mtail 3.0.0-rc35 doesn't support the histogram type in -oneshot mode. 
- https://phabricator.wikimedia.org/T263728 (10Joe) [09:43:40] !log running puppet on lvs servers - T255875 [09:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:45] T255875: Move mathoid to use TLS only - https://phabricator.wikimedia.org/T255875 [09:44:04] 10Operations, 10serviceops, 10Platform Team Initiatives (API Gateway): Separate mediawiki latency metrics by endpoint - https://phabricator.wikimedia.org/T263727 (10Joe) [09:46:28] !log restart pybal on lvs1016.eqiad.wmnet,lvs2010.codfw.wmnet - T255875 [09:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:19] 10Operations, 10MediaWiki-REST-API, 10Traffic: Route requests to the REST MediaWiki API to the api cluster - https://phabricator.wikimedia.org/T263729 (10Joe) [09:47:44] 10Puppet: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) Thanks for the further investigation. So it looks like: * it is happening on both servers * it is worse on puppetdb2002 * it happens sporadically during the day and is not just an artifact of re-imaging... [09:48:15] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.20:4001]) https://wikitech.wikimedia.org/wiki/PyBal [09:48:15] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.20:4001]) https://wikitech.wikimedia.org/wiki/PyBal [09:48:18] this is me [09:48:58] !log restart pybal on lvs1015.eqiad.wmnet,lvs2009.codfw.wmnet - T255875 [09:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:03] T255875: Move mathoid to use TLS only - https://phabricator.wikimedia.org/T255875 [09:49:06] jayme: at this point i assume all pybal issues are you. it just saves time.
:) [09:49:17] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:50:02] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [09:50:02] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [09:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:18] !log kormat@cumin1001 dbctl commit (dc=all): 'db2138:3312 (re)pooling @ 50%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12787 and previous config saved to /var/cache/conftool/dbconfig/20200924-095018-kormat.json [09:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:23] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [09:50:30] kormat: fine, but I hear in general that's your fault as well then. So I'll just act as a blaming proxy :) [09:50:56] hahah. well played. 
[09:50:59] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] debdeploy: Skip restart detection if no library base names are specified [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/629620 (owner: 10Muehlenhoff) [09:51:11] (03Abandoned) 10Filippo Giunchedi: prometheus: rename mtail appserver handlers [puppet] - 10https://gerrit.wikimedia.org/r/531666 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [09:52:04] will add a blame-forwarded-for header :) [09:52:12] :D [09:53:14] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [09:53:14] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [09:54:58] (03PS3) 10JMeybohm: service: add TLS endpoint for mathoid 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/629326 (https://phabricator.wikimedia.org/T255875) [09:56:51] (03CR) 10JMeybohm: [C: 03+2] service: add TLS endpoint for mathoid 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/629326 (https://phabricator.wikimedia.org/T255875) (owner: 10JMeybohm) [10:00:04] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200924T1000). [10:00:49] (03PS12) 10Muehlenhoff: Manage /etc/apt/sources.list via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/626693 (https://phabricator.wikimedia.org/T156562) [10:01:50] (03CR) 10Abijeet Patro: "This change is ready for review." 
[extensions/Translate] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/629640 (https://phabricator.wikimedia.org/T263546) (owner: 10Abijeet Patro) [10:01:52] (03CR) 10Muehlenhoff: [C: 03+1] "Sounds good, let's give it a shot" [puppet] - 10https://gerrit.wikimedia.org/r/629442 (https://phabricator.wikimedia.org/T242991) (owner: 10Jbond) [10:01:59] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [10:01:59] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [10:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:11] (03CR) 10Kormat: [C: 04-1] pontoon: use Python API for enroll (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/629378 (owner: 10Filippo Giunchedi) [10:05:22] !log kormat@cumin1001 dbctl commit (dc=all): 'db2138:3312 (re)pooling @ 75%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12788 and previous config saved to /var/cache/conftool/dbconfig/20200924-100521-kormat.json [10:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:27] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [10:06:06] (03PS1) 10Ema: cache: upgrade Varnish to v6 in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/629642 (https://phabricator.wikimedia.org/T263557) [10:07:43] (03PS2) 10Ema: cache: upgrade Varnish to v6 in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/629642 (https://phabricator.wikimedia.org/T263557) [10:08:03] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/629642 (https://phabricator.wikimedia.org/T263557) (owner: 10Ema) [10:08:07] (03CR) 10jerkins-bot: [V: 04-1] cache: upgrade Varnish to v6 in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/629642 
(https://phabricator.wikimedia.org/T263557) (owner: 10Ema) [10:08:49] (03PS1) 10Jbond: puppetdb: small refactor update types [puppet] - 10https://gerrit.wikimedia.org/r/629643 [10:08:54] (03PS3) 10Ema: cache: upgrade Varnish to v6 in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/629642 (https://phabricator.wikimedia.org/T263557) [10:09:12] (03CR) 10jerkins-bot: [V: 04-1] puppetdb: small refactor update types [puppet] - 10https://gerrit.wikimedia.org/r/629643 (owner: 10Jbond) [10:10:23] 10Operations, 10netops: Consider balancing VRRP primaries to cr1/cr2 - https://phabricator.wikimedia.org/T263212 (10ayounsi) This is now pushed to eqiad and codfw. Result can be seen on: https://librenms.wikimedia.org/graphs/id=16333/type=port_bits/ and https://librenms.wikimedia.org/graphs/id=16552/type=port... [10:10:35] (03PS2) 10Jbond: puppetdb: small refactor update types [puppet] - 10https://gerrit.wikimedia.org/r/629643 [10:11:11] 10Operations, 10ops-codfw: elastic2037 DIMM errors logged in racadm getsel - https://phabricator.wikimedia.org/T263714 (10elukey) The host went down again, I think that we'd need to replace those DIMMs :( @Papaul what do you think? [10:12:36] 10Operations, 10ops-codfw: elastic2037 DIMM errors logged in racadm getsel - https://phabricator.wikimedia.org/T263714 (10elukey) Last logs in `racadm getsel`: ` ------------------------------------------------------------------------------- Record: 11 Date/Time: 09/24/2020 09:10:29 Source: system... [10:12:37] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:13:07] (03PS1) 10Klausman: Drop MemorySwapMax=0 from analytics puppet roles [puppet] - 10https://gerrit.wikimedia.org/r/629645 [10:13:16] ACKNOWLEDGEMENT - Host elastic2037 is DOWN: PING CRITICAL - Packet loss = 100% Elukey T263714 [10:14:08] (03CR) 10jerkins-bot: [V: 04-1] Drop MemorySwapMax=0 from analytics puppet roles [puppet] - 10https://gerrit.wikimedia.org/r/629645 (owner: 10Klausman) [10:15:06] (03PS2) 10Klausman: Drop MemorySwapMax=0 from analytics puppet roles [puppet] - 10https://gerrit.wikimedia.org/r/629645 [10:15:08] (03PS1) 10JMeybohm: monitor_services: switch citoid monitor to https [puppet] - 10https://gerrit.wikimedia.org/r/629646 (https://phabricator.wikimedia.org/T255868) [10:15:26] (03PS3) 10Jbond: puppetdb: small refactor update types [puppet] - 10https://gerrit.wikimedia.org/r/629643 [10:15:57] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:30] (03CR) 10Kormat: [C: 03+1] "LGTM (but i haven't tested it)" [puppet] - 10https://gerrit.wikimedia.org/r/629379 (owner: 10Filippo Giunchedi) [10:16:37] (03CR) 10Jbond: [C: 03+2] puppetdb: small refactor update types [puppet] - 10https://gerrit.wikimedia.org/r/629643 (owner: 10Jbond) [10:16:57] (03PS3) 10Klausman: Drop MemorySwapMax=0 from analytics puppet roles [puppet] - 10https://gerrit.wikimedia.org/r/629645 [10:17:05] (03CR) 10JMeybohm: [C: 03+2] monitor_services: switch citoid monitor to https [puppet] - 10https://gerrit.wikimedia.org/r/629646 (https://phabricator.wikimedia.org/T255868) (owner: 10JMeybohm) [10:17:08] (03CR) 10Elukey: [C: 03+1] Drop MemorySwapMax=0 from analytics puppet roles [puppet] - 10https://gerrit.wikimedia.org/r/629645 (owner: 10Klausman) [10:18:54] (03CR) 10Klausman: [C: 03+2] Drop MemorySwapMax=0 from analytics puppet roles [puppet] - 10https://gerrit.wikimedia.org/r/629645 (owner: 
10Klausman) [10:20:25] !log kormat@cumin1001 dbctl commit (dc=all): 'db2138:3312 (re)pooling @ 100%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12789 and previous config saved to /var/cache/conftool/dbconfig/20200924-102025-kormat.json [10:20:27] (03CR) 10Jbond: "Noticed this as a conflict to another change, pretty sure this can go, i have removed puppetdb_major_version" [puppet] - 10https://gerrit.wikimedia.org/r/427928 (https://phabricator.wikimedia.org/T190318) (owner: 10Herron) [10:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:31] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [10:21:43] arturo: hey, my change just now (removing MemorySwapMax=0) might be of interest to you. toolforge/bastion-user-resource-control.conf has the same setting. Turns out, at least on Buster, it does not work. systemd complains with "Memory limit '0' out of range. Ignoring." [10:23:24] klausman: we don't use swap on VMs, but thanks! 
[10:23:36] or we shouldn't use, anyway [10:23:37] !log uploaded python3-wmflib_0.0.2 to apt.wikimedia.org buster-wikimedia [10:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:46] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)): scap configuration in puppet defaults to forge the git repo name with 'mediawiki/services/' - https://phabricator.wikimedia.org/T257413 (10hashar) [10:24:00] (03CR) 10Muehlenhoff: [C: 03+2] Manage /etc/apt/sources.list via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/626693 (https://phabricator.wikimedia.org/T156562) (owner: 10Muehlenhoff) [10:24:24] klausman: oh, yes, in the toolforge bastion we use swap indeed [10:28:40] (03CR) 10Muehlenhoff: [C: 03+2] Retire stub firejail code in service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/622350 (owner: 10Muehlenhoff) [10:32:17] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/629642 (https://phabricator.wikimedia.org/T263557) (owner: 10Ema) [10:33:33] (03PS1) 10Elukey: WIP profile::hadoop::common: test regex [puppet] - 10https://gerrit.wikimedia.org/r/629647 [10:39:15] (03PS2) 10Elukey: WIP profile::hadoop::common: test regex [puppet] - 10https://gerrit.wikimedia.org/r/629647 [10:40:17] (03CR) 10jerkins-bot: [V: 04-1] WIP profile::hadoop::common: test regex [puppet] - 10https://gerrit.wikimedia.org/r/629647 (owner: 10Elukey) [10:42:37] (03PS1) 10Jbond: puppetdb::app: Add ability to change log level [puppet] - 10https://gerrit.wikimedia.org/r/629649 (https://phabricator.wikimedia.org/T263578) [10:44:07] (03PS3) 10Elukey: WIP profile::hadoop::common: test regex [puppet] - 10https://gerrit.wikimedia.org/r/629647 [10:45:07] (03CR) 10jerkins-bot: [V: 04-1] WIP profile::hadoop::common: test regex [puppet] - 10https://gerrit.wikimedia.org/r/629647 (owner: 10Elukey) [10:45:09] (03PS2) 10Jbond: puppetdb::app: Add 
ability to change log level [puppet] - 10https://gerrit.wikimedia.org/r/629649 (https://phabricator.wikimedia.org/T263578) [10:45:20] (03CR) 10Jbond: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/25375/" [puppet] - 10https://gerrit.wikimedia.org/r/629649 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [10:45:25] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog (Kanban): Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939 (10MSantos) [10:45:32] (03PS1) 10Jbond: puppetdb: enable debug logging for puppetdb2002 [puppet] - 10https://gerrit.wikimedia.org/r/629651 (https://phabricator.wikimedia.org/T263578) [10:46:25] (03PS1) 10Ayounsi: Add damping on Anycast BGP sessions [homer/public] - 10https://gerrit.wikimedia.org/r/629652 (https://phabricator.wikimedia.org/T262372) [10:47:38] (03PS4) 10Elukey: WIP profile::hadoop::common: test regex [puppet] - 10https://gerrit.wikimedia.org/r/629647 [10:48:38] (03CR) 10jerkins-bot: [V: 04-1] WIP profile::hadoop::common: test regex [puppet] - 10https://gerrit.wikimedia.org/r/629647 (owner: 10Elukey) [10:48:42] (03PS3) 10Jbond: puppetdb::app: Add ability to change log level [puppet] - 10https://gerrit.wikimedia.org/r/629649 (https://phabricator.wikimedia.org/T263578) [10:48:52] (03PS2) 10Jbond: puppetdb: enable debug logging for puppetdb2002 [puppet] - 10https://gerrit.wikimedia.org/r/629651 (https://phabricator.wikimedia.org/T263578) [10:48:54] (03PS5) 10Giuseppe Lavagetto: services: retire the ORES http endpoint (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/628802 (https://phabricator.wikimedia.org/T244843) [10:48:56] (03PS5) 10Giuseppe Lavagetto: services: retire the ORES http endpoint (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/628803 (https://phabricator.wikimedia.org/T244843) [10:48:58] (03PS1) 10Giuseppe Lavagetto: mtail: convert mediawiki to use a real histogram [puppet] - 10https://gerrit.wikimedia.org/r/629653 
(https://phabricator.wikimedia.org/T263727) [10:49:45] !log installing libproxy security updates [10:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:15] (03CR) 10Jbond: [C: 03+2] puppetdb::app: Add ability to change log level [puppet] - 10https://gerrit.wikimedia.org/r/629649 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [10:50:17] (03CR) 10jerkins-bot: [V: 04-1] mtail: convert mediawiki to use a real histogram [puppet] - 10https://gerrit.wikimedia.org/r/629653 (https://phabricator.wikimedia.org/T263727) (owner: 10Giuseppe Lavagetto) [10:50:19] (03CR) 10Jbond: [C: 03+2] puppetdb: enable debug logging for puppetdb2002 [puppet] - 10https://gerrit.wikimedia.org/r/629651 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [10:50:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:51:38] !log disable puppet fleet wide to deploy a puppetmaster change [10:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:51] (03PS1) 10Muehlenhoff: Add library hint for libproxy [puppet] - 10https://gerrit.wikimedia.org/r/629654 [10:51:58] (03CR) 10Jbond: [C: 03+2] puppetmaster: update web site to use strong ssl ciphers [puppet] - 10https://gerrit.wikimedia.org/r/629442 (https://phabricator.wikimedia.org/T242991) (owner: 10Jbond) [10:53:37] (03PS5) 10Elukey: WIP profile::hadoop::common: test regex [puppet] - 10https://gerrit.wikimedia.org/r/629647 [10:54:40] (03CR) 10jerkins-bot: [V: 04-1] WIP profile::hadoop::common: test regex [puppet] - 10https://gerrit.wikimedia.org/r/629647 (owner: 10Elukey) [10:57:48] PROBLEM - Blazegraph process -wdqs-categories- on wdqs1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:58:18] PROBLEM - Blazegraph Port for wdqs-categories on wdqs1009 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:58:34] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:59:49] (03PS2) 10Giuseppe Lavagetto: mtail: convert mediawiki to use a real histogram [puppet] - 10https://gerrit.wikimedia.org/r/629653 (https://phabricator.wikimedia.org/T263727) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200924T1100). [11:00:04] MatmaRex, kart_, and abijeet: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:10] * kart_ is here. 
[11:00:26] hello [11:00:48] hello [11:00:52] (03CR) 10jerkins-bot: [V: 04-1] mtail: convert mediawiki to use a real histogram [puppet] - 10https://gerrit.wikimedia.org/r/629653 (https://phabricator.wikimedia.org/T263727) (owner: 10Giuseppe Lavagetto) [11:01:08] I can deploy today [11:01:15] Urbanecm: thanks :) [11:01:56] 10Puppet, 10Patch-For-Review: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) I have enabled debug logging on puppetdb2002 [11:02:09] !log re-enable puppet fleet wide [11:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:51] (03PS2) 10Urbanecm: Simplify DiscussionTools config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629469 (owner: 10Bartosz Dziewoński) [11:03:59] (03CR) 10Nikerabbit: [C: 03+1] Fix validation of translation unit section names [extensions/Translate] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/629640 (https://phabricator.wikimedia.org/T263546) (owner: 10Abijeet Patro) [11:04:01] (03CR) 10Urbanecm: [C: 03+2] Simplify DiscussionTools config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629469 (owner: 10Bartosz Dziewoński) [11:04:51] (03Merged) 10jenkins-bot: Simplify DiscussionTools config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629469 (owner: 10Bartosz Dziewoński) [11:05:05] (03CR) 10Urbanecm: [C: 03+2] Fix validation of translation unit section names [extensions/Translate] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/629640 (https://phabricator.wikimedia.org/T263546) (owner: 10Abijeet Patro) [11:05:07] 10Operations, 10RESTBase, 10RESTBase-API, 10Traffic, 10Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178 (10Physikerwelt) From the math perspective, the change to the new MW Rest API is already implemented but not yet reviewed. Thereafter, restbas... 
[11:05:40] (03PS2) 10Urbanecm: Move DiscussionTools out of beta on arwiki, cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629470 (https://phabricator.wikimedia.org/T249394) (owner: 10Bartosz Dziewoński) [11:05:50] (03CR) 10Urbanecm: [C: 03+2] Move DiscussionTools out of beta on arwiki, cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629470 (https://phabricator.wikimedia.org/T249394) (owner: 10Bartosz Dziewoński) [11:06:36] (03Merged) 10jenkins-bot: Move DiscussionTools out of beta on arwiki, cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629470 (https://phabricator.wikimedia.org/T249394) (owner: 10Bartosz Dziewoński) [11:06:40] !log installing imagemagick security updates on stretch [11:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:40] (03PS1) 10Effie Mouzeli: push-notifications: enable service proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/629656 (https://phabricator.wikimedia.org/T260247) [11:08:57] MatmaRex: your two patches are at mwdebug2001 [11:09:48] (03CR) 10jerkins-bot: [V: 04-1] push-notifications: enable service proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/629656 (https://phabricator.wikimedia.org/T260247) (owner: 10Effie Mouzeli) [11:10:08] thanks, looking [11:11:30] (03PS3) 10Urbanecm: Enable ContentTranslation in Bashkir, Urdu and Welsh WPs as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629616 (https://phabricator.wikimedia.org/T258504) (owner: 10KartikMistry) [11:11:31] Urbanecm: looks good [11:11:40] thanks, syncing [11:12:24] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:12:58] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:14:43] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 90c72912f26d91df6d28b1efd64e366aaabc5357: Move DiscussionTools out of beta on arwiki, cswiki, huwiki (T249394); d8553f35b4dd581f67bd568d773ff65f316fbfd3: Simplify DiscussionTools config (duration: 01m 11s) [11:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:48] T249394: Deploy Replying as opt-out Preference at the Arabic, Czech and Hungarian Wikipedias - https://phabricator.wikimedia.org/T249394 [11:15:15] (03CR) 10Urbanecm: [C: 03+2] Enable ContentTranslation in Bashkir, Urdu and Welsh WPs as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629616 (https://phabricator.wikimedia.org/T258504) (owner: 10KartikMistry) [11:15:47] 10Operations, 10Traffic, 10netops, 10Epic: Capacity planning for (& optimization of) transport backhaul vs edge egress - https://phabricator.wikimedia.org/T263275 (10ayounsi) [11:16:06] (03Merged) 10jenkins-bot: Enable ContentTranslation in Bashkir, Urdu and Welsh WPs as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629616 (https://phabricator.wikimedia.org/T258504) (owner: 10KartikMistry) [11:16:18] (03PS2) 10Muehlenhoff: Add library hint for libproxy [puppet] - 10https://gerrit.wikimedia.org/r/629654 [11:17:26] kart_: your patch is at mwdebug2001, can you test? [11:17:57] Urbanecm: testing.. [11:19:54] Urbanecm: all well. Please go ahead.. 
[11:20:00] 10Operations, 10Traffic: Analyze the impact of removing TLSv1/v1.1 on puppetmasters - https://phabricator.wikimedia.org/T242991 (10jbond) 05Open→03Resolved a:03jbond This has been deployed and every thing looks good, closing, please re open if you see any issues [11:20:02] 10Operations, 10Traffic: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 (10jbond) [11:20:03] syncing, thanks [11:20:22] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:00] !log disable puppet fleet wide to reduce log level on puppetdb [11:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:52] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:22:33] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: fdab74c443bc3328856e8441f4d2df8bc57c6f54: Enable ContentTranslation in Bashkir, Urdu and Welsh WPs as a default tool (T258504; T260022; T260024) (duration: 01m 05s) [11:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:41] T260024: Enable Content Translation in Urdu Wikipedia as a default tool - https://phabricator.wikimedia.org/T260024 [11:22:41] T260022: Enable Content Translation in Welsh Wikipedia as a default tool - https://phabricator.wikimedia.org/T260022 [11:22:41] T258504: Enable Content Translation in Bashkir Wikipedia as a default tool - https://phabricator.wikimedia.org/T258504 [11:22:54] (03PS2) 10Effie Mouzeli: push-notifications: enable service proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/629656 (https://phabricator.wikimedia.org/T260247) [11:23:22] (03Merged) 10jenkins-bot: Fix validation of translation unit section names 
[extensions/Translate] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/629640 (https://phabricator.wikimedia.org/T263546) (owner: 10Abijeet Patro) [11:23:26] (03PS1) 10Jbond: puppetdb reduce log level [puppet] - 10https://gerrit.wikimedia.org/r/629657 [11:23:43] Thank you, Urbanecm! [11:23:48] no problem [11:23:55] (03CR) 10Jbond: [C: 03+2] puppetdb reduce log level [puppet] - 10https://gerrit.wikimedia.org/r/629657 (owner: 10Jbond) [11:24:38] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for libproxy [puppet] - 10https://gerrit.wikimedia.org/r/629654 (owner: 10Muehlenhoff) [11:25:49] !log re-enable puppet fleet wide [11:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:03] abijeet: hello, can you test your patch at mwdebug2001 please? [11:26:41] Urbanecm, on it. [11:30:20] Urbanecm, everything looks good, please go ahead [11:30:27] syncing, thanks [11:30:55] Urbanecm, thank you for your help. [11:31:12] (03PS1) 10Hnowlan: changeprop/changeprop-jobqueue: increase memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/629658 [11:33:28] no problem abijeet :) [11:34:21] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.10/extensions/Translate/tag/TPSection.php: fa4900e1e6022e645be12505de30b696a9769e77: Fix validation of translation unit section names (T263546) (duration: 01m 07s) [11:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:27] T263546: Errors: Translation unit name must not contain underscore or slash - https://phabricator.wikimedia.org/T263546 [11:34:32] abijeet: done :) [11:35:20] (03CR) 10Filippo Giunchedi: pontoon: use Python API for enroll (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/629378 (owner: 10Filippo Giunchedi) [11:35:37] (03PS3) 10Filippo Giunchedi: pontoon: use Python API for enroll [puppet] - 10https://gerrit.wikimedia.org/r/629378 [11:35:39] (03PS3) 10Filippo Giunchedi: pontoon: cleanup/update ENC [puppet] - 
10https://gerrit.wikimedia.org/r/629379 [11:40:19] Urbanecm, works as expected. :) [11:40:22] !log EU B&C window done [11:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:28] abijeet: excellent' [11:42:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:44:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:48:26] (03PS2) 10Lucas Werkmeister (WMDE): Remove unused $wgExtraLanguageNames['qqq'] assignment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628773 (https://phabricator.wikimedia.org/T263441) [11:48:27] (03PS2) 10Lucas Werkmeister (WMDE): Stop using $wmgExtraLanguageNames in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628774 (https://phabricator.wikimedia.org/T263441) [11:48:29] (03PS2) 10Lucas Werkmeister (WMDE): Remove $wmgExtraLanguageNames from InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628775 (https://phabricator.wikimedia.org/T263441) [11:48:31] (03PS3) 10Lucas Werkmeister (WMDE): Remove $wmgExtraLanguageNames from InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628776 (https://phabricator.wikimedia.org/T263441) [11:54:04] (03PS2) 10Hnowlan: changeprop/changeprop-jobqueue: increase memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/629658 [11:55:50] (03PS1) 10Elukey: profile::hadoop::common: use only TLS puppet certs [puppet] - 10https://gerrit.wikimedia.org/r/629663 (https://phabricator.wikimedia.org/T253957) [11:56:15] (03CR) 10jerkins-bot: [V: 04-1] changeprop/changeprop-jobqueue: increase 
memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/629658 (owner: 10Hnowlan) [11:57:35] (03CR) 10Ema: [C: 03+2] cache: upgrade Varnish to v6 in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/629642 (https://phabricator.wikimedia.org/T263557) (owner: 10Ema) [11:58:14] (03CR) 10Kormat: pontoon: use Python API for enroll (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/629378 (owner: 10Filippo Giunchedi) [12:02:22] !log cp4022: upgrade varnish to 6.0.6-1wm1 T263557 [12:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:28] T263557: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557 [12:03:23] (03CR) 10Elukey: [C: 03+2] profile::hadoop::common: use only TLS puppet certs [puppet] - 10https://gerrit.wikimedia.org/r/629663 (https://phabricator.wikimedia.org/T253957) (owner: 10Elukey) [12:04:25] (03PS4) 10Filippo Giunchedi: pontoon: use Python API for enroll [puppet] - 10https://gerrit.wikimedia.org/r/629378 [12:04:27] (03PS4) 10Filippo Giunchedi: pontoon: cleanup/update ENC [puppet] - 10https://gerrit.wikimedia.org/r/629379 [12:04:29] (03CR) 10Filippo Giunchedi: pontoon: use Python API for enroll (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629378 (owner: 10Filippo Giunchedi) [12:04:31] (03CR) 10jerkins-bot: [V: 04-1] pontoon: use Python API for enroll [puppet] - 10https://gerrit.wikimedia.org/r/629378 (owner: 10Filippo Giunchedi) [12:04:59] (03CR) 10Kormat: [C: 03+1] "LGTM :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629378 (owner: 10Filippo Giunchedi) [12:05:02] 10Puppet, 10Patch-For-Review: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) Just a note that there is a file `puppetdb2002.codwfw.wmnet:~jbond/debug.log` which contains debug information for the period of the following ` lang=shell 2020-09-24T11:22:58.120Z IN... 
[12:06:27] (03PS3) 10JMeybohm: services_proxy: switch mathoid to the TLS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/629327 (https://phabricator.wikimedia.org/T255875) [12:08:57] (03PS5) 10Filippo Giunchedi: pontoon: use Python API for enroll [puppet] - 10https://gerrit.wikimedia.org/r/629378 [12:08:59] (03PS5) 10Filippo Giunchedi: pontoon: cleanup/update ENC [puppet] - 10https://gerrit.wikimedia.org/r/629379 [12:09:10] !log text@ulsfo: rolling varnish upgrade to 6.0.6-1wm1 T263557 [12:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:16] T263557: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557 [12:09:16] (03CR) 10jerkins-bot: [V: 04-1] pontoon: use Python API for enroll [puppet] - 10https://gerrit.wikimedia.org/r/629378 (owner: 10Filippo Giunchedi) [12:10:15] (03PS6) 10Filippo Giunchedi: pontoon: use Python API for enroll [puppet] - 10https://gerrit.wikimedia.org/r/629378 [12:10:17] (03PS6) 10Filippo Giunchedi: pontoon: cleanup/update ENC [puppet] - 10https://gerrit.wikimedia.org/r/629379 [12:11:07] (03CR) 10Marostegui: [C: 03+2] filtered_tables.txt: Update columns after MCR changes [puppet] - 10https://gerrit.wikimedia.org/r/629621 (https://phabricator.wikimedia.org/T238966) (owner: 10Marostegui) [12:14:15] (03CR) 10Kormat: [C: 03+1] pontoon: use Python API for enroll [puppet] - 10https://gerrit.wikimedia.org/r/629378 (owner: 10Filippo Giunchedi) [12:18:20] (03PS9) 10Hashar: Explicitly mentions the repository in scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) [12:19:01] (03CR) 10Hashar: "Rebased for trivial conflict with https://gerrit.wikimedia.org/r/c/operations/puppet/+/629094/6/hieradata/cloud/eqiad1/deployment-prep/com" [puppet] - 10https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) (owner: 10Hashar) [12:19:06] (03PS3) 10Effie Mouzeli: push-notifications: enable service proxy 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/629656 (https://phabricator.wikimedia.org/T260247) [12:19:17] (03PS7) 10Hashar: scap::sources stop assuming mediawiki/services as a prefix [puppet] - 10https://gerrit.wikimedia.org/r/610267 (https://phabricator.wikimedia.org/T257413) [12:19:38] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: cleanup/update ENC [puppet] - 10https://gerrit.wikimedia.org/r/629379 (owner: 10Filippo Giunchedi) [12:19:42] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: use Python API for enroll [puppet] - 10https://gerrit.wikimedia.org/r/629378 (owner: 10Filippo Giunchedi) [12:19:47] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) (owner: 10Hashar) [12:19:49] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2012.codfw.wmnet - https://phabricator.wikimedia.org/T263613 (10Marostegui) a:05Marostegui→03Papaul [12:19:51] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/610267 (https://phabricator.wikimedia.org/T257413) (owner: 10Hashar) [12:23:51] 10Operations, 10netops: Consider balancing VRRP primaries to cr1/cr2 - https://phabricator.wikimedia.org/T263212 (10BBlack) >>! In T263212#6490669, @ayounsi wrote: > Ideally we would take the links state into consideration: If the twin link is down alert at 80%, if it's up alert when the sum is at 80% of the i... [12:25:34] !log installing xorg-server security updates [12:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:18] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Papaul) I will be on site today the only thing i need for now is depool the server and power it down if it is not done yet. Thanks. 
[12:26:28] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:34] Hi _joe_ is it OK to push the latest version of wikifeeds to production? I am wondering if there are still things in flight for https://phabricator.wikimedia.org/T263043 that might cause a regression. [12:27:12] !log powering off db2125 for maintenance T260670 [12:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:17] T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 [12:27:25] <_joe_> nemo-yiannis: not AFAIK. I'm waiting on the changes before reintroducing the thing that caused the regression [12:27:41] cool, thanks! [12:28:00] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:57] !log swift codfw-prod: rebalance only, no weight change [12:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:39] !log upload@ulsfo: rolling varnish upgrade to 6.0.6-1wm1 T263557 [12:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:44] T263557: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557 [12:32:55] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Kormat) @Papaul : server is depooled and powered down now. 
Cheers :) [12:36:33] 10Operations, 10Traffic, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10ayounsi) Some thoughts/idea: * enable IPFIX on all/most of the routers interfaces In the current state of our setup this means double/triple accounting a flow as packets cross interface... [12:37:43] 10Operations, 10ops-codfw: elastic2037 DIMM errors logged in racadm getsel - https://phabricator.wikimedia.org/T263714 (10Papaul) @elukey I think we should replace the DIMM since this is the third time the server crashed (T263588) with the same error. Thanks will request a DIMM from Dell. [12:39:23] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Papaul) @Kormat thanks [12:42:28] (03PS1) 10Kormat: isort: Ignore debian/ dir [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/629673 [12:42:35] !log jgiannelos@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[12:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:21] !log installing netty-3.9 security updates [12:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:28] !log restarting wdqs-categories on wdqs1009 [12:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:42] RECOVERY - Blazegraph Port for wdqs-categories on wdqs1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:43:42] RECOVERY - Blazegraph process -wdqs-categories- on wdqs1009 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:43:44] (03CR) 10Kormat: [C: 03+2] isort: Ignore debian/ dir [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/629673 (owner: 10Kormat) [12:44:45] (03Merged) 10jenkins-bot: isort: Ignore debian/ dir [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/629673 (owner: 10Kormat) [12:44:46] (03CR) 10BBlack: [C: 03+1] Add damping on Anycast BGP sessions [homer/public] - 10https://gerrit.wikimedia.org/r/629652 (https://phabricator.wikimedia.org/T262372) (owner: 10Ayounsi) [12:46:18] (03CR) 10Hashar: "Puppet compiler https://puppet-compiler.wmflabs.org/compiler1001/564/deploy1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) (owner: 10Hashar) [12:46:55] Urbanecm: can you review, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/629371 - I plan to deploy this on Monday. [12:47:58] kart_: looks good, but maybe servers will disagree 🙂 [12:48:54] Urbanecm: haha :) [12:49:21] Urbanecm: yeah. Last time missing bits were pointing testwiki CX to testwiki DB. 
[12:49:27] yeah [12:49:32] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629371 (https://phabricator.wikimedia.org/T263417) (owner: 10KartikMistry) [12:49:40] Thanks! [12:49:49] I've +1'ed it, let's see what happens :) [12:49:58] OK. Monday! [12:50:10] !log upgrading bird on centtrallog1001 [12:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:17] (03PS1) 10Gehel: wdqs: fixed prometheus agent port for streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/629674 [12:50:29] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)): scap configuration in puppet defaults to forge the git repo name with 'mediawiki/services/' - https://phabricator.wikimedia.org/T257413 (10hashar) I... [12:51:23] (03CR) 10DCausse: [C: 03+1] wdqs: fixed prometheus agent port for streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/629674 (owner: 10Gehel) [12:51:25] (03PS8) 10Hashar: Switch CI to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: 10Muehlenhoff) [12:51:43] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: 10Muehlenhoff) [12:51:46] (03Abandoned) 10DCausse: [wdqs] use an Integer instead of String for jmx_exporter port [puppet] - 10https://gerrit.wikimedia.org/r/626129 (owner: 10DCausse) [12:52:12] (03Abandoned) 10Gehel: wdqs: fixed prometheus agent port for streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/629674 (owner: 10Gehel) [12:52:45] (03CR) 10jerkins-bot: [V: 04-1] Switch CI to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: 10Muehlenhoff) [12:53:23] (03Restored) 10Gehel: [wdqs] use an Integer instead of String for jmx_exporter port [puppet] 
- 10https://gerrit.wikimedia.org/r/626129 (owner: 10DCausse) [12:53:24] (03PS2) 10Gehel: [wdqs] use an Integer instead of String for jmx_exporter port [puppet] - 10https://gerrit.wikimedia.org/r/626129 (owner: 10DCausse) [12:54:42] (03CR) 10Gehel: [C: 03+2] [wdqs] use an Integer instead of String for jmx_exporter port [puppet] - 10https://gerrit.wikimedia.org/r/626129 (owner: 10DCausse) [12:55:13] (03PS9) 10Hashar: Switch CI to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: 10Muehlenhoff) [12:55:47] (03CR) 10Hashar: "Updated Hosts: header to please the commit message validator and updated the two releases hosts that have since been rebuild and given new" [puppet] - 10https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: 10Muehlenhoff) [12:55:57] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: 10Muehlenhoff) [12:56:43] (03PS1) 10Jbond: postgresql::server: add types and new params [puppet] - 10https://gerrit.wikimedia.org/r/629676 (https://phabricator.wikimedia.org/T263578) [12:56:45] (03CR) 10JMeybohm: [C: 03+1] push-notifications: enable service proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/629656 (https://phabricator.wikimedia.org/T260247) (owner: 10Effie Mouzeli) [12:56:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:57:05] (03CR) 10jerkins-bot: [V: 04-1] postgresql::server: add types and new params [puppet] - 10https://gerrit.wikimedia.org/r/629676 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [12:57:23] (03CR) 10JMeybohm: [C: 03+2] services_proxy: switch mathoid to the TLS endpoint [puppet] - 
10https://gerrit.wikimedia.org/r/629327 (https://phabricator.wikimedia.org/T255875) (owner: 10JMeybohm) [12:58:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:58:37] !log switched mathoid service-proxy listener to use TLS - T255875 [12:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:42] T255875: Move mathoid to use TLS only - https://phabricator.wikimedia.org/T255875 [13:01:51] (03PS1) 10Kormat: Remove unused class attributes [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/629677 [13:02:25] (03CR) 10Ayounsi: [C: 03+2] Add damping on Anycast BGP sessions [homer/public] - 10https://gerrit.wikimedia.org/r/629652 (https://phabricator.wikimedia.org/T262372) (owner: 10Ayounsi) [13:02:50] (03Merged) 10jenkins-bot: Add damping on Anycast BGP sessions [homer/public] - 10https://gerrit.wikimedia.org/r/629652 (https://phabricator.wikimedia.org/T262372) (owner: 10Ayounsi) [13:08:21] (03PS2) 10Jbond: postgresql::server: add types and new params [puppet] - 10https://gerrit.wikimedia.org/r/629676 (https://phabricator.wikimedia.org/T263578) [13:08:44] (03CR) 10jerkins-bot: [V: 04-1] postgresql::server: add types and new params [puppet] - 10https://gerrit.wikimedia.org/r/629676 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [13:08:46] (03CR) 10Kormat: [C: 03+2] Remove unused class attributes [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/629677 (owner: 10Kormat) [13:09:39] (03Merged) 10jenkins-bot: Remove unused class attributes [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/629677 (owner: 10Kormat) [13:10:34] (03PS3) 10Jbond: postgresql::server: add types and new params [puppet] - 10https://gerrit.wikimedia.org/r/629676 (https://phabricator.wikimedia.org/T263578) [13:16:41] (03CR) 10Andrew Bogott: 
OpenStack: add initial manifests for OpenStack Barbican, a secrets API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629472 (https://phabricator.wikimedia.org/T263680) (owner: 10Andrew Bogott) [13:17:00] (03PS1) 10Kormat: wmfmariadb: Fix 2 lint warnings for unused variables. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/629679 [13:17:49] !log add damping to anycast BGP - T262372 [13:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:55] T262372: Configure BGP route damping on Anycast sessions - https://phabricator.wikimedia.org/T262372 [13:18:56] confirmed that the BGP sessions don't bounce [13:19:59] (03PS16) 10Andrew Bogott: OpenStack: add initial manifests for OpenStack Barbican, a secrets API [puppet] - 10https://gerrit.wikimedia.org/r/629472 (https://phabricator.wikimedia.org/T263680) [13:22:53] !log moved the hadoop cluster to puppet TLS certificates - T253957 [13:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:58] T253957: Fix TLS certificate location and expire for Hadoop/Presto/etc.. and add alarms on TLS cert expiry - https://phabricator.wikimedia.org/T253957 [13:24:34] 10Operations, 10Traffic: Consolidate edge bastion server into ganeti - https://phabricator.wikimedia.org/T257324 (10MoritzMuehlenhoff) > Security - are we ok with ssh bastions inside ganeti alongside other public service instances? Sounds fine to me. As long as we have two baremetal bastions in eqiad/codfw wh... 
[13:29:09] (03PS5) 10Filippo Giunchedi: profile: add alertmanager::alerts [puppet] - 10https://gerrit.wikimedia.org/r/629153 (https://phabricator.wikimedia.org/T258948) [13:30:15] (03PS1) 10Andrew Bogott: wmcs codfw1dev haproxy: add proxy for barbican api [puppet] - 10https://gerrit.wikimedia.org/r/629680 (https://phabricator.wikimedia.org/T263680) [13:31:32] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:32:23] !log Increased retention time for *.mediawiki.job.processMediaModeration topics in kafka main-eqiad and main-codfw to 31 days (as per request from Pchelolo ) [13:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:06] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:34:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:35:19] (03CR) 10Kormat: [C: 03+2] wmfmariadb: Fix 2 lint warnings for unused variables. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/629679 (owner: 10Kormat) [13:35:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:36:12] (03PS6) 10Elukey: WIP profile::hadoop::common: test regex [puppet] - 10https://gerrit.wikimedia.org/r/629647 [13:36:18] (03Merged) 10jenkins-bot: wmfmariadb: Fix 2 lint warnings for unused variables. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/629679 (owner: 10Kormat) [13:38:29] (03CR) 10Muehlenhoff: [C: 03+2] Add a helper to dump/restore memcached for reboots [puppet] - 10https://gerrit.wikimedia.org/r/629344 (https://phabricator.wikimedia.org/T233933) (owner: 10Muehlenhoff) [13:38:31] (03PS5) 10Muehlenhoff: Add a helper to dump/restore memcached for reboots [puppet] - 10https://gerrit.wikimedia.org/r/629344 (https://phabricator.wikimedia.org/T233933) [13:38:51] (03PS2) 10JMeybohm: service: add TLS endpoint for zotero 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/629334 (https://phabricator.wikimedia.org/T255869) [13:46:04] (03PS7) 10Elukey: WIP profile::hadoop::common: test regex [puppet] - 10https://gerrit.wikimedia.org/r/629647 [13:46:37] 10Operations, 10Traffic: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557 (10ema) ulsfo upgraded! [13:47:24] (03CR) 10Alexandros Kosiaris: [C: 03+1] service: add TLS endpoint for zotero 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/629334 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [13:47:39] (03CR) 10Alexandros Kosiaris: [C: 03+1] service: add TLS endpoint for zotero 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/629335 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [13:47:53] (03CR) 10Alexandros Kosiaris: [C: 03+1] service: add TLS endpoint for zotero 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/629336 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [13:48:30] (03CR) 10Alexandros Kosiaris: [C: 03+1] services_proxy: switch zotero to the TLS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/629337 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [13:48:49] 10Operations, 10Analytics: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10CDanis) >>! 
In T263496#6485312, @Ottomata wrote: >> The long-term answer (which might be stream processing stuff?) > is stream processing stuff > >> In the very short ter... [13:48:54] (03CR) 10Nikerabbit: "In theory it is not needed. However all my sites have it, so I wouldn't know for sure without testing." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628773 (https://phabricator.wikimedia.org/T263441) (owner: 10Lucas Werkmeister (WMDE)) [13:49:05] (03PS8) 10Elukey: WIP profile::hadoop::common: test regex [puppet] - 10https://gerrit.wikimedia.org/r/629647 [13:49:10] (03CR) 10Alexandros Kosiaris: [C: 03+1] lvs: Remove zotero non-TLS endpoint 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/629338 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [13:50:03] (03CR) 10Alexandros Kosiaris: [C: 03+1] lvs: Remove zotero non-TLS endpoint 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/629339 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [13:50:14] (03PS1) 10Ayounsi: Depool eqiad for row D recabling [dns] - 10https://gerrit.wikimedia.org/r/629681 (https://phabricator.wikimedia.org/T256112) [13:50:25] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1033 psu redundancy alert - https://phabricator.wikimedia.org/T263145 (10Cmjohnson) I cleared the log on idrac, I did have to remove power to it ...rack C8 power was all messed up, nothing was correct, and moving a server into that rack requ... [13:50:27] 10Operations, 10netops: Consider balancing VRRP primaries to cr1/cr2 - https://phabricator.wikimedia.org/T263212 (10CDanis) >>! In T263212#6490669, @ayounsi wrote: > This is now pushed to eqiad and codfw. Result can be seen on: > https://librenms.wikimedia.org/graphs/id=16333/type=port_bits/ > and > https://l... 
[13:51:39] (03CR) 10Ayounsi: [C: 03+2] Depool eqiad for row D recabling [dns] - 10https://gerrit.wikimedia.org/r/629681 (https://phabricator.wikimedia.org/T256112) (owner: 10Ayounsi) [13:52:51] !log depool eqiad for row D recabling - T256112 [13:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:57] T256112: eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 [13:53:59] 10Operations, 10netops: Configure BGP route damping on Anycast sessions - https://phabricator.wikimedia.org/T262372 (10ayounsi) 05Open→03Resolved This is all done. [13:54:10] 10Operations, 10ops-eqiad: kakfa-jumbo1008 psu redundacy fail - https://phabricator.wikimedia.org/T263262 (10Cmjohnson) 05Open→03Resolved power cable was not properly seated...corrected it Record: 19 Date/Time: 09/24/2020 13:52:49 Source: system Severity: Ok Description: The power supplies... [13:54:35] 10Operations, 10ops-eqiad: an-worker1115 lost PSU redundancy - https://phabricator.wikimedia.org/T263569 (10Cmjohnson) 05Open→03Resolved the power cable was not properly seated. Fixed [13:58:44] !log upgrading mariadb on cloudcontrol-2001/2003/2004 [13:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:59] (03PS9) 10Elukey: WIP profile::hadoop::common: test regex [puppet] - 10https://gerrit.wikimedia.org/r/629647 [14:02:39] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10Cmjohnson) @Jclark-ctr Have you had any discussion with HPE about this? 
[14:02:58] (03CR) 10jerkins-bot: [V: 04-1] WIP profile::hadoop::common: test regex [puppet] - 10https://gerrit.wikimedia.org/r/629647 (owner: 10Elukey) [14:03:27] 10Operations, 10Analytics: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10JAllemandou) Question on the need for data @CDanis : Is the data augmentation needed in stream, or would refinement on the cluster be sufficient? [14:03:32] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 45.03 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:05:03] (03CR) 10JMeybohm: [C: 03+2] service: add TLS endpoint for zotero 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/629334 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [14:06:10] 10Operations, 10ops-eqiad, 10serviceops: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (10Cmjohnson) a:05Cmjohnson→03RobH @robh power has been pulled and flea power drained [14:06:38] (03PS1) 10Volans: dns: replace module with the one in wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/629686 (https://phabricator.wikimedia.org/T257905) [14:06:51] 10Operations, 10Analytics: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10CDanis) >>! In T263496#6491422, @JAllemandou wrote: > Question on the need for data @CDanis : Is the data augmentation needed in stream, or would refinement on the cluster... [14:09:16] PROBLEM - Check systemd state on ms-be2054 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:26] !log removing the cable connected to FPC1:1/0 (DAC 3m) FPC8:1/0 (DAC 3m) [14:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:31] !log running puppet on lvs servers - T255869 [14:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:38] T255869: Move zotero to use TLS only - https://phabricator.wikimedia.org/T255869 [14:11:16] (03PS3) 10Hnowlan: changeprop/changeprop-jobqueue: increase memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/629658 [14:12:48] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 57 connections established with conf2001.codfw.wmnet:2379 (min=58) https://wikitech.wikimedia.org/wiki/PyBal [14:13:32] PROBLEM - Host syslog.anycast.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:14:14] (03CR) 10Elukey: [C: 03+1] dns: replace module with the one in wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/629686 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [14:15:14] are we supposed to see syslog.anycast.wmnet down? 
[14:15:30] RECOVERY - IPMI Sensor Status on kafka-jumbo1008 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:15:37] godog: --^ [14:15:52] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.16:4969]) https://wikitech.wikimedia.org/wiki/PyBal [14:15:57] let me know if I should look, but in the middle of a maintenance right now [14:16:02] !log [Netops] Disable unused VC ports to not risk them going online at connect: - T256112 [14:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:09] PyBal IPVS diff is me again [14:16:10] T256112: eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 [14:16:10] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 99 connections established with conf1004.eqiad.wmnet:4001 (min=100) https://wikitech.wikimedia.org/wiki/PyBal [14:16:18] XioNoX: there is a syslog.anycast.wmnet down alert, that is weird [14:16:22] expected? 
[14:16:42] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.16:4969]) https://wikitech.wikimedia.org/wiki/PyBal [14:16:52] RECOVERY - Juniper virtual chassis ports on asw2-d-eqiad is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [14:16:54] !log restart pybal on lvs1016.eqiad.wmnet,lvs2010.codfw.wmnet - T255869 [14:16:59] can't ping 10.3.0.4 [14:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:01] T255869: Move zotero to use TLS only - https://phabricator.wikimedia.org/T255869 [14:17:33] I'm in a meeting, cc moritzm re: anycast [14:17:52] (03PS4) 10Jbond: postgresql::server: add types and new params [puppet] - 10https://gerrit.wikimedia.org/r/629676 (https://phabricator.wikimedia.org/T263578) [14:18:18] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 67 connections established with conf1004.eqiad.wmnet:4001 (min=68) https://wikitech.wikimedia.org/wiki/PyBal [14:18:27] !log restart pybal on lvs1015.eqiad.wmnet,lvs2009.codfw.wmnet - T255869 [14:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:00] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:19:23] !log remove damping on anycast group for cr2-codfw [14:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:41] okok sounds that it might be only on the eqiad side due to maintenance then [14:19:42] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 68 connections established with conf1004.eqiad.wmnet:4001 (min=68) https://wikitech.wikimedia.org/wiki/PyBal [14:19:42] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 100 connections established with conf1004.eqiad.wmnet:4001 (min=100) https://wikitech.wikimedia.org/wiki/PyBal [14:19:42] RECOVERY - PyBal 
connections to etcd on lvs2009 is OK: OK: 58 connections established with conf2001.codfw.wmnet:2379 (min=58) https://wikitech.wikimedia.org/wiki/PyBal [14:19:42] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:19:44] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:19:45] I can ping it back [14:19:58] volans: yeah it got removed because of damping [14:20:02] in both sites [14:20:11] I disabled anycast damping on cr2-codfw [14:20:18] k [14:20:22] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 182 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:20:24] (03PS10) 10Elukey: WIP profile::hadoop::common: test regex [puppet] - 10https://gerrit.wikimedia.org/r/629647 [14:20:42] will have a look after this maintenance, unless someone knows why it flapped? 
[14:21:28] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 5 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:24:00] RECOVERY - Host syslog.anycast.wmnet is UP: PING OK - Packet loss = 0%, RTA = 33.56 ms [14:24:48] (03PS1) 10Tchanders: Enable mobile block notice tracking in MobileFrontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629691 (https://phabricator.wikimedia.org/T260218) [14:25:17] (03CR) 10jerkins-bot: [V: 04-1] Enable mobile block notice tracking in MobileFrontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629691 (https://phabricator.wikimedia.org/T260218) (owner: 10Tchanders) [14:25:24] (03PS5) 10Jbond: postgresql::server: add types and new params [puppet] - 10https://gerrit.wikimedia.org/r/629676 (https://phabricator.wikimedia.org/T263578) [14:25:48] (03PS1) 10Volans: sre.hosts.decommission: import from new library [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905) [14:26:18] (03CR) 10Volans: [C: 04-2] "Do not merge until the depend-on has been released to production" [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [14:27:40] (03CR) 10Volans: [C: 03+2] dns: replace module with the one in wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/629686 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [14:27:45] (03CR) 10Jbond: [C: 03+2] postgresql::server: add types and new params [puppet] - 10https://gerrit.wikimedia.org/r/629676 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [14:28:09] (03PS2) 10JMeybohm: service: add TLS endpoint for zotero 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/629335 (https://phabricator.wikimedia.org/T255869) [14:28:37] !log [Netops] In window: turn VC-ports on/off for proper 
cabling: - T256112 [14:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:45] T256112: eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 [14:29:06] (03CR) 10Ppchelko: [C: 04-1] changeprop/changeprop-jobqueue: increase memory limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/629658 (owner: 10Hnowlan) [14:29:18] (03CR) 10JMeybohm: [C: 03+2] service: add TLS endpoint for zotero 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/629335 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [14:29:34] PROBLEM - Host mw1350 is DOWN: PING CRITICAL - Packet loss = 100% [14:29:34] PROBLEM - Host es1018 is DOWN: PING CRITICAL - Packet loss = 100% [14:29:34] (03PS11) 10Elukey: profile::hadoop::common: get the datanode mountpoints from facter [puppet] - 10https://gerrit.wikimedia.org/r/629647 [14:29:38] PROBLEM - Host mw1349 is DOWN: PING CRITICAL - Packet loss = 100% [14:29:38] PROBLEM - Host restbase-dev1006 is DOWN: PING CRITICAL - Packet loss = 100% [14:29:40] PROBLEM - Host snapshot1009 is DOWN: PING CRITICAL - Packet loss = 100% [14:29:42] PROBLEM - Host stat1005 is DOWN: PING CRITICAL - Packet loss = 100% [14:29:48] PROBLEM - Host mw1357 is DOWN: PING CRITICAL - Packet loss = 100% [14:29:50] PROBLEM - Host mw1351 is DOWN: PING CRITICAL - Packet loss = 100% [14:29:52] PROBLEM - Host wdqs1008 is DOWN: PING CRITICAL - Packet loss = 100% [14:29:53] (03CR) 10jerkins-bot: [V: 04-1] dns: replace module with the one in wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/629686 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [14:30:04] er, looks like we lost D1 [14:30:04] PROBLEM - Host logstash1012 is DOWN: PING CRITICAL - Packet loss = 100% [14:30:04] D1: Initial commit - https://phabricator.wikimedia.org/D1 [14:30:12] RECOVERY - Host mw1350 is UP: PING WARNING - Packet loss = 50%, RTA = 0.40 ms [14:30:12] RECOVERY - Host es1018 is UP: PING WARNING - 
Packet loss = 66%, RTA = 0.66 ms [14:30:12] RECOVERY - Host mw1357 is UP: PING WARNING - Packet loss = 75%, RTA = 3.15 ms [14:30:12] RECOVERY - Host snapshot1009 is UP: PING WARNING - Packet loss = 50%, RTA = 1.76 ms [14:30:14] RECOVERY - Host stat1005 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [14:30:14] RECOVERY - Host logstash1012 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [14:30:14] RECOVERY - Host mw1351 is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [14:30:16] RECOVERY - Host wdqs1008 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [14:30:16] RECOVERY - Host restbase-dev1006 is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [14:30:18] RECOVERY - Host mw1349 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [14:30:48] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:31:00] PROBLEM - Host dumpsdata1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:31:20] PROBLEM - Host wdqs1008 is DOWN: PING CRITICAL - Packet loss = 100% [14:31:28] RECOVERY - Host wdqs1008 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [14:31:36] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:31:41] ok, looks like it's back [14:31:48] both ports are properly up [14:31:52] (03PS2) 10Tchanders: Enable mobile block notice tracking in MobileFrontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629691 (https://phabricator.wikimedia.org/T260218) [14:32:14] 04Critical Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Emergency syslog message [14:32:14] RECOVERY - Host dumpsdata1002 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [14:32:46] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me [14:33:36] (03CR) 10MSantos: [C: 03+1] push-notifications: enable service proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/629656 (https://phabricator.wikimedia.org/T260247) (owner: 10Effie Mouzeli) [14:33:52] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/629686 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [14:34:00] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 239 threshold =0.15 breach: active_primary_shards: 763, delayed_unassigned_shards: 0, number_of_data_nodes: 6, active_shards: 1291, relocating_shards: 0, initializing_shards: 4, number_of_pending_tasks: 0, cluster_name: cloudelastic-chi-eqiad, number_of_nodes: 6, timed_out: False, unassigned_shards: 23 [14:34:00] percent_as_number: 84.37908496732027, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, 
status: yellow https://wikitech.wikimedia.org/wiki/Search%23Administration [14:34:04] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch inactive shards 236 threshold =0.15 breach: number_of_in_flight_fetch: 0, number_of_pending_tasks: 1, active_shards: 1294, status: yellow, timed_out: False, active_primary_shards: 763, task_max_waiting_in_queue_millis: 0, number_of_nodes: 6, cluster_name: cloudelastic-chi-eqiad, active_shards_percent_as_number: 84.575 [14:34:04] cating_shards: 0, number_of_data_nodes: 6, initializing_shards: 4, unassigned_shards: 232, delayed_unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:34:18] PROBLEM - Check systemd state on mw1349 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:38] PROBLEM - MariaDB Replica Lag: s7 on db1125 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 348.92 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:34:48] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:34:48] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [14:35:00] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1002 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: relocating_shards: 0, status: yellow, number_of_in_flight_fetch: 0, unassigned_shards: 60, timed_out: False, cluster_name: cloudelastic-chi-eqiad, number_of_data_nodes: 6, initializing_shards: 4, task_max_waiting_in_queue_millis: 329, active_primary_shards: 763, active_shards: 1466, number_of_pendi [14:35:00] yed_unassigned_shards: 0, active_shards_percent_as_number: 95.81699346405229, number_of_nodes: 6 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:35:04] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: number_of_pending_tasks: 0, delayed_unassigned_shards: 0, cluster_name: cloudelastic-chi-eqiad, active_primary_shards: 763, number_of_nodes: 6, status: yellow, task_max_waiting_in_queue_millis: 0, number_of_data_nodes: 6, timed_out: False, relocating_shards: 0, unassigned_shards: 48, active_shards: [14:35:04] rds_percent_as_number: 96.60130718954248, initializing_shards: 4, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:35:34] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:40] RECOVERY - MariaDB Replica Lag: s7 on db1125 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:35:53] (03CR) 10Ottomata: [C: 03+1] "WOW nice one." [puppet] - 10https://gerrit.wikimedia.org/r/629647 (owner: 10Elukey) [14:35:58] (03CR) 10Mholloway: "Does this need a +2 from SRE, or can PI merge and deploy at our convenience?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/629656 (https://phabricator.wikimedia.org/T260247) (owner: 10Effie Mouzeli) [14:36:52] (03CR) 10Effie Mouzeli: "> Patch Set 3:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/629656 (https://phabricator.wikimedia.org/T260247) (owner: 10Effie Mouzeli) [14:37:02] (03CR) 10Effie Mouzeli: [C: 03+2] push-notifications: enable service proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/629656 (https://phabricator.wikimedia.org/T260247) (owner: 10Effie Mouzeli) [14:38:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw2-d-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message [14:38:36] 10Operations, 10ops-eqiad, 10serviceops: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` mw1360.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202009241438_robh_29643... 
[14:38:38] 10Operations, 10ops-eqiad, 10serviceops: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1360.eqiad.wmnet'] ` Of which those **FAILED**: ` ['mw1360.eqiad.wmnet'] ` [14:38:46] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:52] (03PS1) 10Volans: dns: exit with 0 if no changes and --icinga-check [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/629694 (https://phabricator.wikimedia.org/T258729) [14:39:02] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [14:39:11] (03Merged) 10jenkins-bot: push-notifications: enable service proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/629656 (https://phabricator.wikimedia.org/T260247) (owner: 10Effie Mouzeli) [14:39:16] 10Operations, 10ops-eqiad, 10serviceops: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` mw1360.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202009241439_robh_30270... [14:40:14] (03CR) 10Volans: [C: 03+2] "The systemd unit fails when there are no changes." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/629694 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [14:41:08] (03PS12) 10Elukey: profile::hadoop::common: get the datanode mountpoints from facter [puppet] - 10https://gerrit.wikimedia.org/r/629647 [14:41:24] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [14:42:29] (03CR) 10Volans: [C: 03+2] dns: replace module with the one in wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/629686 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [14:44:42] PROBLEM - Host db2125.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:44:57] alright all the recabling is done [14:46:00] (03Merged) 10jenkins-bot: dns: replace module with the one in wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/629686 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [14:49:08] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /_info/name (retrieve service name) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a re [14:49:08] ed https://wikitech.wikimedia.org/wiki/Proton [14:49:10] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: 
job={swagger_check_citoid_cluster_eqiad,swagger_check_eventgate_analytics_external_cluster_eqiad,swagger_check_mathoid_cluster_eqiad,swagger_check_mathoid_http_cluster_eqiad,swagger_check_mobileapps_cluster_eqiad,swagger_check_sessionstore_eqiad,swagger_check_termbox_eqiad,swagger_check_wikifeeds_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Promethe [14:49:10] ob_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:50:30] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [14:50:36] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me [14:51:16] (03PS13) 10Elukey: profile::hadoop::common: get the datanode mountpoints from facter [puppet] - 10https://gerrit.wikimedia.org/r/629647 [14:51:47] (03PS2) 10JMeybohm: service: add TLS endpoint for zotero 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/629336 (https://phabricator.wikimedia.org/T255869) [14:51:52] PROBLEM - Check systemd state on es1025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:52:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:52:11] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [14:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:24] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:54:11] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:31] (03CR) 10JMeybohm: [C: 03+2] service: add TLS endpoint for zotero 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/629336 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [14:55:50] 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Marostegui) [14:56:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2127 (re)pooling @ 33%: Slowly repool db2127 ', diff saved to https://phabricator.wikimedia.org/P12792 and previous config saved to /var/cache/conftool/dbconfig/20200924-145617-root.json [14:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:36] (03PS1) 10Ayounsi: Revert "Depool eqiad for row D recabling" [dns] - 10https://gerrit.wikimedia.org/r/629519 [14:58:46] RECOVERY - Check systemd state on ms-be2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:59:57] (03CR) 10Ayounsi: [C: 03+2] Revert "Depool eqiad for row D recabling" [dns] - 10https://gerrit.wikimedia.org/r/629519 (owner: 10Ayounsi) [15:00:00] 10Operations, 
10ops-eqiad, 10serviceops: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1360.eqiad.wmnet'] ` Of which those **FAILED**: ` ['mw1360.eqiad.wmnet'] ` [15:00:30] !log repool eqiad - T256112 [15:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:38] T256112: eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 [15:01:49] 10Operations, 10ops-eqiad, 10netops: asw2-d1-eqiad:VCP failure - https://phabricator.wikimedia.org/T252797 (10ayounsi) [15:01:51] 10Operations, 10netops, 10Sustainability (Incident Followup): D1<->D8 VC link failure - https://phabricator.wikimedia.org/T251663 (10ayounsi) [15:01:53] 10Operations, 10ops-eqiad, 10netops: Three ports on asw2-d-eqiad are not working as expected - https://phabricator.wikimedia.org/T247881 (10ayounsi) [15:02:04] 10Operations, 10ops-eqiad, 10DBA, 10netops, and 3 others: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10ayounsi) [15:02:05] 10Operations, 10ops-eqiad, 10netops: eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 (10ayounsi) 05Open→03Resolved All done * we briefly (<5s) lost `D1` * some disabled ports automatically re-enabled themselves, causing some latency issues `1/1 Auto-Configured -... [15:02:26] 10Operations, 10netops, 10Sustainability (Incident Followup): D1<->D8 VC link failure - https://phabricator.wikimedia.org/T251663 (10ayounsi) 05Open→03Resolved Solved in T256112. [15:03:00] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01372 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:03:00] 10Operations, 10ops-eqiad, 10netops: asw2-d1-eqiad:VCP failure - https://phabricator.wikimedia.org/T252797 (10ayounsi) 05Stalled→03Resolved Solved in T256112. 
[15:04:44] 10Operations, 10Analytics: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10JAllemandou) > Currently these reports are going to Logstash; I don't think there's any refinement possible there? Not the refinement we do usually on the cluster indeed.... [15:05:05] 10Operations, 10ops-eqiad, 10DBA, 10netops, and 3 others: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10ayounsi) a:05ayounsi→03Cmjohnson [15:05:59] (03PS3) 10JMeybohm: services_proxy: switch zotero to the TLS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/629337 (https://phabricator.wikimedia.org/T255869) [15:07:39] (03PS1) 10Jbond: postgress::server: add Types and additinal options [puppet] - 10https://gerrit.wikimedia.org/r/629698 (https://phabricator.wikimedia.org/T263578) [15:07:47] (03CR) 10JMeybohm: [C: 03+2] services_proxy: switch zotero to the TLS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/629337 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [15:08:10] (03CR) 10jerkins-bot: [V: 04-1] postgress::server: add Types and additinal options [puppet] - 10https://gerrit.wikimedia.org/r/629698 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [15:10:18] !log switched zotero service-proxy listener to use TLS - T255869 [15:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:26] T255869: Move zotero to use TLS only - https://phabricator.wikimedia.org/T255869 [15:11:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2127 (re)pooling @ 66%: Slowly repool db2127 ', diff saved to https://phabricator.wikimedia.org/P12793 and previous config saved to /var/cache/conftool/dbconfig/20200924-151120-root.json [15:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:28] (03PS1) 10Kormat: db1109: Re-enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/629699 [15:12:25] (03CR) 
10Kormat: [C: 03+2] db1109: Re-enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/629699 (owner: 10Kormat) [15:12:37] (03PS1) 10Jforrester: HookRunner: onAbuseFilterGenerateUserVars should run generateUserVars [extensions/AbuseFilter] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/629521 (https://phabricator.wikimedia.org/T263750) [15:15:09] (03PS1) 10Kormat: mariadb: Promote db1104 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/629707 (https://phabricator.wikimedia.org/T239238) [15:15:30] !log mw1360 scap and repooled post work via T262151 [15:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:36] T262151: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 [15:21:10] (03CR) 10Kormat: [C: 04-2] "Don't merge until maintenance window." [puppet] - 10https://gerrit.wikimedia.org/r/629707 (https://phabricator.wikimedia.org/T239238) (owner: 10Kormat) [15:22:15] (03PS1) 10Kormat: wmnet: Update s8-master alias [dns] - 10https://gerrit.wikimedia.org/r/629716 (https://phabricator.wikimedia.org/T239238) [15:22:37] (03CR) 10Kormat: [C: 04-2] "Don't merge until maintenance window." [dns] - 10https://gerrit.wikimedia.org/r/629716 (https://phabricator.wikimedia.org/T239238) (owner: 10Kormat) [15:22:41] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.004988 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:22:46] (03PS1) 10CDanis: WIP: extend NEL to group1 wikis; lower sampling rate to 5% [puppet] - 10https://gerrit.wikimedia.org/r/629717 [15:22:53] 10Operations, 10ops-eqiad, 10serviceops: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (10RobH) Host is now online (reimaged) and returned to service post scap pull and repool. Set to active in netbox. However, its not in the DSH node groups, and the directions aren't clear on where th... 
[15:26:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2127 (re)pooling @ 100%: Slowly repool db2127 ', diff saved to https://phabricator.wikimedia.org/P12794 and previous config saved to /var/cache/conftool/dbconfig/20200924-152626-root.json [15:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:09] 10Operations, 10Traffic, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10faidon) I wonder as what kind of ASN would these flows show up as, as well as whether we could have a dimension to be able to differentiate between internet traffic, and backhaul traffi... [15:27:11] (03CR) 10Herron: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/427928 (https://phabricator.wikimedia.org/T190318) (owner: 10Herron) [15:27:15] jouncebot: now [15:27:16] No deployments scheduled for the next 0 hour(s) and 32 minute(s) [15:27:23] jouncebot: next [15:27:23] In 0 hour(s) and 32 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200924T1600) [15:27:56] (03Abandoned) 10Herron: puppetmaster: remove support for puppetdb 2.x [puppet] - 10https://gerrit.wikimedia.org/r/427928 (https://phabricator.wikimedia.org/T190318) (owner: 10Herron) [15:27:58] (03PS2) 10CDanis: WIP: extend NEL to group1 wikis; lower sampling rate to 5% [puppet] - 10https://gerrit.wikimedia.org/r/629717 [15:29:20] dancy: heads-up, I'm going to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AbuseFilter/+/629521, which fixes a blocker T263750 [15:29:21] T263750: abuse filter matching global_user_groups doesn't work any more - https://phabricator.wikimedia.org/T263750 [15:29:34] Gotcha. 
[15:29:36] (03CR) 10Urbanecm: [C: 03+2] HookRunner: onAbuseFilterGenerateUserVars should run generateUserVars [extensions/AbuseFilter] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/629521 (https://phabricator.wikimedia.org/T263750) (owner: 10Jforrester) [15:30:10] 10Operations, 10Traffic, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10CDanis) >>! In T263277#6491680, @faidon wrote: > The per-ASN views we have right now for front-facing traffic are priceless, and it would be a pity to make navigating these more difficu... [15:31:16] (03PS1) 10Hnowlan: restbase102[89]/restbase1030: add cassandra hosts for new nodes [dns] - 10https://gerrit.wikimedia.org/r/629723 (https://phabricator.wikimedia.org/T261512) [15:31:40] hello, anyone around who can answer questions about varnish caching? (or whatever else we have in front of mediawiki that does caching?) [15:31:50] for example, this page is cached when you visit it as anon user: https://cs.wikipedia.org/wiki/Diskuse:Světla_velkoměsta how can i tell when will that cache be purged/invalidated? [15:32:09] is this the right documentation, or is it outdated, or about something else? 
https://wikitech.wikimedia.org/wiki/Varnish#TTL [15:32:31] (03CR) 10Jbond: [C: 03+1] "nice looks good to me" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629647 (owner: 10Elukey) [15:33:37] (03PS1) 10Arturo Borrero Gonzalez: openstack: dns_floating_ip_updater: only run in one control plane node [puppet] - 10https://gerrit.wikimedia.org/r/629724 [15:33:50] (03PS3) 10CDanis: Extend NEL to group1 wikis; lower sampling rate to 5% [puppet] - 10https://gerrit.wikimedia.org/r/629717 (https://phabricator.wikimedia.org/T257527) [15:34:26] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:35:11] (03PS4) 10CDanis: Extend NEL to group1 wikis; lower sampling rate to 5% [puppet] - 10https://gerrit.wikimedia.org/r/629717 (https://phabricator.wikimedia.org/T257527) [15:35:58] (03PS2) 10Jbond: postgress::server: add Types and additinal options [puppet] - 10https://gerrit.wikimedia.org/r/629698 (https://phabricator.wikimedia.org/T263578) [15:36:34] (03CR) 10jerkins-bot: [V: 04-1] postgress::server: add Types and additinal options [puppet] - 10https://gerrit.wikimedia.org/r/629698 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [15:38:50] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:39:01] (03PS3) 10Jbond: postgress::server: add Types and additinal options [puppet] - 10https://gerrit.wikimedia.org/r/629698 (https://phabricator.wikimedia.org/T263578) [15:41:20] (03CR) 10jerkins-bot: [V: 04-1] postgress::server: add Types and additinal options [puppet] - 10https://gerrit.wikimedia.org/r/629698 
(https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [15:41:22] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack: add initial manifests for OpenStack Barbican, a secrets API [puppet] - 10https://gerrit.wikimedia.org/r/629472 (https://phabricator.wikimedia.org/T263680) (owner: 10Andrew Bogott) [15:41:58] (03CR) 10Andrew Bogott: [C: 03+2] wmcs codfw1dev haproxy: add proxy for barbican api [puppet] - 10https://gerrit.wikimedia.org/r/629680 (https://phabricator.wikimedia.org/T263680) (owner: 10Andrew Bogott) [15:43:43] !log Rename all local Oversight accounts but enwiki to Oversight~dbname, see task for full list (T263760) [15:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:50] T263760: User:Oversight should be unified at all projects - https://phabricator.wikimedia.org/T263760 [15:43:52] RECOVERY - Host db2125.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.00 ms [15:44:08] !log Run `mwscript extensions/CentralAuth/maintenance/migrateAccount.php --wiki=enwiki --username=Oversight` (T263760) [15:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:29] (03PS1) 10Effie Mouzeli: varnish: check for pageview=0 value in X-Analytics header [puppet] - 10https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T262202) [15:45:31] (03CR) 10Volans: [C: 04-1] "Temporary -1, I've commented on the task, would like to know more why they are in this state." 
[dns] - 10https://gerrit.wikimedia.org/r/629723 (https://phabricator.wikimedia.org/T261512) (owner: 10Hnowlan) [15:46:37] !log Run `mwscript extensions/CentralAuth/maintenance/migrateAccount.php --wiki=simplewiki --username="Oversight~simplewiki"` (T263760) [15:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:37] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/629737 (owner: 10CRusnov) [15:50:05] i found the answer to my caching questions, and it's written here: https://wikitech.wikimedia.org/wiki/Caching_overview "max-age is 14 days" [15:50:05] (03PS4) 10Jbond: postgress::server: add Types and additinal options [puppet] - 10https://gerrit.wikimedia.org/r/629698 (https://phabricator.wikimedia.org/T263578) [15:50:36] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:51:09] (03CR) 10jerkins-bot: [V: 04-1] postgress::server: add Types and additinal options [puppet] - 10https://gerrit.wikimedia.org/r/629698 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [15:51:25] (03PS4) 10Hnowlan: changeprop/changeprop-jobqueue: increase memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/629658 [15:51:27] MatmaRex: I actually think it's much less than that nowadays and the docs are out of date [15:52:10] cdanis: aha, i knew it, i just had to post the wrong answer rather than a question! 
:D [15:52:31] cdanis: i actually just curl-ed about a hundred random pages, and the last-modified dates are about 14 days old, so that seemed right to me [15:52:33] btw #wikimedia-traffic or #wikimedia-sre is a better place for such questions; more humans and less bot-spam [15:52:53] (03CR) 10Ppchelko: "Changeprop doesn't use envoy sidecar for service discovery huh? Maybe we should just use that?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/629658 (owner: 10Hnowlan) [15:52:59] sure, but, the last-modified date *doesn't* necessarily correlate with that -- it's provided by mediawiki (based on the revisions) and doesn't indicate anything about caching [15:53:26] MatmaRex: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/varnish/templates/wikimedia-frontend.vcl.erb$804 [15:53:28] i imagine it's never more recent than the date when it was cached, though? [15:53:44] and "Wed, 09 Sep 2020 18:28:03 GMT" is the oldest date i got [15:54:03] 10Operations, 10Recommendation-API, 10serviceops: recommendation-api alerting and api errors - https://phabricator.wikimedia.org/T262587 (10crusnov) p:05Triage→03Medium [15:54:20] (03CR) 10Andrew Bogott: [C: 03+1] "thanks for this!" [puppet] - 10https://gerrit.wikimedia.org/r/629724 (owner: 10Arturo Borrero Gonzalez) [15:54:26] you can have a freshly-cached page (just fetched from the appservers) that has a last-modified date of two months ago [15:54:35] it's orthogonal [15:55:10] cdanis: thanks… although now i'm just more confused than before [15:55:29] (03PS5) 10Jbond: postgress::server: add Types and additinal options [puppet] - 10https://gerrit.wikimedia.org/r/629698 (https://phabricator.wikimedia.org/T263578) [15:55:31] Last-Modified is literally when the page was last edited [15:55:54] the Traffic layer will cache pages for up to 24 hours, but, they often get purged before that (e.g. 
when edited) [15:55:55] (03Merged) 10jenkins-bot: HookRunner: onAbuseFilterGenerateUserVars should run generateUserVars [extensions/AbuseFilter] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/629521 (https://phabricator.wikimedia.org/T263750) (owner: 10Jforrester) [15:56:44] cdanis: last-modified definitely does not match the edit dates in my testing. e.g. https://cs.wikipedia.org/wiki/Diskuse:Fal%C4%8Dtina [15:56:55] it has "last-modified: Thu, 10 Sep 2020 15:45:02 GMT" but the page was edited in 2010 [15:57:26] i would believe if you said it's a date of when it was invalidated by a template edit, but that page also has no templates [15:57:57] okay, you are right about that [15:58:05] it does appear to be the time of the cache miss occurring [15:58:07] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.10/extensions/AbuseFilter/includes/Hooks/AbuseFilterHookRunner.php: 5e88c36fa4111cde33dafb0d7ac31a854b95e5ea: HookRunner: onAbuseFilterGenerateUserVars should run generateUserVars (T263750) (duration: 01m 06s) [15:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:13] T263750: abuse filter matching global_user_groups doesn't work any more - https://phabricator.wikimedia.org/T263750 [15:58:14] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:58:59] cdanis: i have no idea where the last-modified date comes from, to be honest, so if you're saying it's only cached for 24 hours, then i'll believe you [15:59:10] dancy: T263750 deployed [15:59:29] 👍🏾 [15:59:38] i guess i can just check tomorrow to see if it changed, heh [15:59:40] (03CR) 10Hnowlan: "> Patch Set 4:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/629658 (owner: 10Hnowlan) [16:00:04] jbond42 and cdanis: (Dis)respected human, time 
to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200924T1600). Please do the needful. [16:00:09] so I am curious in what context you need to know this, MatmaRex, because for the case of normal edits and such, purges should take care of things relatively quickly [16:01:20] Hello. Someone is worried on #wikipedia-fr about the expired date of our Digicert certificate. [16:01:20] whatever max-age there is, it is also only a maximum; caches can discard low-popularity objects early for all sorts of reasons [16:01:24] notAfter=Oct 6 12:00:00 2020 GMT [16:01:33] Is this something in the radar? [16:01:34] cdanis: we deployed a config change this morning that adds a new ResourceLoader module to pages like the ones i linked (which is a change to the HTML output) and i'd like to know when it will take [16:01:56] (for *.wikipedia.org frontends) [16:01:58] manually purging pages (?action=purge) results in the expected output [16:01:59] dereckson: yes, we have monitoring for such, and there's an ongoing ticket about it (unfortunately private, because it's a procurement ticket with prices) [16:02:17] thanks cdanis to confirm:) [16:02:19] dereckson: we also have several backup certs we can switch to -- different datacenters serve different certs, to make sure they all stay working, but we have contingency plans [16:02:41] so even if the contract renewal doesn't happen in time, there's options :) [16:02:47] (you should be seeing buttons like [reply] or [odpovědět] next to talk page comments) [16:02:53] 10Operations, 10observability, 10serviceops: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976 (10crusnov) [16:03:03] MatmaRex: ah, was this changes to the JS or the generated page in Mediawiki or similar? 
[16:04:07] !log pfw3-eqiad> restart security-log gracefully [16:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:42] cdanis: no, the generated HTML page. effectively the "RLPAGEMODULES=" part in the source [16:05:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:06:06] (03PS1) 10HitomiAkane: Bug: T262218 Change-Id: I6e643081accbded564f29fa4ed6bde0e580aca21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629738 (https://phabricator.wikimedia.org/T262218) [16:06:30] MatmaRex: got it. those should all expire out of caches in 24 hours, yeah, I've verified in both the Varnish and ATS-BE configurations [16:06:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:07:22] cdanis: alright, that's great to know. thank you [16:07:49] cdanis: also, are these docs about this, and do they look right? https://wikitech.wikimedia.org/wiki/Varnish#TTL [16:08:10] PROBLEM - Host db2125.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:08:14] (03PS2) 10HitomiAkane: Creation of patroller group on arz.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629738 (https://phabricator.wikimedia.org/T262218) [16:08:37] (03PS6) 10Jbond: postgress::server: add Types and additinal options [puppet] - 10https://gerrit.wikimedia.org/r/629698 (https://phabricator.wikimedia.org/T263578) [16:08:41] MatmaRex: yep, ttl_cap is what we call the maximum in Varnish [16:08:53] (03CR) 10Ppchelko: [C: 03+1] "Let's leave it for the next round of improvement and fix the issue at hand.." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/629658 (owner: 10Hnowlan) [16:08:53] I also just did https://wikitech.wikimedia.org/w/index.php?title=Caching_overview&type=revision&diff=1882836&oldid=1876516 [16:09:32] thanks [16:11:22] 10Operations, 10ops-eqiad, 10serviceops: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (10RobH) After checkign with @Joe via irc, it seems this should automatically be added back into DSH and clear after the puppet run and repooling, but has not. All other checks green, but I'd like to... [16:11:26] (03CR) 10Hnowlan: [C: 03+2] changeprop/changeprop-jobqueue: increase memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/629658 (owner: 10Hnowlan) [16:12:02] (03CR) 10Volans: [C: 03+1] "LGTM, I've allocated them on Netbox too." [dns] - 10https://gerrit.wikimedia.org/r/629723 (https://phabricator.wikimedia.org/T261512) (owner: 10Hnowlan) [16:12:07] (03CR) 10Jbond: [C: 03+2] postgress::server: add Types and additinal options [puppet] - 10https://gerrit.wikimedia.org/r/629698 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [16:12:59] (03CR) 10CRusnov: [C: 03+2] interface_automation: Fix bugs in primary assignment logic [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/629737 (owner: 10CRusnov) [16:13:17] 10Operations, 10Editing-team, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Keegan) At least one message (of 111) I sent out yesterday duplicated, on the Italian Wikipedia ( 10Operations, 10ops-eqiad, 10serviceops: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (10Volans) It looks it's marked as inactive on conftool: ` $ confctl select 'name=mw1360.eqiad.wmnet' get {"mw1360.eqiad.wmnet": {"weight": 30, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=api_apps... 
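(Editor's sketch of the TTL point discussed above. This is not the actual Varnish or ATS-BE configuration. It only assumes what was said in the channel: a cache-side cap, called ttl_cap in Varnish, bounds object lifetime at 24 hours, and any max-age is a maximum rather than a guarantee. The names `TTL_CAP_SECONDS` and `effective_ttl` are hypothetical.)

```python
# Assumption: a 24-hour cap, as stated in the conversation; the constant name is made up.
TTL_CAP_SECONDS = 24 * 60 * 60


def effective_ttl(origin_max_age: int, ttl_cap: int = TTL_CAP_SECONDS) -> int:
    """Return an upper bound, in seconds, on how long a cache may keep an object.

    The cache never keeps an object longer than the cap, whatever max-age the
    origin sent. The result is only a maximum: a cache may still evict
    low-popularity objects earlier.
    """
    return max(0, min(origin_max_age, ttl_cap))


if __name__ == "__main__":
    # An origin asking for 14 days is capped at 24 hours.
    print(effective_ttl(14 * 24 * 60 * 60))
    # An origin asking for 1 hour is left alone.
    print(effective_ttl(60 * 60))
```

Under this assumption, a change to generated HTML should be visible everywhere once the cap has elapsed, and sooner for pages that are purged or evicted early.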
[16:13:43] (03CR) 10BBlack: Extend NEL to group1 wikis; lower sampling rate to 5% (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629717 (https://phabricator.wikimedia.org/T257527) (owner: 10CDanis) [16:13:46] RECOVERY - Host db2125.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.03 ms [16:13:52] (03CR) 10Cwhite: "I like the approach. It has the benefit of fewer lines of code than the current implementation and is clearer about what constitutes vali" [puppet] - 10https://gerrit.wikimedia.org/r/626723 (owner: 10Jbond) [16:13:57] (03Merged) 10jenkins-bot: changeprop/changeprop-jobqueue: increase memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/629658 (owner: 10Hnowlan) [16:15:29] (03PS5) 10CDanis: Extend NEL to group1 wikis; lower sampling rate to 5% [puppet] - 10https://gerrit.wikimedia.org/r/629717 (https://phabricator.wikimedia.org/T257527) [16:16:46] (03CR) 10BBlack: [C: 03+1] Extend NEL to group1 wikis; lower sampling rate to 5% [puppet] - 10https://gerrit.wikimedia.org/r/629717 (https://phabricator.wikimedia.org/T257527) (owner: 10CDanis) [16:18:11] (03CR) 10CDanis: [C: 03+2] "0 tests failed, 0 tests skipped, 22 tests passed" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629717 (https://phabricator.wikimedia.org/T257527) (owner: 10CDanis) [16:18:48] PROBLEM - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:56] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . 
[16:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:00] RECOVERY - Check systemd state on deploy1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:22:51] (03PS1) 10Mholloway: Update chromium-render to 2020-09-24-145544-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/629748 (https://phabricator.wikimedia.org/T262890) [16:23:08] (03CR) 10Mholloway: [C: 04-1] "Hold for deployment window" [deployment-charts] - 10https://gerrit.wikimedia.org/r/629748 (https://phabricator.wikimedia.org/T262890) (owner: 10Mholloway) [16:23:57] (03CR) 10Jbond: "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/626723 (owner: 10Jbond) [16:26:22] !log properly pooled mw1360 this time T262151 [16:26:24] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . [16:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:29] T262151: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 [16:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:50] 10Operations, 10serviceops, 10User-WDoran, 10User-brennen: Canaries canaries canaries - https://phabricator.wikimedia.org/T210143 (10brennen) [16:36:05] 10Operations, 10ops-eqiad, 10serviceops: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (10RobH) 05Open→03Resolved Ok, all is now green for the host in icinga and it shows in pooled/in service state. [16:41:04] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Papaul) main board replaced and upgrade BIOS and IDRAC on the new board. @Kormat you can repool the server and resolve this task for now when... 
[16:46:56] 10Operations, 10ops-codfw, 10serviceops: decom wtp2005 (was: wtp2005 hardware issue) - https://phabricator.wikimedia.org/T257903 (10Papaul) 05Open→03Resolved Disks removed from server and unrack [16:48:26] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:49:03] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/626660 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [16:51:09] nice the check works! [16:54:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:55:58] (03PS1) 10Hnowlan: changeprop/cpjobqueue: change memory limit upwards. [deployment-charts] - 10https://gerrit.wikimedia.org/r/629752 [16:56:05] !log volans@cumin1001 START - Cookbook sre.dns.netbox [16:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:00:03] (03PS1) 10Jbond: wmcs::postgres: fix type [puppet] - 10https://gerrit.wikimedia.org/r/629753 [17:00:05] chrisalbon and accraze: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200924T1700). 
[17:00:05] (03PS1) 10Jbond: puppetmaste::puppetdb: add passthrough params to puppetdb postgres config [puppet] - 10https://gerrit.wikimedia.org/r/629754 [17:00:17] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:26] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . [17:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:03] (03CR) 10Jbond: [C: 03+2] wmcs::postgres: fix type [puppet] - 10https://gerrit.wikimedia.org/r/629753 (owner: 10Jbond) [17:02:33] (03CR) 10Jbond: [C: 03+2] puppetmaste::puppetdb: add passthrough params to puppetdb postgres config [puppet] - 10https://gerrit.wikimedia.org/r/629754 (owner: 10Jbond) [17:02:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:04:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:04:32] (03CR) 10Ppchelko: [C: 03+1] changeprop/cpjobqueue: change memory limit upwards. [deployment-charts] - 10https://gerrit.wikimedia.org/r/629752 (owner: 10Hnowlan) [17:04:53] !log syncing facts to puppet compiler hosts [17:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:00] (03CR) 10Mholloway: [C: 03+2] Update chromium-render to 2020-09-24-145544-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/629748 (https://phabricator.wikimedia.org/T262890) (owner: 10Mholloway) [17:06:38] (03CR) 10Hnowlan: [C: 03+2] changeprop/cpjobqueue: change memory limit upwards. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/629752 (owner: 10Hnowlan) [17:07:37] (03Merged) 10jenkins-bot: Update chromium-render to 2020-09-24-145544-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/629748 (https://phabricator.wikimedia.org/T262890) (owner: 10Mholloway) [17:08:06] (03PS1) 10Jcrespo: Revert "mariadb-backups: Reorganize remote backups (snapshots) for speedup" [puppet] - 10https://gerrit.wikimedia.org/r/629525 [17:08:58] (03Merged) 10jenkins-bot: changeprop/cpjobqueue: change memory limit upwards. [deployment-charts] - 10https://gerrit.wikimedia.org/r/629752 (owner: 10Hnowlan) [17:09:03] !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [17:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:47] (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb-backups: Reorganize remote backups (snapshots) for speedup" [puppet] - 10https://gerrit.wikimedia.org/r/629525 (owner: 10Jcrespo) [17:11:17] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [17:11:17] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [17:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:45] (03PS1) 10Jbond: puppetmaster::puppetdb: add postgress logging information [puppet] - 10https://gerrit.wikimedia.org/r/629756 (https://phabricator.wikimedia.org/T263578) [17:11:51] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'main' . 
[17:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:29] (03PS1) 10JMeybohm: Revert "services_proxy: switch zotero to the TLS endpoint" [puppet] - 10https://gerrit.wikimedia.org/r/629766 [17:13:53] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Revert "services_proxy: switch zotero to the TLS endpoint" [puppet] - 10https://gerrit.wikimedia.org/r/629766 (owner: 10JMeybohm) [17:14:12] (03CR) 10Jbond: [C: 03+2] puppetmaster::puppetdb: add postgress logging information [puppet] - 10https://gerrit.wikimedia.org/r/629756 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [17:14:23] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [17:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:31] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'production' . [17:14:35] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [17:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:04] !log disable puppet fleet wide to update puppetdb postgres loggin [17:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:53] !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [17:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:06] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . [17:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:17] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . 
[17:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:09] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' . [17:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:00] !log enable puppet fleet wide post update puppetdb postgres logging [17:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:52] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:24:23] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' . [17:24:24] nice recovers too, all works [17:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:11] still trying to sync the puppet facts to all compilers.. [17:26:09] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/629758 [17:26:11] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [deployment-charts] - 10https://gerrit.wikimedia.org/r/629758 (owner: 10PipelineBot) [17:26:20] 10Puppet, 10Patch-For-Review: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) I have added the following settings to postgresql on puppetdb1002 which should allow us to [[ https://puppet.com/docs/puppetdb/6.0/pdb_support_guide.html | dig further into slow querie... [17:28:26] arrr. 
the sync script has hardcoded hostnames in it and those don't work anymore [17:29:05] the example from the docs sets the host names but that won't work if they are set again in the script [17:29:43] (03CR) 10Cwhite: [C: 03+1] "> Patch Set 7:" [puppet] - 10https://gerrit.wikimedia.org/r/626723 (owner: 10Jbond) [17:39:42] (03CR) 10Dzahn: [V: 03+1 C: 03+2] syslog::centralserver: convert role to profile, fix lint issues [puppet] - 10https://gerrit.wikimedia.org/r/628973 (owner: 10Dzahn) [17:39:44] (03CR) 10Thcipriani: "🎉" [deployment-charts] - 10https://gerrit.wikimedia.org/r/629758 (owner: 10PipelineBot) [17:41:32] (03CR) 10Dzahn: "confirmed NOOP in prod - centralllog2001, then centrallog1001" [puppet] - 10https://gerrit.wikimedia.org/r/628973 (owner: 10Dzahn) [17:45:23] (03PS2) 10Dzahn: kafka::certificate: add data types, hiera()->lookup() [puppet] - 10https://gerrit.wikimedia.org/r/628969 [17:48:19] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/25411/" [puppet] - 10https://gerrit.wikimedia.org/r/628969 (owner: 10Dzahn) [17:50:58] (03CR) 10Dzahn: "confirmed NOOP in prod, cp3052, cp3059, cp2027" [puppet] - 10https://gerrit.wikimedia.org/r/628969 (owner: 10Dzahn) [17:53:23] (03CR) 10Tchanders: Enable Special:Investigate on itwiki and svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627481 (owner: 10Tchanders) [17:54:49] (03PS2) 10Tchanders: Enable Special:Investigate on itwiki and svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627481 [17:57:07] (03CR) 10Aezell: [C: 03+1] Enable Special:Investigate on itwiki and svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627481 (owner: 10Tchanders) [17:57:21] (03CR) 10Aezell: [C: 03+1] Enable mobile block notice tracking in MobileFrontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629691 (https://phabricator.wikimedia.org/T260218) (owner: 10Tchanders) [17:57:33] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [17:57:34] !log dzahn@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:57:39] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [17:57:39] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:44] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [17:57:45] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:03] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/25412/install3001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/629495 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [17:58:09] (03PS2) 10Dzahn: site: add installserver::light role to new install servers [puppet] - 10https://gerrit.wikimedia.org/r/629495 (https://phabricator.wikimedia.org/T252526) [18:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200924T1800). [18:00:04] Tchanders: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:39] (03CR) 10Bstorm: [C: 03+1] openstack: dns_floating_ip_updater: only run in one control plane node [puppet] - 10https://gerrit.wikimedia.org/r/629724 (owner: 10Arturo Borrero Gonzalez) [18:00:42] (03PS1) 10Ottomata: refine - Exclude CitationUsage; schema has been deleted [puppet] - 10https://gerrit.wikimedia.org/r/629763 [18:01:44] !log temp. disabled puppet on install4001/install5001 - applying install_server role to new servers, starting with install3001 [18:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:15] (03CR) 10Ottomata: [C: 03+2] refine - Exclude CitationUsage; schema has been deleted [puppet] - 10https://gerrit.wikimedia.org/r/629763 (owner: 10Ottomata) [18:04:10] (03CR) 10Dzahn: "thanks for merging" [puppet] - 10https://gerrit.wikimedia.org/r/629427 (owner: 10Dzahn) [18:07:14] Hey, is anyone around to do the deployment window? [18:08:05] 10Operations, 10Gerrit, 10LDAP-Access-Requests: Add hashar to `archiva-deployers` LDAP group - https://phabricator.wikimedia.org/T263721 (10thcipriani) Approved as #together manager [18:18:43] Tchanders: I can do that [18:19:12] Urbanecm: Thanks [18:19:16] Tchanders: aren't you a deployer though? [18:20:16] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10serviceops, and 2 others: Move feed assembly from RESTBase to Wikifeeds - https://phabricator.wikimedia.org/T263133 (10Pchelolo) [18:20:21] (03CR) 10Urbanecm: [C: 03+2] Enable Special:Investigate on itwiki and svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627481 (owner: 10Tchanders) [18:20:22] Urbanecm: Technically yes, in that I've been shown how and given the access... 
but I haven't had a chance to actually do it under someone's guidance yet [18:20:27] 10Operations, 10Gerrit, 10LDAP-Access-Requests: Add hashar to `archiva-deployers` LDAP group - https://phabricator.wikimedia.org/T263721 (10hashar) + @Reedy and @jrbs who are LDAP admins and should have the proper sudo rule to add a user to a group ( https://wikitech.wikimedia.org/wiki/LDAP#Add_a_user_to_a_g... [18:20:53] Tchanders: I can supervise your first deployment if you want me to :) [18:21:19] (03Merged) 10jenkins-bot: Enable Special:Investigate on itwiki and svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627481 (owner: 10Tchanders) [18:21:24] (I am also happy to deploy it myself, it's up to you Tchanders) [18:23:00] 10Puppet, 10Patch-For-Review: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) Onething i have noticed is that the kubernetes hosts often take a long time, looking at there facts i notice that they have a large set for both the partisions and mountpoint facts with... [18:23:14] I'll be starting the rollout of 1.36.0-wmf.10 to group2 in about 40 minutes. [18:23:23] dancy: ack [18:23:44] Urbanecm: That would be great actually! Do you think we could set something up for next time? [18:24:12] Tchanders: I'm happy to do it right now, with those two patches :-) [18:24:34] Urbanecm: OK, if you have time! [18:24:39] absolutely :) [18:25:16] Tchanders: instructions are at https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers, feel free to ask me any questions [18:26:16] Tchanders: note that because we run via codfw, you need to use mwdebug2xxx.codfw.wmnet hosts [18:26:26] (but you still need to use deploy1001, just as usually) [18:29:37] Tchanders: just saw your PM, I'd rather keep all conversation about a deployment in this channel. [18:29:52] anyway, I'm waiting [18:30:10] Sure, just letting you know I'm taking some time to familiarise myself again... 
[18:30:19] that's fine :) [18:35:27] Urbanecm: Re your comment about codfw, are you saying that I should replace eqiad.wmnet with codfw.wmnet in all steps of the instructions? [18:36:11] Tchanders: you need to replace whole machine names. In the number of each host, the first digit always refers to the DC (1 is eqiad, 2 is codfw) [18:36:48] so, the currently-active deployment host is deploy1001.eqiad.wmnet, but the currently-active debug host is mwdebug2001.codfw.wmnet (and mwdebug2002.codfw.wmnet, it's up to you which one you use) [18:37:19] so instead of ssh'ing to mwdebug1002 and then running scap pull as instructions say, you need to ssh to mwdebug2001 (or mwdebug2002) and run scap pull there [18:37:32] Tchanders: does that answer your question? [18:37:50] Urbanecm: I think so [18:37:55] cool [18:39:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={squid,swagger_check_restbase_esams} site={eqsin,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:39:47] Urbanecm: So I'm going to use deployment.eqiad.wmnet and mwdebug2001.codfw.wmnet (or 2)? [18:39:55] exactly [18:42:44] (03PS3) 10Ryan Kemper: [wcqs] add favicon for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/624704 (https://phabricator.wikimedia.org/T258835) (owner: 10DCausse) [18:45:20] (03CR) 10Ryan Kemper: [C: 03+2] [wcqs] add favicon for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/624704 (https://phabricator.wikimedia.org/T258835) (owner: 10DCausse) [18:48:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:50:05] Tchanders: just asking, does everything go all right? 
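(Editor's sketch of the host naming rule explained above: the first digit of the number in a host name identifies the datacenter. Only the two mappings stated in the channel, 1 for eqiad and 2 for codfw, are included. The function name `datacenter_for` and the table `DC_BY_DIGIT` are hypothetical, and other digits are deliberately left unmapped.)

```python
import re
from typing import Optional

# Only the mappings mentioned in the conversation; anything else is unknown here.
DC_BY_DIGIT = {"1": "eqiad", "2": "codfw"}


def datacenter_for(hostname: str) -> Optional[str]:
    """Guess the datacenter from the first digit of the host's trailing number.

    Returns None when the short host name has no trailing number (for example
    a service alias) or when the digit is not in DC_BY_DIGIT.
    """
    short_name = hostname.split(".", 1)[0]        # "mwdebug2001.codfw.wmnet" -> "mwdebug2001"
    match = re.search(r"(\d)\d*$", short_name)    # capture the first digit of the trailing number
    if match is None:
        return None
    return DC_BY_DIGIT.get(match.group(1))


if __name__ == "__main__":
    for host in ("deploy1001.eqiad.wmnet", "mwdebug2001.codfw.wmnet", "mwdebug2002.codfw.wmnet"):
        print(host, "->", datacenter_for(host))
```

This matches the advice in the channel: the deployment host stays deploy1001 (eqiad), while the debug host to use is mwdebug2001 or mwdebug2002 (codfw).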
[18:50:26] Urbanecm: Halfway through the first one, change looks good [18:50:35] ack, thanks Tchanders :) [18:58:24] !log tchanders@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:627481|Enable Special:Investigate on itwiki and svwiki (T262436)]] (duration: 01m 05s) [18:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:31] T262436: Deploy Special:Investigate to Spanish, Swedish and Italian wikipedias - https://phabricator.wikimedia.org/T262436 [18:59:14] cool! [19:00:02] (03PS3) 10Urbanecm: Enable mobile block notice tracking in MobileFrontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629691 (https://phabricator.wikimedia.org/T260218) (owner: 10Tchanders) [19:00:04] dancy and twentyafterfour: (Dis)respected human, time to deploy Mediawiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200924T1900). Please do the needful. [19:00:09] dancy: please hold the train a bit [19:00:23] Ok. Message me when you're done. [19:00:32] Tchanders: congratulations on your first deployment 🙂 [19:00:36] Urbanecm: Thanks for the help! [19:00:40] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10Nuria) assigning to @mforns [19:00:49] Congrats Tchanders! [19:01:14] Tchanders: what should we do with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/629691? I can deploy that quickly, or if dancy (as the train conductor) is fine with that, you can do that too 🙂 [19:01:19] (we can also re-schedule) [19:01:22] dancy: Thanks! I don't need to hold up the train with the second one - happy to do it in a later window... 
[19:01:37] Urbanecm: Or you can do it quickly - I don't mind [19:01:46] (03CR) 10Urbanecm: [C: 03+2] Enable mobile block notice tracking in MobileFrontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629691 (https://phabricator.wikimedia.org/T260218) (owner: 10Tchanders) [19:02:06] should be quick enough :) [19:02:21] Urbanecm: So you or me? [19:02:28] I'll do it :) [19:02:31] (03Merged) 10jenkins-bot: Enable mobile block notice tracking in MobileFrontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629691 (https://phabricator.wikimedia.org/T260218) (owner: 10Tchanders) [19:02:33] Cool, thanks [19:03:28] it's syncing :) [19:04:05] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: bcf9fcbe3b82ab85b8f97206ceca45b64619c362: Enable mobile block notice tracking in MobileFrontend (T260218) (duration: 01m 04s) [19:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:10] Tchanders: should be live [19:04:11] T260218: Add logging for measuring impact of our work on improving the mobile block messages - https://phabricator.wikimedia.org/T260218 [19:04:16] dancy: I'm done, thanks for your patience [19:04:31] No prob. Starting group2 rollout now. [19:04:47] 10Puppet, 10Patch-For-Review: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) Just as a data point maps2007 which is the server in the initial report does not have an overly large factset Also noticed that `/var/lib/puppetdb/stockpile/cmd` directory on puppetdb2... 
[19:05:06] Urbanecm: Thanks [19:05:53] (03PS1) 10Ahmon Dancy: group2 wikis to 1.36.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629794 [19:05:55] (03CR) 10Ahmon Dancy: [C: 03+2] group2 wikis to 1.36.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629794 (owner: 10Ahmon Dancy) [19:06:19] np Tchanders [19:06:35] (03Merged) 10jenkins-bot: group2 wikis to 1.36.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629794 (owner: 10Ahmon Dancy) [19:07:50] (03CR) 10Dzahn: mail::smarthost::wmcs: convert role to profile, fix lint issues (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/628972 (owner: 10Dzahn) [19:08:11] (03PS2) 10Dzahn: mail::smarthost::wmcs: convert role to profile, fix lint issues [puppet] - 10https://gerrit.wikimedia.org/r/628972 [19:08:14] !log dancy@deploy1001 rebuilt and synchronized wikiversions files: group2 wikis to 1.36.0-wmf.10 [19:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:26] (03CR) 10Jbond: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/628972 (owner: 10Dzahn) [19:19:20] (03CR) 10Dzahn: "hosts using this: https://openstack-browser.toolforge.org/puppetclass/role::mail::smarthost::wmcs" [puppet] - 10https://gerrit.wikimedia.org/r/628972 (owner: 10Dzahn) [19:20:09] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/25417/mx-out01.cloudinfra.eqiad.wmflabs/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/628972 (owner: 10Dzahn) [19:21:58] (03CR) 10Dzahn: [V: 03+1 C: 03+2] mail::smarthost::wmcs: convert role to profile, fix lint issues [puppet] - 10https://gerrit.wikimedia.org/r/628972 (owner: 10Dzahn) [19:23:42] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.2722 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [19:25:47] (03CR) 10Dzahn: "confirmed on mx-out02.cloudinfra - this changed nothing - totally unrelated this host does changes on every 
run, related to NRPE, rsyslog." [puppet] - 10https://gerrit.wikimedia.org/r/628972 (owner: 10Dzahn) [19:28:39] (03CR) 10Dzahn: "same on mx-out01. this change is NOOP but unrelated things get repeated every run it looks" [puppet] - 10https://gerrit.wikimedia.org/r/628972 (owner: 10Dzahn) [19:29:33] !log andrew@deploy1001 Started deploy [horizon/deploy@e5890b9]: dev [19:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:17] !log andrew@deploy1001 Finished deploy [horizon/deploy@e5890b9]: dev (duration: 00m 44s) [19:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:44] (03PS1) 10Ebernhardson: Revert "cloudelastic: envoy sits in front now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629771 [19:33:45] twentyafterfour: ^^ needs to ship out, can you let me know after train is done? [19:33:57] (03PS1) 10Mholloway: Update wikifeeds to 2020-09-24-191356-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/629803 (https://phabricator.wikimedia.org/T263133) [19:38:27] !log andrew@deploy1001 Started deploy [horizon/deploy@e5890b9]: (no justification provided) [19:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:36] !log andrew@deploy1001 Finished deploy [horizon/deploy@e5890b9]: (no justification provided) (duration: 01m 08s) [19:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:33] ebernhardson: I'm backup conductor. 
cc dancy ^^^ [19:41:20] !log andrew@deploy1001 Started deploy [horizon/deploy@e5890b9]: (no justification provided) [19:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:56] !log andrew@deploy1001 Finished deploy [horizon/deploy@e5890b9]: (no justification provided) (duration: 00m 36s) [19:41:59] ebernhardson: it's probably ok to go now but I'd like Dancy to weigh in on that [19:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:32] ebernhardson: Train is done. Everything looks good so have at it. [19:43:16] dancy: alright, thanks! [19:44:19] (03CR) 10Ebernhardson: [C: 03+2] "unbreak" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629771 (owner: 10Ebernhardson) [19:44:20] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [19:45:06] (03Merged) 10jenkins-bot: Revert "cloudelastic: envoy sits in front now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629771 (owner: 10Ebernhardson) [19:47:25] !log ebernhardson@deploy1001 Synchronized wmf-config/ProductionServices.php: Revert: cloudelastic: envoy sits in front now (duration: 00m 59s) [19:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:34] 10Operations, 10Parsing-Team, 10Platform Engineering, 10TechCom, and 5 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10daniel) [19:47:54] 10Operations, 10Parsing-Team, 10TechCom, 10serviceops, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10daniel) [19:48:40] 10Operations, 10Parsing-Team, 10TechCom, 10serviceops, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - 
https://phabricator.wikimedia.org/T244058 (10daniel) a:05holger.knust→03None [19:48:44] all done [19:48:57] (03PS3) 10HitomiAkane: Creation of patroller group on arz.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629738 (https://phabricator.wikimedia.org/T262218) [19:49:15] (03CR) 10Mholloway: [C: 03+2] Update wikifeeds to 2020-09-24-191356-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/629803 (https://phabricator.wikimedia.org/T263133) (owner: 10Mholloway) [19:51:40] (03Merged) 10jenkins-bot: Update wikifeeds to 2020-09-24-191356-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/629803 (https://phabricator.wikimedia.org/T263133) (owner: 10Mholloway) [19:53:20] 10Operations, 10Traffic: backport ipvsadm>=1.30 to buster-wikimedia or buster-backports - https://phabricator.wikimedia.org/T263788 (10CDanis) [19:53:37] 10Operations, 10Parsing-Team, 10TechCom, 10serviceops, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10daniel) Looking into this some more, we came across a number of issues, namely: * Diffs and permalinks don... [19:53:50] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.4533 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [19:54:15] !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . 
[19:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:55:24] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (10Krinkle) 05Open→03Stalled Pending feedback or confirmation from trwiki editors. [19:55:39] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [19:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:18] 10Operations, 10Analytics-Clusters, 10decommission-hardware: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (10RobH) I'm removing the #ops-eqiad tag, as this is hurting their open task metrics when its never actually been within their ability to move this forward. When thi... [19:57:22] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [19:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:50] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:07:12] 10Operations, 10Machine Learning Platform, 10ORES: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331 (10ACraze) [20:09:34] 10Operations, 10Traffic: Wikidough: Upgrade to dnsdist 1.5.0 - https://phabricator.wikimedia.org/T263789 (10ssingh) [20:12:21] 10Operations, 10Traffic: Wikidough: Upgrade to dnsdist 1.5.0 - https://phabricator.wikimedia.org/T263789 (10ssingh) [20:12:26] 10Operations, 10Traffic, 10Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (10ssingh) [20:12:39] 10Operations, 10Traffic: Wikidough: Upgrade to dnsdist 1.5.0 - https://phabricator.wikimedia.org/T263789 (10ssingh) p:05Triage→03Medium a:03ssingh [20:30:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:33:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:34:46] !log andrew@deploy1001 Started deploy [horizon/deploy@85125d1]: (no justification provided) [20:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:38] !log andrew@deploy1001 Finished deploy [horizon/deploy@85125d1]: (no justification provided) (duration: 00m 52s) [20:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:44] !log andrew@deploy1001 Started deploy [horizon/deploy@24368a5]: (no justification provided) [20:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:39] 10Operations, 10Puppet: Upgrade Puppet to 5.5.21 - https://phabricator.wikimedia.org/T248168 (10MoritzMuehlenhoff) [20:41:53] !log andrew@deploy1001 Finished deploy [horizon/deploy@24368a5]: (no justification provided) (duration: 02m 10s) [20:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:59] 10Operations, 10Puppet: Upgrade Puppet to 5.5.21 - https://phabricator.wikimedia.org/T248168 (10MoritzMuehlenhoff) In the mean time Debian unstable moved to 5.5.21: https://puppet.com/docs/puppet/5.5/release_notes.html#puppet-5521 [20:46:06] (03PS1) 10Ebernhardson: cirrus: temporarily disable saneitizer [puppet] - 10https://gerrit.wikimedia.org/r/629820 (https://phabricator.wikimedia.org/T263073) [20:46:57] (03CR) 10Ebernhardson: "It's not clear if anything else has to be done to remove a periodic_job. 
The only example i can find in git is referenced at https://phabr" [puppet] - 10https://gerrit.wikimedia.org/r/629820 (https://phabricator.wikimedia.org/T263073) (owner: 10Ebernhardson) [20:47:48] (03PS7) 10Dzahn: cache::base: replace hiera() with lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/623662 [20:48:02] (03CR) 10Dzahn: "that's what happens If i don't merge fast enough, this change quite a bit and needed manual rebase" [puppet] - 10https://gerrit.wikimedia.org/r/623662 (owner: 10Dzahn) [20:50:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:52:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:59:46] !log andrew@deploy1001 Started deploy [horizon/deploy@404e205]: (no justification provided) [20:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:51] !log andrew@deploy1001 Finished deploy [horizon/deploy@404e205]: (no justification provided) (duration: 01m 05s) [21:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:14] !log reprepro: add backported ipvsadm 1:1.31-1+deb10u1 to buster-wikimedia [21:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:58] (03PS2) 10Ebernhardson: cirrus: temporarily disable saneitizer [puppet] - 10https://gerrit.wikimedia.org/r/629820 (https://phabricator.wikimedia.org/T263073) [21:10:27] (03PS2) 10Dzahn: oozie: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/629443 [21:11:25] (03CR) 10Dzahn: [C: 04-2] 
"https://puppet-compiler.wmflabs.org/compiler1003/25420/cp1082.eqiad.wmnet/change.cp1082.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/623662 (owner: 10Dzahn) [21:13:18] 10Operations, 10Traffic: backport ipvsadm>=1.30 to buster-wikimedia or buster-backports - https://phabricator.wikimedia.org/T263788 (10CDanis) Turns out this was trivial. `1:1.31-1+deb10u1` is now in buster-wikimedia. I'll test on some backup LVS machine tomorrow or early next week. [21:14:01] 10Operations, 10Traffic: Switch to Maglev hashing ('mh') on LVS hosts - https://phabricator.wikimedia.org/T263797 (10CDanis) [21:14:08] 10Operations, 10Traffic: backport ipvsadm>=1.30 to buster-wikimedia or buster-backports - https://phabricator.wikimedia.org/T263788 (10CDanis) [21:14:10] 10Operations, 10Traffic: Switch to Maglev hashing ('mh') on LVS hosts - https://phabricator.wikimedia.org/T263797 (10CDanis) [21:14:23] (03PS8) 10Dzahn: cache::base: replace hiera() with lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/623662 [21:14:36] (03CR) 10Ebernhardson: "pcc: https://puppet-compiler.wmflabs.org/compiler1002/25421/" [puppet] - 10https://gerrit.wikimedia.org/r/629820 (https://phabricator.wikimedia.org/T263073) (owner: 10Ebernhardson) [21:17:54] (03CR) 10Dzahn: "+1 to adding the ensure parameter and the data type for it. but it looks like you are not actually using it in the other file?" [puppet] - 10https://gerrit.wikimedia.org/r/629820 (https://phabricator.wikimedia.org/T263073) (owner: 10Ebernhardson) [21:18:10] (03CR) 10Effie Mouzeli: [C: 04-2] "Waiting on T263683" [puppet] - 10https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T262202) (owner: 10Effie Mouzeli) [21:22:11] (03PS3) 10Ebernhardson: cirrus: temporarily disable saneitizer [puppet] - 10https://gerrit.wikimedia.org/r/629820 (https://phabricator.wikimedia.org/T263073) [21:22:34] (03CR) 10Ebernhardson: "oh indeed, cirrussearch.pp updated appropriately." 
[puppet] - 10https://gerrit.wikimedia.org/r/629820 (https://phabricator.wikimedia.org/T263073) (owner: 10Ebernhardson) [21:26:07] (03PS4) 10Ebernhardson: cirrus: temporarily disable saneitizer [puppet] - 10https://gerrit.wikimedia.org/r/629820 (https://phabricator.wikimedia.org/T263073) [21:26:09] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/25422/" [puppet] - 10https://gerrit.wikimedia.org/r/629820 (https://phabricator.wikimedia.org/T263073) (owner: 10Ebernhardson) [21:26:27] (03PS1) 10Ryan Kemper: cloudelastic: expose psi/omega thru lvs [puppet] - 10https://gerrit.wikimedia.org/r/629829 (https://phabricator.wikimedia.org/T263073) [21:28:51] (03CR) 10Dzahn: [C: 03+2] cirrus: temporarily disable saneitizer [puppet] - 10https://gerrit.wikimedia.org/r/629820 (https://phabricator.wikimedia.org/T263073) (owner: 10Ebernhardson) [21:33:48] PROBLEM - Check the last execution of mediawiki_job_cirrus_sanitize_jobs on mwmaint2001 is CRITICAL: NRPE: Command check_check_mediawiki_job_cirrus_sanitize_jobs_status not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:34:13] meh.. ok... 
[21:35:22] (03CR) 10Dzahn: "ran puppet on mwmaint1002 and mwmaint2001 - saw it remove a bunch of cirrus-sanitize resources but no other timers" [puppet] - 10https://gerrit.wikimedia.org/r/629820 (https://phabricator.wikimedia.org/T263073) (owner: 10Ebernhardson) [21:36:12] ACKNOWLEDGEMENT - Check the last execution of mediawiki_job_cirrus_sanitize_jobs on mwmaint2001 is CRITICAL: NRPE: Command check_check_mediawiki_job_cirrus_sanitize_jobs_status not defined daniel_zahn disabled timer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:39:04] RECOVERY - Check systemd state on mw1349 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:40:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:40:27] !log mw1349 - systemctl reset-failed [21:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:34] (03PS1) 10Effie Mouzeli: WIP mcrouter: install ohhost memcached on MediaWiki servers [puppet] - 10https://gerrit.wikimedia.org/r/629830 (https://phabricator.wikimedia.org/T244340) [21:41:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:41:32] (03CR) 10jerkins-bot: [V: 04-1] WIP mcrouter: install ohhost memcached on MediaWiki servers [puppet] - 10https://gerrit.wikimedia.org/r/629830 (https://phabricator.wikimedia.org/T244340) (owner: 10Effie Mouzeli) [21:43:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:44:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:52:35] 10Operations, 10Machine Learning Platform, 10ORES: [Discuss] ORES without celery - https://phabricator.wikimedia.org/T216838 (10ACraze) 05Open→03Resolved a:03ACraze Marking this as resolved as we understand the limitations that celery brings with the current design of ORES and it will not be included d... [21:57:20] 10Operations, 10Product-Infrastructure-Data, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10Nuria) Correct, I did not included chrome mobile which is about 17.8% of pageviews... [22:04:35] 10Operations, 10Analytics: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10Nuria) @JAllemandou I think adding geo info (or rather swapping IP by Geo info ) is something that would need to happen in this case (in the absence of stream processing b... 
[22:36:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:36:42] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/25424/cp1079.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/623662 (owner: 10Dzahn) [22:37:52] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/25424/cp1079.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/623662 (owner: 10Dzahn) [22:39:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:39:57] (03PS1) 10Dzahn: start DHCP service on install3001 [puppet] - 10https://gerrit.wikimedia.org/r/629849 [22:44:07] (03PS1) 10Dzahn: installserver: remove dc-ops admin group from new hosts, applied via role [puppet] - 10https://gerrit.wikimedia.org/r/629852 [22:44:45] (03CR) 10Dzahn: [C: 03+2] installserver: remove dc-ops admin group from new hosts, applied via role [puppet] - 10https://gerrit.wikimedia.org/r/629852 (owner: 10Dzahn) [22:44:54] (03PS1) 10Gergő Tisza: Implement .well-known/change-password redirect on Wikimedia sites [puppet] - 10https://gerrit.wikimedia.org/r/629853 (https://phabricator.wikimedia.org/T263800) [22:45:03] (03PS1) 10Reedy: Prepare ExtensionDistributor for REL1_35 stable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629854 [22:52:04] (03PS1) 10Jeena Huneidi: Remove pipelinebuilder role and builder profile [puppet] - 10https://gerrit.wikimedia.org/r/629863 [23:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200924T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:01:09] (03CR) 10Urbanecm: [C: 03+1] "good idea, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/629853 (https://phabricator.wikimedia.org/T263800) (owner: 10Gergő Tisza) [23:25:18] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:28:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:37:46] !log andrew@deploy1001 Started deploy [horizon/deploy@7b61460]: (no justification provided) [23:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:44] !log andrew@deploy1001 Finished deploy [horizon/deploy@7b61460]: (no justification provided) (duration: 01m 58s) [23:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:49:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:56:09] 10Operations, 10MW-on-K8s, 10TechCom-RFC, 10serviceops, 10Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10tstarling)