[00:04:06] 10SRE, 10SRE-tools: sre.hosts.decomission -> generate_dns_snippets - > Cumin execution failed - https://phabricator.wikimedia.org/T274689 (10crusnov) Correct me if I'm wrong, but is this the vm that is replacing another vm of the same name? There might be some assumptions about the connectivity in netbox of th...
[00:05:16] 10SRE, 10SRE-tools: sre.hosts.decomission -> generate_dns_snippets - > Cumin execution failed - https://phabricator.wikimedia.org/T274689 (10Dzahn) It's just an attempt to remove an existing VM. (after reimaging the existing VM resulted in it not coming back from reboot)
[00:06:36] (03CR) 10Razzi: "Combining with https://gerrit.wikimedia.org/r/c/operations/puppet/+/661529/2 into single patch." [puppet] - 10https://gerrit.wikimedia.org/r/661528 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi)
[00:06:41] (03Abandoned) 10Razzi: site: add clouddb1021.eqiad.wmnet to insetup [puppet] - 10https://gerrit.wikimedia.org/r/661528 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi)
[00:07:58] (03PS4) 10Razzi: wikireplicas: Add configuration for clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/661529 (https://phabricator.wikimedia.org/T269211)
[00:08:32] !log ganeti1011 - manually deleting VM mwdebug1002 - T274689 T274023
[00:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:38] T274023: Convert mwdebug VMs to debian buster - https://phabricator.wikimedia.org/T274023
[00:08:38] T274689: sre.hosts.decomission -> generate_dns_snippets - > Cumin execution failed - https://phabricator.wikimedia.org/T274689
[00:08:55] 10SRE, 10SRE-tools: sre.hosts.decomission -> generate_dns_snippets - > Cumin execution failed - https://phabricator.wikimedia.org/T274689 (10Dzahn)
[00:09:07] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1005 is OK: (C)5e+06 ge (W)1e+06 ge 4.728e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005
[00:12:06] (03CR) 10Ladsgroup: [C: 03+1] logging::mediawiki::udp2log: replace cron with timer [puppet] - 10https://gerrit.wikimedia.org/r/661200 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[00:25:44] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host mwdebug1002.eqiad.wmnet
[00:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:49] !log ganeti - attempting to recreate VM mwdebug1002 with cookbook that was previously deleted manually (T274689 T274023)
[00:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:55] T274023: Convert mwdebug VMs to debian buster - https://phabricator.wikimedia.org/T274023
[00:26:55] T274689: sre.hosts.decomission -> generate_dns_snippets - > Cumin execution failed - https://phabricator.wikimedia.org/T274689
[00:30:21] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host mwdebug1002.eqiad.wmnet
[00:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:37] 10SRE: mw2295.codfw.wmnet returned [255]: Host key verification failed. - https://phabricator.wikimedia.org/T273726 (10Legoktm) 05Open→03Resolved Some kind of fluke? meh. I now check that puppet is enabled after I finish reimaging each host.
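[Editor's note] The Kafka RECOVERY line at [00:09:07] encodes its thresholds inline: `(C)5e+06 ge (W)1e+06 ge 4.728e+05` reads as critical at >= 5e6, warning at >= 1e6, with a current max replica lag of 4.728e5. A minimal sketch of that comparison, assuming the "ge" (greater-or-equal) semantics shown in the output; `status` is a hypothetical helper for illustration, not part of any Wikimedia tooling:

```python
def status(value, warn, crit):
    """Classify a metric against warn/crit thresholds ("ge" semantics assumed)."""
    if value >= crit:
        return "CRITICAL"
    if value >= warn:
        return "WARNING"
    return "OK"

# The broker's lag of 4.728e+05 sits below both thresholds,
# which is why the check flipped back to OK.
print(status(4.728e5, warn=1e6, crit=5e6))  # -> OK
```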
[00:30:41] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host mwdebug1002.eqiad.wmnet
[00:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:29] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1284.eqiad.wmnet
[00:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:34] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1283.eqiad.wmnet
[00:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:39] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1282.eqiad.wmnet
[00:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:49] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1281.eqiad.wmnet
[00:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:33] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host mwdebug1002.eqiad.wmnet
[00:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:01] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:19] 10SRE, 10SRE-tools: sre.hosts.decomission -> generate_dns_snippets - > Cumin execution failed - https://phabricator.wikimedia.org/T274689 (10Dzahn) summary: - existing VM, actually very old from 2016, works fine - tried to install new distro version on it, which was no problem on other VMs in codfw, but this...
[00:47:47] 10SRE, 10Wikimedia-Mailing-lists: wikipedia-mai & wikiur-l lists do not seem to have active list admins (mail archives empty after August 2018 & January 2019) - https://phabricator.wikimedia.org/T270837 (10Aklapper) 05Stalled→03Invalid @jayantanth: Unfortunately closing this Phabricator task as no further...
[00:48:07] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:49:44] 10SRE, 10SRE-tools: sre.hosts.decomission -> generate_dns_snippets - > Cumin execution failed - https://phabricator.wikimedia.org/T274689 (10crusnov) I'm doing a little looking around on this.
[00:49:51] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host mwdebug1002.eqiad.wmnet
[00:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:48] 10SRE, 10SRE-tools: sre.hosts.decomission -> generate_dns_snippets - > Cumin execution failed - https://phabricator.wikimedia.org/T274689 (10Dzahn) the last part about the public IP might have been just about the ordering of parameters I had in makevm.. give me a minute, trying that one more time
[00:51:19] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4814104080 and 335 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:55] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4445252656 and 354 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:59] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2567315768 and 251 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:55] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1337478008 and 297 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:09] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2898371320 and 389 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:26] 10SRE, 10SRE-tools: sre.hosts.decomission -> generate_dns_snippets - > Cumin execution failed - https://phabricator.wikimedia.org/T274689 (10Dzahn) confirmed. that last part about the IP was because I put the "eqiad_A" part after the "--network" part. good : sudo cookbook sre.ganeti.makevm eqiad_A --vcpus 4...
[00:55:05] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 191800 and 313 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:09] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 79392 and 317 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:10] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host mwdebug1002.eqiad.wmnet
[00:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:33] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 194176 and 340 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:47] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 77016 and 354 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:07] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 40968 and 374 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:00:03] !log crusnov@cumin1001 START - Cookbook sre.ganeti.makevm for new host mwdebug1002.eqiad.wmnet
[01:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:28:21] (03PS4) 10Ladsgroup: wikilabels: replace cron with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/662781 (https://phabricator.wikimedia.org/T273673)
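[Editor's note] Dzahn's fix at [00:54:26] is a classic positional-vs-option ordering problem: if the positional location argument is placed right after an option that expects a value, the parser binds it to that option instead of the positional slot. A minimal sketch of the failure mode, assuming argparse-style parsing; the argument names mirror the log, but this parser is a hypothetical illustration, not the real makevm cookbook code:

```python
import argparse

parser = argparse.ArgumentParser(prog="makevm")
parser.add_argument("location")           # positional, e.g. eqiad_A
parser.add_argument("--vcpus", type=int)
parser.add_argument("--network")          # option that consumes the next token

# Correct ordering, as confirmed in the log: positional first, then options.
ok = parser.parse_args(["eqiad_A", "--vcpus", "4", "--network", "public"])
print(ok.location, ok.vcpus, ok.network)  # -> eqiad_A 4 public

# Misordered, e.g. ["--network", "eqiad_A", ...]: "eqiad_A" would be bound
# to --network, leaving the positional unfilled and the parse failing.
```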
[01:29:40] (03CR) 10Ladsgroup: wikilabels: replace cron with systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/662781 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup)
[01:30:07] 10SRE, 10LDAP-Access-Requests: Request to add Georgina Burnett to the ldap/wmde group - https://phabricator.wikimedia.org/T273780 (10KFrancis) @CDanis The NDA is out for signatures. I will confirm when it's complete. Thanks!
[01:30:56] !log crusnov@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host mwdebug1002.eqiad.wmnet
[01:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:57] 10SRE, 10SRE-tools: sre.hosts.decomission -> generate_dns_snippets - > Cumin execution failed - https://phabricator.wikimedia.org/T274689 (10crusnov) I've successfully got makevm to work as expected after deleting the IP addresses for mwdebug1002 that were left behind by the DNS generation at the decom step fa...
[01:59:48] (03CR) 10Dzahn: [C: 03+1] wikilabels: replace cron with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/662781 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup)
[03:16:09] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:17:35] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.072 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:22:27] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:23:28] !log Restarted blazegraph on `wdqs1006`
[03:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:23:59] !log Depooled `wdqs1006` to catch up on lag
[03:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:26:59] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.084 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:43:11] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:47:53] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:19:43] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[05:36:49] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[05:38:03] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:43:11] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:14:35] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[06:19:37] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1075 is OK: HTTP OK: HTTP/1.0 200 OK - 23609 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[07:37:53] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:05] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210213T0800)
[08:21:41] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[08:23:25] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[09:43:37] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:48:49] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:07:51] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:13:01] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:13:03] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:18:11] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:29:35] 10SRE, 10Phabricator, 10Traffic: Access Forbidden to Phabricator at WikiArabia 2019 (Morocco) via Indian IP 185.174.156.75 - https://phabricator.wikimedia.org/T234598 (10Aklapper) 05Open→03Declined This is superseded by commit e85f4aae changing `hieradata/common.yaml` in the private puppet repo, plus htt...
[11:43:55] Okay. Quit banning me. I obviously don't know how to access your systems and I am sure that this is abnormal here.
[11:47:06] Convert labsdb1012 from multi-source to multi-instance
[11:47:10] what is this
[11:49:00] instances.yaml: Add es1033,es1034
[11:49:05] i have YAM
[11:49:44] 10.33 and a 10.34 is just a 6.7 which is a 3.4 | 4.3 but i can run 16.9 like the 6.7 | 7.6
[11:49:59] you are running up systems
[12:12:33] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:17:43] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:43:27] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:48:11] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:48:24] Things seem a bit slow at the moment. A few people have pointed it out in #wikipedia-en
[15:22:59] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:33:09] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is OK: HTTP OK: HTTP/1.0 200 OK - 23610 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:43:31] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:48:45] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:02:45] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[16:09:33] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is OK: HTTP OK: HTTP/1.0 200 OK - 23629 bytes in 0.274 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[17:56:17] 10SRE, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS) - https://phabricator.wikimedia.org/T238285 (10Aklapper)
[18:50:22] (03PS1) 10Urbanecm: Update urbanecm's dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/663993
[19:37:47] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:42:51] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:42:57] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:47:57] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:57:45] (03PS1) 10Aklapper: phabricator weekly changes email: List cookie-licked Bugzilla tasks [puppet] - 10https://gerrit.wikimedia.org/r/664002
[19:58:41] (03PS2) 10Aklapper: phabricator weekly changes email: List cookie-licked Bugzilla tasks [puppet] - 10https://gerrit.wikimedia.org/r/664002 (https://phabricator.wikimedia.org/T274711)
[20:42:47] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:47:55] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:37:55] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:42:37] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:50:35] 10SRE, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Krinkle)