[00:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Evening backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201203T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:00:34] * DannyS712 has a patch to deploy in a minute
[00:00:44] security patch, if anyone can deploy it
[00:04:56] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01071 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[00:05:11] Urbanecm are you around for the backport window?
[00:09:34] (CR) Ppchelko: [C: -2] "until wmf.21 is rolled out." [mediawiki-config] - https://gerrit.wikimedia.org/r/644317 (https://phabricator.wikimedia.org/T263579) (owner: Ppchelko)
[00:32:17] (PS7) Jeena Huneidi: Add new helm chart for rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: Mstyles)
[00:32:36] (CR) jerkins-bot: [V: -1] Add new helm chart for rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: Mstyles)
[00:32:44] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 282714360 and 151 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:33:38] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 133181760 and 204 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:33:54] so.. regarding "widespread puppet failures" alert .. I looked
[00:34:00] and the main contributors are:
[00:34:04] druid
[00:34:10] cloudelastic
[00:34:20] with 20% and 16.66%
[00:34:34] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 774395568 and 261 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:34:40] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 558226744 and 268 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:35:27] (PS1) Bstorm: wikireplicas: extend maintain_dbusers to multiinstance replicas [puppet] - https://gerrit.wikimedia.org/r/644949 (https://phabricator.wikimedia.org/T268312)
[00:35:44] no, scratch that. it's actually clouddb
[00:35:54] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 541666312 and 341 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:36:45] bstorm: I am afraid something goes wrong after the revert.
[00:36:57] ?
[00:37:13] mutante: what do you mean?
[00:37:21] bstorm: puppet runs on clouddb hosts seem broken
[00:37:32] see for example on clouddb1014
[00:37:40] mutante: that doesn't run on clouddb hosts...and that is a known issue unrelated :)
[00:38:11] hmm. but something just pushed things over the threshold for "widespread" failures
[00:38:14] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 133742608 and 480 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:38:17] T260511
[00:38:18] T260511: Parametrize wmf-pt-kill so it can connect to different sockets - https://phabricator.wikimedia.org/T260511
[00:38:37] mutante: I had disabled puppet on several of them for a bit and then enabled them
[00:38:59] bstorm: oh, that would explain it
[00:39:23] Operations, Android-app-Bugs, Fundraising-Backlog, Thank-You-Page, and 4 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (MattCleinman) From https://developer.apple.com/library/archive/documentation/General/Conceptual/AppSearch/UniversalLi...
[00:39:33] from my point of view it looked like "suddenly widespread failure", then a revert and then the error was "still" there
[00:39:36] gotcha now
[00:40:10] (CR) Bstorm: [C: +2] wikireplicas: extend maintain_dbusers to multiinstance replicas [puppet] - https://gerrit.wikimedia.org/r/644949 (https://phabricator.wikimedia.org/T268312) (owner: Bstorm)
[00:40:36] :)
[00:41:01] I'm actually patching and messing with the labstore1004 host
[00:41:36] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1647888 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:41:42] ACKNOWLEDGEMENT - Ensure local MW versions match expected deployment on deploy1002 is CRITICAL: CRITICAL: 646 mismatched wikiversions daniel_zahn not in use yet https://wikitech.wikimedia.org/wiki/Application_servers
[00:41:42] ACKNOWLEDGEMENT - Keyholder SSH agent on deploy1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. daniel_zahn not in use yet https://wikitech.wikimedia.org/wiki/Keyholder
[00:41:42] ACKNOWLEDGEMENT - mediawiki-installation DSH group on deploy1002 is CRITICAL: Host deploy1002 is not in mediawiki-installation dsh group daniel_zahn not in use yet https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[00:41:42] ACKNOWLEDGEMENT - Ensure local MW versions match expected deployment on deploy2002 is CRITICAL: CRITICAL: 643 mismatched wikiversions daniel_zahn not in use yet https://wikitech.wikimedia.org/wiki/Application_servers
[00:41:42] ACKNOWLEDGEMENT - Keyholder SSH agent on deploy2002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. daniel_zahn not in use yet https://wikitech.wikimedia.org/wiki/Keyholder
[00:41:42] ACKNOWLEDGEMENT - mediawiki-installation DSH group on deploy2002 is CRITICAL: Host deploy2002 is not in mediawiki-installation dsh group daniel_zahn not in use yet https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[00:42:30] (CR) Bstorm: "Works like a charm! No change to the service and config, but I got:" [puppet] - https://gerrit.wikimedia.org/r/644949 (https://phabricator.wikimedia.org/T268312) (owner: Bstorm)
[00:46:42] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 345270952 and 24 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:47:56] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 552771864 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:50] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 357435144 and 23 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:02] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 371426568 and 14 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:03] (PS1) Bstorm: wikireplicas: extend maintain_dbusers to multiinstance--test 2 [puppet] - https://gerrit.wikimedia.org/r/644950 (https://phabricator.wikimedia.org/T268312)
[00:51:06] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 63656 and 42 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:32] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 80494688 and 415 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:42] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 20264 and 78 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:52:00] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 203296184 and 444 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:52:02] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 267342032 and 445 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:52:10] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 72 and 107 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:52:16] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 47207792 and 460 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:52:58] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 164024 and 155 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:08] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 50224 and 165 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:14] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 66992 and 170 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:36] (CR) Bstorm: [C: +2] "Merging because this is a functional noop that will display what is needed for the last phase of this." [puppet] - https://gerrit.wikimedia.org/r/644950 (https://phabricator.wikimedia.org/T268312) (owner: Bstorm)
[00:54:44] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 340700400 and 607 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:50] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 78447608 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:32] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1786896 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:59:48] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1984224 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:00:04] twentyafterfour: That opportune time is upon us again. Time for a Phabricator update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201203T0100).
[01:00:26] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 38236280 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:02:10] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1605904 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:02:24] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1371368 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:03:50] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1486208 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:05:02] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 890840 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:07:41] (Abandoned) Dzahn: wmcs::instance: remove diamond removal remnants [puppet] - https://gerrit.wikimedia.org/r/632570 (https://phabricator.wikimedia.org/T210993) (owner: Dzahn)
[01:10:26] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 234709784 and 14 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:11:28] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): cloudvirt1025 connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (Andrew)
[01:12:06] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 52616 and 56 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:15:15] (PS4) Dzahn: mailman: replace cron with systemd timer and move to profile [puppet] - https://gerrit.wikimedia.org/r/637037 (https://phabricator.wikimedia.org/T265138)
[01:17:23] (CR) Dzahn: [C: +2] mailman: replace cron with systemd timer and move to profile [puppet] - https://gerrit.wikimedia.org/r/637037 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[01:21:22] (PS1) Bstorm: wikireplicas: extend maintain_dbusers to multiinstance [puppet] - https://gerrit.wikimedia.org/r/644952 (https://phabricator.wikimedia.org/T268312)
[01:21:42] !log lists1001 - remove "delete_held_messages" cronjob from root crontab - replaced by systemd timer - systemctl start delete_held_messages.service and confirmed it succeeded
[01:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:10] (PS8) Jeena Huneidi: Add new helm chart for rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: Mstyles)
[01:26:09] (CR) jerkins-bot: [V: -1] Add new helm chart for rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: Mstyles)
[01:27:31] (CR) Dzahn: labstore: add data types and some other style fixes (5 comments) [puppet] - https://gerrit.wikimedia.org/r/622666 (owner: Dzahn)
[01:27:52] (PS17) Dzahn: labstore: add data types and some other style fixes [puppet] - https://gerrit.wikimedia.org/r/622666
[01:31:23] (PS2) Bstorm: wikireplicas: extend maintain_dbusers to multiinstance [puppet] - https://gerrit.wikimedia.org/r/644952 (https://phabricator.wikimedia.org/T268312)
[01:34:57] (CR) Bstorm: "For purposes of this patch, I aim to ignore formatting issues because I want to put up a black formatting patch soon. Also, there's probab" [puppet] - https://gerrit.wikimedia.org/r/644952 (https://phabricator.wikimedia.org/T268312) (owner: Bstorm)
[01:35:01] (PS2) Dzahn: aptrepo: replace cron with systemd timer [puppet] - https://gerrit.wikimedia.org/r/641315 (https://phabricator.wikimedia.org/T265138)
[01:35:58] (CR) Bstorm: wikireplicas: extend maintain_dbusers to multiinstance (1 comment) [puppet] - https://gerrit.wikimedia.org/r/644952 (https://phabricator.wikimedia.org/T268312) (owner: Bstorm)
[01:36:49] (PS9) Jeena Huneidi: Add new helm chart for rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: Mstyles)
[01:39:27] (PS3) Dzahn: ores: move LB setup for cloud from role to profile [puppet] - https://gerrit.wikimedia.org/r/643117
[01:42:10] (CR) Bstorm: "WMCS is already formatting all our scripts in puppet and elsewhere with `black --line-length 80`, which is conflicting with this. There ar" [puppet] - https://gerrit.wikimedia.org/r/554825 (https://phabricator.wikimedia.org/T211750) (owner: Jbond)
[01:53:32] Operations, SRE-tools, tox-wikimedia, Patch-For-Review, User-Kormat: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (Bstorm) FWIW, WMCS uses `black` with line-length set to 80 for all python and has for a little while. In non-puppet repos, we have tox check...
[02:00:10] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:00:20] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:05:28] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:14:46] PROBLEM - snapshot of s3 in eqiad on alert1001 is CRITICAL: snapshot for s3 at eqiad taken more than 3 days ago: Most recent backup 2020-11-30 02:06:32 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[02:47:54] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:01:04] Operations, serviceops, Datacenter-Switchover: Updates to warmup script - https://phabricator.wikimedia.org/T269179 (Krinkle) >>! In T269179#6662243, @Volans wrote: >> you were thinking about rewriting the warmup script in Python -- you were kind enough to let me talk you out of doing that right befo...
[03:01:33] Operations, serviceops, Datacenter-Switchover: Updates to warmup script - https://phabricator.wikimedia.org/T269179 (Krinkle)
[03:24:14] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[03:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:26:13] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[03:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:58:30] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:01:48] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 119 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37
[04:17:12] RECOVERY - snapshot of s3 in eqiad on alert1001 is OK: Last snapshot for s3 at eqiad (db1095.eqiad.wmnet:3313) taken on 2020-12-02 22:56:01 (1002 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[04:27:44] PROBLEM - Check systemd state on ms-be1063 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:48:26] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:01:34] RECOVERY - Check systemd state on ms-be1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:50:46] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:55:21] Operations, ops-codfw, DBA, DC-Ops: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (Marostegui) Thank you Papaul. * Memory looks good * CPUs look good * Disk space looks good * RAID level looks good * pvs looks good (we need to add the last 1TB the...
[05:55:58] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:06:24] (CR) Marostegui: [C: +2] production-m2.sql.erb: Add sockpuppet users [puppet] - https://gerrit.wikimedia.org/r/644745 (https://phabricator.wikimedia.org/T268505) (owner: Marostegui)
[06:06:47] !log Create sockpuppet database on m2 T268505
[06:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:55] T268505: New database request: sockpuppet - https://phabricator.wikimedia.org/T268505
[06:23:32] (PS1) Marostegui: install_server: Remove clouddb1016 [puppet] - https://gerrit.wikimedia.org/r/644960
[06:27:39] (CR) Marostegui: [C: +2] install_server: Remove clouddb1016 [puppet] - https://gerrit.wikimedia.org/r/644960 (owner: Marostegui)
[06:59:46] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:10:42] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:12:56] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37
[07:13:30] Operations, Wikimedia-Mailing-lists: Please create testing-infrastructure mailing list - https://phabricator.wikimedia.org/T269327 (Jrbranaa)
[07:15:06] (PS1) Elukey: profile::prometheus::ops: add monitoring for zookeeper test [puppet] - https://gerrit.wikimedia.org/r/644962 (https://phabricator.wikimedia.org/T268074)
[07:53:04] Operations, Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (elukey)
[07:53:27] Operations, Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (elukey)
[08:01:12] (CR) Ladsgroup: [C: +1] "Thanks" [puppet] - https://gerrit.wikimedia.org/r/643117 (owner: Dzahn)
[08:08:11] Operations, Parsoid, serviceops: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (MoritzMuehlenhoff) I checked the journalctl of ferm on parse2001 and there are no errors logged, so it's not a case of Ferm failing to start (via syntax errors, failing DNS lookup or whatev...
[08:29:02] (CR) Elukey: [C: +2] hadoop: Migrate hiera() to lookup() and set datatype [puppet] - https://gerrit.wikimedia.org/r/644770 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[08:29:07] (PS2) Elukey: hadoop: Migrate hiera() to lookup() and set datatype [puppet] - https://gerrit.wikimedia.org/r/644770 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[08:30:48] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:34:41] (CR) Filippo Giunchedi: [V: +1 C: +2] site: assign roles for all ms-be / ms-fe hosts [puppet] - https://gerrit.wikimedia.org/r/644847 (https://phabricator.wikimedia.org/T265419) (owner: Filippo Giunchedi)
[08:34:46] (PS2) Filippo Giunchedi: site: assign roles for all ms-be / ms-fe hosts [puppet] - https://gerrit.wikimedia.org/r/644847 (https://phabricator.wikimedia.org/T265419)
[08:35:38] (CR) Elukey: [C: +1] superset: add cache to an-tool1010 [puppet] - https://gerrit.wikimedia.org/r/644672 (https://phabricator.wikimedia.org/T268784) (owner: Razzi)
[08:41:23] (PS1) Filippo Giunchedi: Revert "grafana: set thanos as default datasource" [puppet] - https://gerrit.wikimedia.org/r/644970 (https://phabricator.wikimedia.org/T269329)
[08:41:43] akosiaris: womp womp :( https://phabricator.wikimedia.org/T269329
[08:42:48] gilles: https://gerrit.wikimedia.org/r/c/operations/puppet/+/644970
[08:43:18] phew, good thing it was something so benign :)
[08:43:29] I was worried of actual data loss for the queries
[08:43:51] what's thanos?
[08:44:10] (CR) Gilles: [C: +1] Revert "grafana: set thanos as default datasource" [puppet] - https://gerrit.wikimedia.org/r/644970 (https://phabricator.wikimedia.org/T269329) (owner: Filippo Giunchedi)
[08:44:29] (CR) Filippo Giunchedi: [C: +2] Revert "grafana: set thanos as default datasource" [puppet] - https://gerrit.wikimedia.org/r/644970 (https://phabricator.wikimedia.org/T269329) (owner: Filippo Giunchedi)
[08:44:37] gilles: an aggregator and long-term storage for prometheus metrics
[08:44:47] roughly
[08:44:58] indeed, what kormat said :) also https://thanos.wikimedia.org
[08:45:58] does it solve the issue of prometheus metrics that write either to eqiad or codfw depending on which is the master DC?
[08:46:04] ok applied the change, we're back :)
[08:46:19] gilles: it does yeah, it fans out the query to all prometheus
[08:46:30] 🙏
[08:46:38] there's more info here https://wikitech.wikimedia.org/wiki/Thanos
[08:47:25] I'll resolve the task and open a new one to switch the default datasource, obviously we need to audit the dashboards
[08:47:44] could probably get migrated via a script?
[08:48:06] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:48:14] yeah I hope so
[08:48:25] I would be surprised if that wasn't the case
[08:55:20] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:57:04] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:57:50] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:00:34] Operations, Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (fgiunchedi) IIRC at the time we set up multi-instance cassandra one/some of the ports could not be changed, and thus we went with per-instance addresses. This might not be the case anymor...
[09:02:13] (CR) Filippo Giunchedi: profile::prometheus::ops: add monitoring for zookeeper test (1 comment) [puppet] - https://gerrit.wikimedia.org/r/644962 (https://phabricator.wikimedia.org/T268074) (owner: Elukey)
[09:03:32] (Abandoned) Elukey: WIP - hive: allow multiple metastores in the hive-site.xml config [puppet] - https://gerrit.wikimedia.org/r/644832 (owner: Elukey)
[09:05:13] (CR) Elukey: profile::prometheus::ops: add monitoring for zookeeper test (1 comment) [puppet] - https://gerrit.wikimedia.org/r/644962 (https://phabricator.wikimedia.org/T268074) (owner: Elukey)
[09:10:28] (PS1) Filippo Giunchedi: hiera: add ms-be20[57-61] [puppet] - https://gerrit.wikimedia.org/r/645037 (https://phabricator.wikimedia.org/T269337)
[09:13:07] (CR) Filippo Giunchedi: [C: +2] hiera: add ms-be20[57-61] [puppet] - https://gerrit.wikimedia.org/r/645037 (https://phabricator.wikimedia.org/T269337) (owner: Filippo Giunchedi)
[09:14:47] !log installing qemu security updates on Stretch
[09:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:06] (PS1) Elukey: hive: use an explicit parameter to deploy jdbc settings [puppet] - https://gerrit.wikimedia.org/r/645038
[09:16:08] (PS1) Elukey: Move the hive metastore to analytics-test-hive in Hadoop test [puppet] - https://gerrit.wikimedia.org/r/645039 (https://phabricator.wikimedia.org/T268028)
[09:17:03] (PS2) Elukey: profile::prometheus::ops: add monitoring for zookeeper test [puppet] - https://gerrit.wikimedia.org/r/644962 (https://phabricator.wikimedia.org/T268074)
[09:20:15] godog: lol... We should have expected that :-(
[09:21:01] akosiaris: ehehe it did seem a little too easy indeed
[09:22:24] (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26870/console" [puppet] - https://gerrit.wikimedia.org/r/645039 (https://phabricator.wikimedia.org/T268028) (owner: Elukey)
[09:24:09] there will some ms-be20 hosts down, expected
[09:24:58] PROBLEM - Host ms-be2059 is DOWN: PING CRITICAL - Packet loss = 100%
[09:26:14] PROBLEM - Host ms-be2060 is DOWN: PING CRITICAL - Packet loss = 100%
[09:26:20] PROBLEM - Host ms-be2058 is DOWN: PING CRITICAL - Packet loss = 100%
[09:26:30] PROBLEM - Host ms-be2061 is DOWN: PING CRITICAL - Packet loss = 100%
[09:26:52] RECOVERY - Host ms-be2059 is UP: PING OK - Packet loss = 0%, RTA = 34.43 ms
[09:26:53] (CR) Filippo Giunchedi: [C: +1] profile::prometheus::ops: add monitoring for zookeeper test [puppet] - https://gerrit.wikimedia.org/r/644962 (https://phabricator.wikimedia.org/T268074) (owner: Elukey)
[09:26:56] RECOVERY - Host ms-be2061 is UP: PING OK - Packet loss = 0%, RTA = 33.34 ms
[09:26:58] RECOVERY - Host ms-be2060 is UP: PING OK - Packet loss = 0%, RTA = 31.74 ms
[09:27:36] (PS2) Elukey: hive: use an explicit parameter to deploy jdbc settings [puppet] - https://gerrit.wikimedia.org/r/645038
[09:27:38] (PS2) Elukey: Move the hive metastore to analytics-test-hive in Hadoop test [puppet] - https://gerrit.wikimedia.org/r/645039 (https://phabricator.wikimedia.org/T268028)
[09:31:32] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:34:35] !log gnt-instance reboot ldap-replica2003 to validate new qemu
[09:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:26] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:35:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:26] (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26872/console" [puppet] - https://gerrit.wikimedia.org/r/645039 (https://phabricator.wikimedia.org/T268028) (owner: Elukey)
[09:37:56] Operations, Traffic: Package and deploy varnish 6.0.7 - https://phabricator.wikimedia.org/T268736 (ema) Varnish 6.0.7-1wm1 has been working well on both cp4032 and cp3054, upgrading all other nodes (except for cp3052 which is running 5.1.3-1wm15 as part of T264398).
[09:37:58] RECOVERY - Host ms-be2058 is UP: PING OK - Packet loss = 0%, RTA = 33.46 ms
[09:39:01] (PS3) Elukey: hive: use an explicit parameter to deploy jdbc settings [puppet] - https://gerrit.wikimedia.org/r/645038
[09:39:03] (PS3) Elukey: Move the hive metastore to analytics-test-hive in Hadoop test [puppet] - https://gerrit.wikimedia.org/r/645039 (https://phabricator.wikimedia.org/T268028)
[09:39:40] Operations, vm-requests, Kubernetes: codfw: 4 VM request for kubernetes staging - https://phabricator.wikimedia.org/T268747 (akosiaris) Open→Resolved VMs are up and running and the services (etcd, apiserver) have been setup on them
[09:40:37] (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26873/console" [puppet] - https://gerrit.wikimedia.org/r/645039 (https://phabricator.wikimedia.org/T268028) (owner: Elukey)
[09:40:57] !log A:cp start rolling varnish upgrade to 6.0.7-1wm1 T268736
[09:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:05] T268736: Package and deploy varnish 6.0.7 - https://phabricator.wikimedia.org/T268736
[09:41:07] (PS1) Alexandros Kosiaris: k8s-codfw-staging: Add DNS RRs [dns] - https://gerrit.wikimedia.org/r/645041 (https://phabricator.wikimedia.org/T244335)
[09:43:04] (CR) Elukey: [C: +2] profile::prometheus::ops: add monitoring for zookeeper test [puppet] - https://gerrit.wikimedia.org/r/644962 (https://phabricator.wikimedia.org/T268074) (owner: Elukey)
[09:44:20] (CR) Elukey: [C: +2] hive: use an explicit parameter to deploy jdbc settings [puppet] - https://gerrit.wikimedia.org/r/645038 (owner: Elukey)
[09:44:37] (CR) Elukey: [V: +1 C: +2] Move the hive metastore to analytics-test-hive in Hadoop test [puppet] - https://gerrit.wikimedia.org/r/645039 (https://phabricator.wikimedia.org/T268028) (owner: Elukey)
[09:47:08] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:03:20] Operations, netops: SNMPD_AUTH_RESTRICTED_ADDRESS through management-instance - https://phabricator.wikimedia.org/T269340 (ayounsi) p:Triage→Low
[10:05:20] Operations, netops: SNMPD_AUTH_RESTRICTED_ADDRESS through management-instance - https://phabricator.wikimedia.org/T269340 (ayounsi) Open→Resolved
[10:15:54] Operations, Cloud-Services, Datasets-Archiving, Datasets-General-or-Unknown, and 2 others: Adjust bandwidth/connection limits, memory settings on labstore1006,7 as appropriate - https://phabricator.wikimedia.org/T191491 (Hydriz)
[10:16:01] Operations, Cloud-Services, Datasets-Archiving, Datasets-General-or-Unknown, and 2 others: Adjust bandwidth/connection limits, memory settings on labstore1006,7 as appropriate - https://phabricator.wikimedia.org/T191491 (Hydriz) Will it be possible to lift the bandwidth cap for specific hosts wit...
[10:25:16] (CR) Volans: Add per device _get_vlans() (1 comment) [software/homer] - https://gerrit.wikimedia.org/r/644872 (owner: Ayounsi)
[10:41:58] Operations, netops: Junos changes for management-instance support on QFX - https://phabricator.wikimedia.org/T269340 (ayounsi)
[10:46:01] Operations, conftool, serviceops, Datacenter-Switchover: Disable maintenance scripts via conftool - https://phabricator.wikimedia.org/T266717 (Volans) I'm not familiar on what's the usual execution time of all the scripts, but I'm wondering if we also need to stop all the systemd unit triggered b...
[10:50:32] (03PS1) 10Jbond: add configueration for puppetdb cleanup TTLs [puppet] - 10https://gerrit.wikimedia.org/r/645047 [10:51:23] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26874/console" [puppet] - 10https://gerrit.wikimedia.org/r/645047 (owner: 10Jbond) [10:55:21] (03PS1) 10Alexandros Kosiaris: k8s: Remove default values for some parameters [puppet] - 10https://gerrit.wikimedia.org/r/645048 (https://phabricator.wikimedia.org/T244335) [10:55:22] (03PS1) 10Alexandros Kosiaris: profile::kubernetes: fold infrastructure_config to profile [puppet] - 10https://gerrit.wikimedia.org/r/645049 (https://phabricator.wikimedia.org/T244335) [10:55:24] 10Operations, 10netops: Junos changes for management-instance support on QFX - https://phabricator.wikimedia.org/T269340 (10ayounsi) Unrelated, but to document it somewhere: > Dec 3 10:03:00 cloudsw1-c8-eqiad /kernel: tcp_timer_keep:Local(0x80000010:53122) Foreign(0x80000001:6997) Harmless according to https... 
[10:55:27] (03PS1) 10Alexandros Kosiaris: profile::kubernetes::node: Remove toolforge customizations [puppet] - 10https://gerrit.wikimedia.org/r/645050 (https://phabricator.wikimedia.org/T244335) [10:55:29] (03PS1) 10Alexandros Kosiaris: scap-helm: Remove the last vestige of it [puppet] - 10https://gerrit.wikimedia.org/r/645051 [10:55:31] (03PS1) 10Alexandros Kosiaris: deployment_server: Add k8s-staging-codfw [puppet] - 10https://gerrit.wikimedia.org/r/645052 (https://phabricator.wikimedia.org/T244335) [10:55:33] (03PS1) 10Alexandros Kosiaris: kube-apiserver: Use the infrastructure users file directly [puppet] - 10https://gerrit.wikimedia.org/r/645053 (https://phabricator.wikimedia.org/T244335) [10:55:35] (03PS1) 10Alexandros Kosiaris: k8s:apiserver: Manage kube user/group [puppet] - 10https://gerrit.wikimedia.org/r/645054 (https://phabricator.wikimedia.org/T244335) [10:55:44] (03PS1) 10Muehlenhoff: Move s-nail (providing mail(1)) to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/645055 (https://phabricator.wikimedia.org/T268725) [10:56:42] 10Operations, 10homer, 10netops: Homer: merge all system.conf templates in one - https://phabricator.wikimedia.org/T269345 (10ayounsi) p:05Triage→03Low [10:57:03] (03CR) 10jerkins-bot: [V: 04-1] profile::kubernetes: fold infrastructure_config to profile [puppet] - 10https://gerrit.wikimedia.org/r/645049 (https://phabricator.wikimedia.org/T244335) (owner: 10Alexandros Kosiaris) [10:59:55] (03PS1) 10Muehlenhoff: dumps::generation::server::statsender: Drop OS check [puppet] - 10https://gerrit.wikimedia.org/r/645056 [10:59:57] (03PS2) 10Jbond: add configuration for puppetdb cleanup TTLs [puppet] - 10https://gerrit.wikimedia.org/r/645047 [11:00:04] mvolz: Dear deployers, time to do the Services – Citoid / Zotero deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201203T1100). 
[11:00:13] (03CR) 10jerkins-bot: [V: 04-1] dumps::generation::server::statsender: Drop OS check [puppet] - 10https://gerrit.wikimedia.org/r/645056 (owner: 10Muehlenhoff) [11:00:38] (03CR) 10Alexandros Kosiaris: [C: 03+1] hieradata: enable onhost memcached on jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/643234 (https://phabricator.wikimedia.org/T244340) (owner: 10Effie Mouzeli) [11:00:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26876/console" [puppet] - 10https://gerrit.wikimedia.org/r/645047 (owner: 10Jbond) [11:00:46] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26875/console" [puppet] - 10https://gerrit.wikimedia.org/r/645048 (https://phabricator.wikimedia.org/T244335) (owner: 10Alexandros Kosiaris) [11:00:49] (03CR) 10Alexandros Kosiaris: [C: 03+1] hieradata: enable onhost memcached on parsoid servers [puppet] - 10https://gerrit.wikimedia.org/r/643235 (https://phabricator.wikimedia.org/T244340) (owner: 10Effie Mouzeli) [11:01:06] (03PS2) 10Muehlenhoff: dumps::generation::server::statsender: Drop OS check [puppet] - 10https://gerrit.wikimedia.org/r/645056 [11:01:32] (03CR) 10jerkins-bot: [V: 04-1] dumps::generation::server::statsender: Drop OS check [puppet] - 10https://gerrit.wikimedia.org/r/645056 (owner: 10Muehlenhoff) [11:02:16] (03PS3) 10Muehlenhoff: dumps::generation::server::statsender: Drop OS check [puppet] - 10https://gerrit.wikimedia.org/r/645056 [11:04:18] (03PS4) 10Ayounsi: Add per device _get_vlans() [software/homer] - 10https://gerrit.wikimedia.org/r/644872 [11:04:42] (03CR) 10Ayounsi: [C: 03+2] Add per device _get_vlans() (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/644872 (owner: 10Ayounsi) [11:04:49] (03PS1) 10Elukey: hive: force the hive server to use the local metastore [puppet] - 10https://gerrit.wikimedia.org/r/645057 
(https://phabricator.wikimedia.org/T268028) [11:06:13] (03Merged) 10jenkins-bot: Add per device _get_vlans() [software/homer] - 10https://gerrit.wikimedia.org/r/644872 (owner: 10Ayounsi) [11:06:50] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26880/console" [puppet] - 10https://gerrit.wikimedia.org/r/645057 (https://phabricator.wikimedia.org/T268028) (owner: 10Elukey) [11:07:00] (03Abandoned) 10Jcrespo: Revert "remote-backup-mariadb: update cron to systemd::timer::job" [puppet] - 10https://gerrit.wikimedia.org/r/644662 (owner: 10Jcrespo) [11:07:20] (03CR) 10Elukey: [V: 03+1 C: 03+2] hive: force the hive server to use the local metastore [puppet] - 10https://gerrit.wikimedia.org/r/645057 (https://phabricator.wikimedia.org/T268028) (owner: 10Elukey) [11:08:43] (03CR) 10Jcrespo: "Daniel: It kept failing after setting a full path. :'-(" [puppet] - 10https://gerrit.wikimedia.org/r/644662 (owner: 10Jcrespo) [11:08:44] (03PS2) 10Jcrespo: mariadb-backups: Remove old scheduled job disabling [puppet] - 10https://gerrit.wikimedia.org/r/644861 [11:11:48] (03CR) 10Elukey: [C: 03+1] "Analytics part +1!" 
[puppet] - 10https://gerrit.wikimedia.org/r/645055 (https://phabricator.wikimedia.org/T268725) (owner: 10Muehlenhoff) [11:13:03] (03PS2) 10Alexandros Kosiaris: k8s: Remove default values for some parameters [puppet] - 10https://gerrit.wikimedia.org/r/645048 (https://phabricator.wikimedia.org/T244335) [11:13:05] (03PS2) 10Alexandros Kosiaris: profile::kubernetes: fold infrastructure_config to profile [puppet] - 10https://gerrit.wikimedia.org/r/645049 (https://phabricator.wikimedia.org/T244335) [11:13:07] (03PS2) 10Alexandros Kosiaris: profile::kubernetes::node: Remove toolforge customizations [puppet] - 10https://gerrit.wikimedia.org/r/645050 (https://phabricator.wikimedia.org/T244335) [11:13:09] (03PS2) 10Alexandros Kosiaris: scap-helm: Remove the last vestige of it [puppet] - 10https://gerrit.wikimedia.org/r/645051 [11:13:11] (03PS2) 10Alexandros Kosiaris: deployment_server: Add k8s-staging-codfw [puppet] - 10https://gerrit.wikimedia.org/r/645052 (https://phabricator.wikimedia.org/T244335) [11:13:13] (03PS2) 10Alexandros Kosiaris: kube-apiserver: Use the infrastructure users file directly [puppet] - 10https://gerrit.wikimedia.org/r/645053 (https://phabricator.wikimedia.org/T244335) [11:13:15] (03PS2) 10Alexandros Kosiaris: k8s:apiserver: Manage kube user/group [puppet] - 10https://gerrit.wikimedia.org/r/645054 (https://phabricator.wikimedia.org/T244335) [11:13:17] (03CR) 10Ayounsi: [C: 03+1] "+1 for homer" [puppet] - 10https://gerrit.wikimedia.org/r/645055 (https://phabricator.wikimedia.org/T268725) (owner: 10Muehlenhoff) [11:13:19] (03CR) 10Marostegui: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/645055 (https://phabricator.wikimedia.org/T268725) (owner: 10Muehlenhoff) [11:16:39] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26884/console" [puppet] - 10https://gerrit.wikimedia.org/r/645048 (https://phabricator.wikimedia.org/T244335) (owner: 10Alexandros Kosiaris) [11:21:15] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26886/console" [puppet] - 10https://gerrit.wikimedia.org/r/645050 (https://phabricator.wikimedia.org/T244335) (owner: 10Alexandros Kosiaris) [11:21:54] (03PS3) 10Jbond: add configuration for puppetdb cleanup TTLs [puppet] - 10https://gerrit.wikimedia.org/r/645047 [11:23:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26888/console" [puppet] - 10https://gerrit.wikimedia.org/r/645047 (owner: 10Jbond) [11:23:26] (03CR) 10Jbond: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/644833 (owner: 10Muehlenhoff) [11:23:40] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26887/console" [puppet] - 10https://gerrit.wikimedia.org/r/645051 (owner: 10Alexandros Kosiaris) [11:26:11] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26889/console" [puppet] - 10https://gerrit.wikimedia.org/r/645052 (https://phabricator.wikimedia.org/T244335) (owner: 10Alexandros Kosiaris) [11:28:37] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26890/console" [puppet] - 10https://gerrit.wikimedia.org/r/645053 (https://phabricator.wikimedia.org/T244335) (owner: 10Alexandros Kosiaris) [11:31:44] !log volans@cumin2001 START - Cookbook sre.hosts.downtime for 
0:10:00 on cumin2001.codfw.wmnet with reason: volans's test [11:31:45] !log volans@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on cumin2001.codfw.wmnet with reason: volans's test [11:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:03] rzl: you might like it ^^^ :D [11:33:28] (03CR) 10Volans: "Tested on cumin2001, actual output on SAL:" [cookbooks] - 10https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [11:35:34] PROBLEM - Check nf_conntrack usage in neutron netns on cloudnet1004 is CRITICAL: CRITICAL: nf_conntrack usage over 80% in netns qrouter-d93771ba-2711-4f88-804a-8df6fd03978a https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:36:25] (03CR) 10Effie Mouzeli: [C: 04-1] "There are a few things that I think make this patch to require some more love: 1) it should be split in at least 2 different patches, one " [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [11:40:46] RECOVERY - Check nf_conntrack usage in neutron netns on cloudnet1004 is OK: OK: everything is apparently fine https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:43:24] (03PS1) 10Jbond: P:puppetdb: increase node-ttl to 14d [puppet] - 10https://gerrit.wikimedia.org/r/645058 [11:43:28] PROBLEM - Host druid1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:50] this is me --^ [11:46:06] !log move druid1001 to rack A1 - T267065 [11:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:13] T267065: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 [11:46:36] (03PS1) 10Lucas Werkmeister (WMDE): Enable implicit description usage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645059 
(https://phabricator.wikimedia.org/T267745) [11:47:49] jouncebot: refresh please [11:47:50] I refreshed my knowledge about deployments. [11:51:04] jouncebot: now [11:51:04] For the next 0 hour(s) and 8 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201203T1100) [11:51:19] Lucas_WMDE: how long will this take? [11:51:45] not too long, I think [11:51:58] I can’t really test it on mwdebug, I believe, since it goes through the job queue [11:52:11] so I’d sync Wikibase.php, verify that it works as expected, and (hopefully) be done [11:52:17] !log Start of mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log in a tmux at mwmaint1002 (wiki=eswiki; T246539) [11:52:17] (and then watch logstash for a while) [11:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:22] oh cool, I want to rolling pool and depool jobrunners :p [11:52:23] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [11:53:16] Lucas_WMDE: just give me a go ahead to start my part, I want to install memcached on jobrunners and then rolling depool and pool them [11:53:39] ok!
[11:55:25] (03CR) 10Itamar Givon: [C: 03+1] Enable implicit description usage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645059 (https://phabricator.wikimedia.org/T267745) (owner: 10Lucas Werkmeister (WMDE)) [11:55:34] tx [11:57:24] !log Start of mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log in a tmux at mwmaint1002 (wiki=enwiki; T246539) [11:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:31] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: May I have your attention please! European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201203T1200) [12:00:04] Lucas_WMDE: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:22] o/ [12:01:00] I also realized I can test (part of) this change on mwdebug after all \o/ [12:01:02] RECOVERY - Host druid1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.05 ms [12:01:02] so I’ll do that first [12:01:39] (03CR) 10Muehlenhoff: [C: 03+2] Move s-nail (providing mail(1)) to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/645055 (https://phabricator.wikimedia.org/T268725) (owner: 10Muehlenhoff) [12:01:43] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable implicit description usage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645059 (https://phabricator.wikimedia.org/T267745) (owner: 10Lucas Werkmeister (WMDE)) [12:01:45] (03Merged) 10jenkins-bot: Enable implicit description usage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645059 (https://phabricator.wikimedia.org/T267745) (owner: 10Lucas Werkmeister (WMDE)) [12:02:35] pulled to mwdebug1001, testing… [12:03:53] (03PS1) 10Muehlenhoff: Revert "Move s-nail (providing mail(1)) to standard packages" [puppet] - 10https://gerrit.wikimedia.org/r/645061 [12:04:31] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10Jclark-ctr) downloaded and sent log again to hp. moved dimm again per hp request.. Error continues DIMM 7 to slot 9 and DIMM 9 to slot 7 DIMM 8 to sl... [12:05:31] ACKNOWLEDGEMENT - PHP opcache health on parse2001 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% Effie Mouzeli This is a new host, it is fine https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:05:36] ok, seems to work as it should [12:05:42] syncing [12:05:55] (03CR) 10Muehlenhoff: [C: 03+2] Revert "Move s-nail (providing mail(1)) to standard packages" [puppet] - 10https://gerrit.wikimedia.org/r/645061 (owner: 10Muehlenhoff) [12:06:06] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Tested locally in minikube, works. 
+2ing" [deployment-charts] - 10https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672) (owner: 10KartikMistry) [12:07:29] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:645059|Enable implicit description usage (T267745)]] (duration: 01m 12s) [12:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:36] T267745: Enable implicit description use in production - https://phabricator.wikimedia.org/T267745 [12:07:38] (03Merged) 10jenkins-bot: Add apertium helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672) (owner: 10KartikMistry) [12:07:58] testing again [12:09:33] seems to work \o/ [12:09:46] !log EU backport+config window done [12:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:54] effie: you’re free to go :) [12:10:07] tx tx [12:10:14] RECOVERY - Host mw1304.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [12:10:57] !log disable puppet on jobrunners and parsoid - T244340 [12:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:05] T244340: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 [12:11:14] 10Operations, 10ops-eqiad: mw1304.mgmt down - https://phabricator.wikimedia.org/T269050 (10Jclark-ctr) 05Open→03Resolved replaced mgmt cable. 
error cleared in icinga [12:15:14] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: enable onhost memcached on parsoid servers [puppet] - 10https://gerrit.wikimedia.org/r/643235 (https://phabricator.wikimedia.org/T244340) (owner: 10Effie Mouzeli) [12:15:37] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: enable onhost memcached on jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/643234 (https://phabricator.wikimedia.org/T244340) (owner: 10Effie Mouzeli) [12:17:03] (03PS2) 10Muehlenhoff: CAS support for debmonitor, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/644833 [12:17:06] (03CR) 10JMeybohm: [C: 04-1] deployment_server: Add k8s-staging-codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645052 (https://phabricator.wikimedia.org/T244335) (owner: 10Alexandros Kosiaris) [12:17:11] (03PS2) 10Effie Mouzeli: hieradata: enable onhost memcached on parsoid servers [puppet] - 10https://gerrit.wikimedia.org/r/643235 (https://phabricator.wikimedia.org/T244340) [12:17:19] !log move aqs1006 to rack D6 - T267065 [12:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:26] D6: Interactive deployment shell aka iscap - https://phabricator.wikimedia.org/D6 [12:17:26] T267065: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 [12:19:51] (03PS1) 10Alexandros Kosiaris: Add tokens and users for 3 new k8s services [puppet] - 10https://gerrit.wikimedia.org/r/645064 (https://phabricator.wikimedia.org/T265893) [12:20:22] (03PS2) 10Jbond: P:puppetdb: increase node-ttl to 14d and decrease node-purge-ttl to 1d [puppet] - 10https://gerrit.wikimedia.org/r/645058 [12:21:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26892/console" [puppet] - 10https://gerrit.wikimedia.org/r/645058 (owner: 10Jbond) [12:21:38] (03CR) 10Alexandros Kosiaris: "Bundling these 3 together to batch the work required. 
Adding reviewers so they can object to my naming scheme." [puppet] - 10https://gerrit.wikimedia.org/r/645064 (https://phabricator.wikimedia.org/T265893) (owner: 10Alexandros Kosiaris) [12:26:02] (03CR) 10Alexandros Kosiaris: [V: 03+1] deployment_server: Add k8s-staging-codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645052 (https://phabricator.wikimedia.org/T244335) (owner: 10Alexandros Kosiaris) [12:29:13] (03CR) 10Muehlenhoff: add configuration for puppetdb cleanup TTLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645047 (owner: 10Jbond) [12:29:18] (03PS3) 10Alexandros Kosiaris: k8s: Remove default values for some parameters [puppet] - 10https://gerrit.wikimedia.org/r/645048 (https://phabricator.wikimedia.org/T244335) [12:29:20] (03PS3) 10Alexandros Kosiaris: profile::kubernetes: fold infrastructure_config to profile [puppet] - 10https://gerrit.wikimedia.org/r/645049 (https://phabricator.wikimedia.org/T244335) [12:29:22] (03PS3) 10Alexandros Kosiaris: profile::kubernetes::node: Remove toolforge customizations [puppet] - 10https://gerrit.wikimedia.org/r/645050 (https://phabricator.wikimedia.org/T244335) [12:29:24] (03PS3) 10Alexandros Kosiaris: scap-helm: Remove the last vestige of it [puppet] - 10https://gerrit.wikimedia.org/r/645051 [12:29:26] (03PS3) 10Alexandros Kosiaris: kube-apiserver: Use the infrastructure users file directly [puppet] - 10https://gerrit.wikimedia.org/r/645053 (https://phabricator.wikimedia.org/T244335) [12:29:28] (03PS3) 10Alexandros Kosiaris: k8s:apiserver: Manage kube user/group [puppet] - 10https://gerrit.wikimedia.org/r/645054 (https://phabricator.wikimedia.org/T244335) [12:29:31] (03PS3) 10Alexandros Kosiaris: deployment_server: Add k8s-staging-codfw [puppet] - 10https://gerrit.wikimedia.org/r/645052 (https://phabricator.wikimedia.org/T244335) [12:33:28] PROBLEM - cassandra service on maps1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:33:42] PROBLEM - tilerator on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [12:33:51] (03PS4) 10Ayounsi: Run Homer during the decom cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/643979 [12:34:16] (03CR) 10Ayounsi: Run Homer during the decom cookbook (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/643979 (owner: 10Ayounsi) [12:34:24] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:34:31] (03PS5) 10Ayounsi: Run Homer during the decom cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/643979 [12:36:21] (03CR) 10jerkins-bot: [V: 04-1] Run Homer during the decom cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/643979 (owner: 10Ayounsi) [12:37:06] !log installing jupyter-notebook security updates on Stretch [12:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:47] (03PS1) 10Jbond: disable-puppet: use a user of root if no sudo_user [puppet] - 10https://gerrit.wikimedia.org/r/645068 [12:49:53] (03PS2) 10Jbond: disable-puppet: use a user of root if no sudo_user [puppet] - 10https://gerrit.wikimedia.org/r/645068 [12:50:27] effie: volans: ^^ [12:51:48] (03CR) 10Kosta Harlan: Add tokens and users for 3 new k8s services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645064 (https://phabricator.wikimedia.org/T265893) (owner: 10Alexandros Kosiaris) [12:52:28] (03CR) 10Effie Mouzeli: disable-puppet: use a user of root if no sudo_user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645068 (owner: 10Jbond) [12:54:26] (03CR) 10Jbond: [C: 03+2] puppet-merge: readd check for unbounded variables [puppet] - 10https://gerrit.wikimedia.org/r/643746 (owner: 10Jbond) 
[12:55:30] PROBLEM - Host db1108.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:55:45] (03PS1) 10Jbond: test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/645069 [12:56:18] (03CR) 10Jbond: [V: 03+2 C: 03+2] test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/645069 (owner: 10Jbond) [12:57:02] (03PS1) 10Jbond: Revert "test puppet-merge" [puppet] - 10https://gerrit.wikimedia.org/r/644974 [12:58:25] (03CR) 10Jbond: [C: 03+2] Revert "test puppet-merge" [puppet] - 10https://gerrit.wikimedia.org/r/644974 (owner: 10Jbond) [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201203T1300) [13:00:21] !log move db1108 to C3 - T267065 [13:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:29] T267065: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 [13:00:38] RECOVERY - Host db1108.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.54 ms [13:02:31] (03PS1) 10Muehlenhoff: Move bsd-mailx (providing mail(1)) to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/645071 (https://phabricator.wikimedia.org/T268725) [13:05:23] 10Operations, 10Domains, 10Traffic: URL to redirect to upcoming Wikipedia Birthday page on wikimediafoundation.org - https://phabricator.wikimedia.org/T264367 (10hdothiduc) > @hdothiduc Alright, thank you. You're welcome. To clarify for us it's actually easiest to complete it now but either way is no problem... 
[13:07:04] 10Operations, 10SRE-Access-Requests: Change production access ssh key for wmde-leszek - https://phabricator.wikimedia.org/T269351 (10WMDE-leszek) [13:08:02] 10Operations, 10SRE-Access-Requests: Change production access ssh key for wmde-leszek - https://phabricator.wikimedia.org/T269351 (10WMDE-leszek) [13:09:46] !log uploaded jenkins 2.263.1 to apt.wikimedia.org component/ci [13:09:48] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/622666 (owner: 10Dzahn) [13:09:50] (03PS6) 10Ayounsi: Run Homer during the decom cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/643979 [13:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:52] (03CR) 10BBlack: [C: 03+1] vcl: remove legacy temporary parameter workaround [puppet] - 10https://gerrit.wikimedia.org/r/644465 (owner: 10Ema) [13:11:29] (03CR) 10Jbond: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/554825 (https://phabricator.wikimedia.org/T211750) (owner: 10Jbond) [13:11:45] (03PS1) 10Ssingh: admin: update ssh key for wmde-leszek [puppet] - 10https://gerrit.wikimedia.org/r/645073 (https://phabricator.wikimedia.org/T269351) [13:12:49] (03CR) 10BBlack: [C: 03+1] vcl: move /static Host normalization to cluster_fe_recv_pre_purge [puppet] - 10https://gerrit.wikimedia.org/r/644466 (https://phabricator.wikimedia.org/T130904) (owner: 10Ema) [13:16:26] (03CR) 10Marostegui: "What's the difference with https://gerrit.wikimedia.org/r/c/operations/puppet/+/645055?" 
[puppet] - 10https://gerrit.wikimedia.org/r/645071 (https://phabricator.wikimedia.org/T268725) (owner: 10Muehlenhoff) [13:16:48] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [13:17:15] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/644861 (owner: 10Jcrespo) [13:18:52] PROBLEM - Host druid1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:19:19] this is me --^ [13:19:32] (03CR) 10Jbond: disable-puppet: use a user of root if no sudo_user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645068 (owner: 10Jbond) [13:19:48] (03CR) 10Muehlenhoff: "645055 was reverted. I was misled by the package description: s-nail is an implementation of mail(1), but with slight differences, so ther" [puppet] - 10https://gerrit.wikimedia.org/r/645071 (https://phabricator.wikimedia.org/T268725) (owner: 10Muehlenhoff) [13:21:01] (03CR) 10Marostegui: [C: 03+1] "Thanks! 
I didn't get the notification for the revert so it looked strange :-)" [puppet] - 10https://gerrit.wikimedia.org/r/645071 (https://phabricator.wikimedia.org/T268725) (owner: 10Muehlenhoff) [13:24:08] RECOVERY - Host druid1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [13:24:36] !log Upgraded Jenkins on releases2002 (spare server) # T269352 [13:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:46] T269352: Upgrade Jenkins to 2.263.1 - https://phabricator.wikimedia.org/T269352 [13:24:48] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645068 (owner: 10Jbond) [13:24:56] (03PS4) 10Jbond: add configuration for puppetdb cleanup TTLs [puppet] - 10https://gerrit.wikimedia.org/r/645047 [13:25:19] (03CR) 10Jbond: "corrected" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645047 (owner: 10Jbond) [13:29:56] !log Upgraded Jenkins on releases1002 # T269352 [13:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:03] T269352: Upgrade Jenkins to 2.263.1 - https://phabricator.wikimedia.org/T269352 [13:30:37] !log puppet enabled on jobrunners [13:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:21] (03CR) 10ArielGlenn: "We do need the actual mail package for buster, that seems to be gone now?" [puppet] - 10https://gerrit.wikimedia.org/r/645056 (owner: 10Muehlenhoff) [13:36:42] 10Operations, 10Traffic: Package and deploy varnish 6.0.7 - https://phabricator.wikimedia.org/T268736 (10ema) 05Open→03Resolved a:03ema 6.0.7-1wm1 deployed fleet-wide, closing. 
[13:39:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1089', diff saved to https://phabricator.wikimedia.org/P13522 and previous config saved to /var/cache/conftool/dbconfig/20201203-133953-marostegui.json [13:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1089', diff saved to https://phabricator.wikimedia.org/P13523 and previous config saved to /var/cache/conftool/dbconfig/20201203-134724-marostegui.json [13:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:20] (03CR) 10Volans: "Are you just replicating the current values for now?" [puppet] - 10https://gerrit.wikimedia.org/r/645047 (owner: 10Jbond) [13:51:23] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:54:10] (03CR) 10Muehlenhoff: [C: 04-1] "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/645056 (owner: 10Muehlenhoff) [13:54:33] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:54:58] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/645058 (owner: 10Jbond) [13:55:25] (03PS1) 10Hashar: jenkins: support changing $JAVA_HOME [puppet] - 10https://gerrit.wikimedia.org/r/645075 [13:56:13] (03PS5) 10Volans: sre.hosts.downtime: convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212) [13:57:11] (03PS3) 10Jbond: P:puppetdb: increase node-ttl to 14d and decrease node-purge-ttl to 1d [puppet] - 10https://gerrit.wikimedia.org/r/645058 [13:57:30] (03PS1) 10KartikMistry: Add apertium helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/645076 (https://phabricator.wikimedia.org/T255672) [13:57:32] (03CR) 10Jbond: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/645047 (owner: 10Jbond) [13:58:07] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [13:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:45] (03CR) 10jerkins-bot: [V: 04-1] Add apertium helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/645076 (https://phabricator.wikimedia.org/T255672) (owner: 10KartikMistry) [14:00:04] hashar and twentyafterfour: It is that lovely time of the day again! You are hereby commanded to deploy Mediawiki train - European+American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201203T1400). 
[14:00:08] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:11] train is blocked I think [14:04:08] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/645075 (owner: 10Hashar) [14:09:54] (03PS2) 10Hashar: jenkins: support changing $JAVA_HOME [puppet] - 10https://gerrit.wikimedia.org/r/645075 (https://phabricator.wikimedia.org/T269354) [14:11:10] (03CR) 10Hashar: "Attached to T269354" [puppet] - 10https://gerrit.wikimedia.org/r/645075 (https://phabricator.wikimedia.org/T269354) (owner: 10Hashar) [14:11:27] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/645075 (https://phabricator.wikimedia.org/T269354) (owner: 10Hashar) [14:13:44] (03PS1) 10JMeybohm: Remove calico::builder [puppet] - 10https://gerrit.wikimedia.org/r/645078 (https://phabricator.wikimedia.org/T266893) [14:15:27] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Refactor calico deploy strategy - https://phabricator.wikimedia.org/T267653 (10JMeybohm) a:03JMeybohm [14:16:56] 1q [14:17:03] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add apertium helmfile.d (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/645076 (https://phabricator.wikimedia.org/T255672) (owner: 10KartikMistry) [14:19:36] (03PS1) 10Papaul: Add logstash203[345] to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/645079 (https://phabricator.wikimedia.org/T267420) [14:23:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ldap site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:23:29] PROBLEM - Labs LDAP on serpens is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [14:23:36] (03CR) 10Papaul: [C: 03+2] Add logstash203[345] 
to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/645079 (https://phabricator.wikimedia.org/T267420) (owner: 10Papaul) [14:25:22] (03CR) 10Kormat: [C: 03+1] admin: update ssh key for wmde-leszek [puppet] - 10https://gerrit.wikimedia.org/r/645073 (https://phabricator.wikimedia.org/T269351) (owner: 10Ssingh) [14:25:55] (03CR) 10Ssingh: [C: 03+2] admin: update ssh key for wmde-leszek [puppet] - 10https://gerrit.wikimedia.org/r/645073 (https://phabricator.wikimedia.org/T269351) (owner: 10Ssingh) [14:25:58] (03PS6) 10JMeybohm: Add calico helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644462 (https://phabricator.wikimedia.org/T267653) [14:27:57] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Change production access ssh key for wmde-leszek - https://phabricator.wikimedia.org/T269351 (10ssingh) 05Open→03Resolved Key confirmed out-of-band; request merged. Thanks! Feel free to reopen if there are any issues. [14:29:52] (03PS1) 10Volans: wmf-autp-reimage: prevent race condition [puppet] - 10https://gerrit.wikimedia.org/r/645080 (https://phabricator.wikimedia.org/T269187) [14:30:26] (03CR) 10Muehlenhoff: [C: 03+2] CAS support for debmonitor, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/644833 (owner: 10Muehlenhoff) [14:30:28] (03PS2) 10KartikMistry: Add apertium helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/645076 (https://phabricator.wikimedia.org/T255672) [14:31:45] (03CR) 10jerkins-bot: [V: 04-1] Add apertium helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/645076 (https://phabricator.wikimedia.org/T255672) (owner: 10KartikMistry) [14:32:15] (03PS3) 10KartikMistry: Add apertium helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/645076 (https://phabricator.wikimedia.org/T255672) [14:32:57] (03CR) 10Alexandros Kosiaris: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/645076 (https://phabricator.wikimedia.org/T255672) (owner: 10KartikMistry) [14:33:39] (03CR) 
10jerkins-bot: [V: 04-1] Add apertium helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/645076 (https://phabricator.wikimedia.org/T255672) (owner: 10KartikMistry) [14:34:02] !log stop zookeeper and etcd on conf1005 as prep-step before rack move [14:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:38] (03PS1) 10Jbond: early_command: configure static, mapped ipv6 address [puppet] - 10https://gerrit.wikimedia.org/r/645081 [14:37:03] 10Operations, 10SRE-Access-Requests: Requesting access to maps for mbsantos and jgiannelos - https://phabricator.wikimedia.org/T269357 (10MSantos) [14:37:25] (03CR) 10jerkins-bot: [V: 04-1] early_command: configure static, mapped ipv6 address [puppet] - 10https://gerrit.wikimedia.org/r/645081 (owner: 10Jbond) [14:38:21] (03PS1) 10Muehlenhoff: Fix CustomLog directive [puppet] - 10https://gerrit.wikimedia.org/r/645082 [14:38:30] (03CR) 10Jbond: [C: 03+1] "this looks fine, hopefully we can remove the hack if we get the following working" [puppet] - 10https://gerrit.wikimedia.org/r/645080 (https://phabricator.wikimedia.org/T269187) (owner: 10Volans) [14:38:48] jouncebot now [14:38:49] For the next 1 hour(s) and 21 minute(s): Mediawiki train - European+American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201203T1400) [14:38:50] jouncebot next [14:38:51] In 2 hour(s) and 21 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201203T1700) [14:39:54] (03PS2) 10Jbond: early_command: configure static, mapped ipv6 address [puppet] - 10https://gerrit.wikimedia.org/r/645081 [14:40:15] (03CR) 10Volans: [C: 03+2] wmf-autp-reimage: prevent race condition [puppet] - 10https://gerrit.wikimedia.org/r/645080 (https://phabricator.wikimedia.org/T269187) (owner: 10Volans) [14:40:35] ah a typo! 
[14:40:50] (03CR) 10Razzi: [C: 03+2] superset: add cache to an-tool1010 [puppet] - 10https://gerrit.wikimedia.org/r/644672 (https://phabricator.wikimedia.org/T268784) (owner: 10Razzi) [14:40:52] (03PS12) 10Razzi: superset: add cache to an-tool1010 [puppet] - 10https://gerrit.wikimedia.org/r/644672 (https://phabricator.wikimedia.org/T268784) [14:40:54] (03CR) 10jerkins-bot: [V: 04-1] early_command: configure static, mapped ipv6 address [puppet] - 10https://gerrit.wikimedia.org/r/645081 (owner: 10Jbond) [14:41:44] (03CR) 10Muehlenhoff: [C: 03+2] Fix CustomLog directive [puppet] - 10https://gerrit.wikimedia.org/r/645082 (owner: 10Muehlenhoff) [14:42:06] 10Operations, 10SRE-Access-Requests: Requesting access to maps for mbsantos and jgiannelos - https://phabricator.wikimedia.org/T269357 (10MSantos) [14:42:59] 10Operations, 10Domains, 10Traffic: URL to redirect to upcoming Wikipedia Birthday page on wikimediafoundation.org - https://phabricator.wikimedia.org/T264367 (10Dzahn) @hdothiduc I can can confirm going to https://wikimediafoundation.org/wikipedia20 redirects me to just https://wikimediafoundation.org (but... [14:46:59] !log rolling depool and pool of parsoid servers [14:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:11] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:47:33] 10Operations, 10SRE-Access-Requests: Requesting access to maps for mbsantos and jgiannelos - https://phabricator.wikimedia.org/T269357 (10dcipoletti) p:05Triage→03High [14:47:50] 10Operations, 10SRE-Access-Requests: Requesting access to maps for mbsantos and jgiannelos - https://phabricator.wikimedia.org/T269357 (10MoritzMuehlenhoff) Is this specifically only about maps2007? Given that you're both members of the maps-admin group you can log into any maps host, but maps2007 in particula... 
[14:48:43] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:49:32] (03PS3) 10Jbond: early_command: configure static, mapped ipv6 address [puppet] - 10https://gerrit.wikimedia.org/r/645081 [14:49:43] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install logstash203[345] - https://phabricator.wikimedia.org/T267420 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` logstash2033.codfw.wmnet ` The log can be found... [14:50:17] PROBLEM - Host conf1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:52:35] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/645081 (owner: 10Jbond) [14:52:59] (03CR) 10Alexandros Kosiaris: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/645076 (https://phabricator.wikimedia.org/T255672) (owner: 10KartikMistry) [14:53:09] 10Operations, 10SRE-Access-Requests: Requesting access to maps for mbsantos and jgiannelos - https://phabricator.wikimedia.org/T269357 (10MSantos) >>! In T269357#6666829, @MoritzMuehlenhoff wrote: > Is this specifically only about maps2007? Given that you're both members of the maps-admin group you can log int... 
[14:55:20] RECOVERY - Host conf1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [14:55:20] (03PS4) 10Alexandros Kosiaris: Add apertium helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/645076 (https://phabricator.wikimedia.org/T255672) (owner: 10KartikMistry) [14:55:52] (03CR) 10Jbond: [C: 03+2] add configuration for puppetdb cleanup TTLs [puppet] - 10https://gerrit.wikimedia.org/r/645047 (owner: 10Jbond) [14:57:26] (03CR) 10Jbond: "Will merege this for now can always make a second update to error if no message" [puppet] - 10https://gerrit.wikimedia.org/r/645068 (owner: 10Jbond) [14:57:29] (03CR) 10Jbond: [C: 03+2] disable-puppet: use a user of root if no sudo_user [puppet] - 10https://gerrit.wikimedia.org/r/645068 (owner: 10Jbond) [14:57:59] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Definitely fixed an issue in CI in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/645076 and code LGTM, so +2ing" [deployment-charts] - 10https://gerrit.wikimedia.org/r/644474 (owner: 10JMeybohm) [14:58:57] (03CR) 10Jbond: [C: 03+2] P:puppetdb: increase node-ttl to 14d and decrease node-purge-ttl to 1d [puppet] - 10https://gerrit.wikimedia.org/r/645058 (owner: 10Jbond) [14:59:16] (03Merged) 10jenkins-bot: Run a helm repo update before linting [deployment-charts] - 10https://gerrit.wikimedia.org/r/644474 (owner: 10JMeybohm) [14:59:39] !log disable puppet fleet wide to role out puppetdb node-ttl change [14:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:46] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Unknown error executing dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:01:09] (03CR) 10Alexandros Kosiaris: [C: 03+2] "And by being rebased on top of the helm repo update change, CI +2ed this. 
It also LGTM, so +2ing" [deployment-charts] - 10https://gerrit.wikimedia.org/r/645076 (https://phabricator.wikimedia.org/T255672) (owner: 10KartikMistry) [15:01:23] (03CR) 10Ema: [C: 03+2] vcl: remove legacy temporary parameter workaround [puppet] - 10https://gerrit.wikimedia.org/r/644465 (owner: 10Ema) [15:01:35] elukey: conftool has been behaving funnily [15:01:37] (03PS3) 10Ema: vcl: move /static Host normalization to cluster_fe_recv_pre_purge [puppet] - 10https://gerrit.wikimedia.org/r/644466 (https://phabricator.wikimedia.org/T130904) [15:01:41] is that expected? [15:02:11] (03CR) 10Ema: [C: 03+2] vcl: move /static Host normalization to cluster_fe_recv_pre_purge [puppet] - 10https://gerrit.wikimedia.org/r/644466 (https://phabricator.wikimedia.org/T130904) (owner: 10Ema) [15:02:39] (03Merged) 10jenkins-bot: Add apertium helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/645076 (https://phabricator.wikimedia.org/T255672) (owner: 10KartikMistry) [15:05:31] effie: there's some maintenance on the conf* nodes, see -sre [15:05:43] yeah I saw it on sall [15:05:44] sal* [15:05:54] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:06:06] I will take the conversation there, thank you [15:06:15] ack [15:06:26] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [15:06:29] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:21] (03PS1) 10Andrew Bogott: haproxy: reduce log persistence to 14 days [puppet] - 10https://gerrit.wikimedia.org/r/645096 (https://phabricator.wikimedia.org/T269252) [15:11:56] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install logstash203[345] - 
https://phabricator.wikimedia.org/T267420 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['logstash2033.codfw.wmnet'] ` Of which those **FAILED**: ` ['logstash2033.codfw.wmnet'] ` [15:15:01] 10Operations, 10SRE-Access-Requests: Requesting access to maps for mbsantos and jgiannelos - https://phabricator.wikimedia.org/T269357 (10akosiaris) >>! In T269357#6666853, @MSantos wrote: >>>! In T269357#6666829, @MoritzMuehlenhoff wrote: >> Is this specifically only about maps2007? Given that you're both mem... [15:16:47] (03CR) 10Alexandros Kosiaris: [C: 03+1] "To my knowledge this has only been used once in WMCS and is no longer used since a long time ago. Should any resource be found to use it, " [puppet] - 10https://gerrit.wikimedia.org/r/645078 (https://phabricator.wikimedia.org/T266893) (owner: 10JMeybohm) [15:17:12] (03CR) 10Razzi: Ensure /tmp/sqoop-jars/ is present (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) (owner: 10Razzi) [15:18:05] (03CR) 10Alexandros Kosiaris: Add tokens and users for 3 new k8s services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645064 (https://phabricator.wikimedia.org/T265893) (owner: 10Alexandros Kosiaris) [15:18:16] (03PS2) 10MSantos: WIP: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) [15:19:36] (03PS2) 10Alexandros Kosiaris: Add tokens and users for 3 new k8s services [puppet] - 10https://gerrit.wikimedia.org/r/645064 (https://phabricator.wikimedia.org/T265893) [15:19:41] 10Operations, 10Traffic, 10Patch-For-Review: Host rewrite for /static/ not applied to purges - https://phabricator.wikimedia.org/T130904 (10ema) 05Open→03Resolved a:03ema I tried a test purge on cp3054: ` curl -v -X PURGE -H "Host: fr.wikipedia.org" http://127.0.0.1/static/ema-test ` And indeed varnish... 
[15:19:45] (03CR) 10jerkins-bot: [V: 04-1] WIP: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [15:20:18] (03CR) 10Ottomata: "One nit, but you know whatever! :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645038 (owner: 10Elukey) [15:20:49] (03CR) 10Alexandros Kosiaris: [C: 03+2] k8s: Remove default values for some parameters [puppet] - 10https://gerrit.wikimedia.org/r/645048 (https://phabricator.wikimedia.org/T244335) (owner: 10Alexandros Kosiaris) [15:21:02] (03CR) 10Alexandros Kosiaris: [C: 03+2] "PCC gave a +1, merging" [puppet] - 10https://gerrit.wikimedia.org/r/645048 (https://phabricator.wikimedia.org/T244335) (owner: 10Alexandros Kosiaris) [15:22:00] (03PS4) 10Muehlenhoff: dumps::generation::server::statsender: Drop OS check [puppet] - 10https://gerrit.wikimedia.org/r/645056 [15:22:47] (03CR) 10Razzi: [C: 03+1] hive: use an explicit parameter to deploy jdbc settings [puppet] - 10https://gerrit.wikimedia.org/r/645038 (owner: 10Elukey) [15:24:12] (03CR) 10Ottomata: hive: force the hive server to use the local metastore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645057 (https://phabricator.wikimedia.org/T268028) (owner: 10Elukey) [15:24:32] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26893/console" [puppet] - 10https://gerrit.wikimedia.org/r/645049 (https://phabricator.wikimedia.org/T244335) (owner: 10Alexandros Kosiaris) [15:26:04] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] "PCC just +1ed it, +2ing and merging" [puppet] - 10https://gerrit.wikimedia.org/r/645049 (https://phabricator.wikimedia.org/T244335) (owner: 10Alexandros Kosiaris) [15:26:39] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Has already been +1ed by PCC, +2ing" [puppet] - 10https://gerrit.wikimedia.org/r/645050 (https://phabricator.wikimedia.org/T244335) (owner: 10Alexandros 
Kosiaris) [15:26:47] (03PS4) 10Alexandros Kosiaris: profile::kubernetes::node: Remove toolforge customizations [puppet] - 10https://gerrit.wikimedia.org/r/645050 (https://phabricator.wikimedia.org/T244335) [15:26:56] (03PS1) 10Jbond: enable-puppet: Add sudo_user to enable-puppet command as well [puppet] - 10https://gerrit.wikimedia.org/r/645100 [15:27:19] PROBLEM - Memcached on an-tool1010 is CRITICAL: connect to address 10.64.32.81 and port 11211: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [15:29:12] (03CR) 10Alexandros Kosiaris: [C: 03+2] scap-helm: Remove the last vestige of it [puppet] - 10https://gerrit.wikimedia.org/r/645051 (owner: 10Alexandros Kosiaris) [15:32:56] (03CR) 10Bstorm: [C: 03+1] "I think this should work https://puppet-compiler.wmflabs.org/compiler1003/26896/" [puppet] - 10https://gerrit.wikimedia.org/r/622666 (owner: 10Dzahn) [15:33:04] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install frqueue100[34] - https://phabricator.wikimedia.org/T266365 (10Cmjohnson) [15:35:44] 10Operations, 10SRE-Access-Requests: Requesting access to maps for mbsantos and jgiannelos - https://phabricator.wikimedia.org/T269357 (10MSantos) >>! In T269357#6666885, @akosiaris wrote: >>>! In T269357#6666853, @MSantos wrote: >>>>! In T269357#6666829, @MoritzMuehlenhoff wrote: >>> Is this specifically only... [15:36:15] (03CR) 10Jbond: [C: 03+2] enable-puppet: Add sudo_user to enable-puppet command as well [puppet] - 10https://gerrit.wikimedia.org/r/645100 (owner: 10Jbond) [15:36:37] godog: effie: [15:36:41] sorry [15:36:52] been a while though go.dog ;) [15:36:58] <3 [15:38:07] (03CR) 10Bstorm: "Adding some folks from Data Persistence for review since they use this class as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/645096 (https://phabricator.wikimedia.org/T269252) (owner: 10Andrew Bogott) [15:41:55] (03PS1) 10Volans: wmf-auto-reimage: safer output parsing [puppet] - 10https://gerrit.wikimedia.org/r/645102 (https://phabricator.wikimedia.org/T269187) [15:42:21] jbond42: lol <3 [15:44:06] (03PS3) 10MSantos: WIP: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) [15:44:10] (03CR) 10Elukey: hive: force the hive server to use the local metastore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645057 (https://phabricator.wikimedia.org/T268028) (owner: 10Elukey) [15:45:01] (03CR) 10Bstorm: [C: 03+2] labstore: add data types and some other style fixes [puppet] - 10https://gerrit.wikimedia.org/r/622666 (owner: 10Dzahn) [15:45:18] (03CR) 10Elukey: hive: use an explicit parameter to deploy jdbc settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645038 (owner: 10Elukey) [15:45:33] (03CR) 10jerkins-bot: [V: 04-1] WIP: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [15:45:51] 10Operations, 10LDAP-Access-Requests: Onboarding Genoveva, access request to ldap/wmf - https://phabricator.wikimedia.org/T269365 (10gengh) [15:45:59] !log moved conf1005 to rack B3 - T267065 [15:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:07] T267065: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 [15:46:56] (03PS1) 10Jbond: enable-puppet: fix force support [puppet] - 10https://gerrit.wikimedia.org/r/645103 [15:46:58] (03PS1) 10Jbond: disable-puppet: make message mandatory [puppet] - 10https://gerrit.wikimedia.org/r/645104 [15:47:00] effie volans ^^ [15:47:12] !log updated thirdparty/postgres96 to 9.6.20-1.pgdg100+1 9.6.17-2.pgdg100+1 [15:47:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:45] (03CR) 10Effie Mouzeli: [C: 03+1] disable-puppet: make message mandatory [puppet] - 10https://gerrit.wikimedia.org/r/645104 (owner: 10Jbond) [15:48:19] PROBLEM - debmonitor.wikimedia.org requires authentication on debmonitor1002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 400 Bad Request https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:48:38] 10Operations, 10serviceops, 10Datacenter-Switchover: Updates to warmup script - https://phabricator.wikimedia.org/T269179 (10RLazarus) [15:49:23] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/645102 (https://phabricator.wikimedia.org/T269187) (owner: 10Volans) [15:49:35] jouncebot: next [15:49:35] In 1 hour(s) and 10 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201203T1700) [15:49:38] jouncebot: now [15:49:39] For the next 0 hour(s) and 10 minute(s): Mediawiki train - European+American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201203T1400) [15:49:50] ^ debmonitor alert is harmless, silencing [15:50:23] ACKNOWLEDGEMENT - debmonitor.wikimedia.org requires authentication on debmonitor1002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 400 Bad Request Muehlenhoff in setup, still LDAP in place https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:51:09] (03CR) 10Volans: [C: 03+2] wmf-auto-reimage: safer output parsing [puppet] - 10https://gerrit.wikimedia.org/r/645102 (https://phabricator.wikimedia.org/T269187) (owner: 10Volans) [15:51:26] (03PS6) 10Jbond: labstore::nfs_mount: drop support for empty string share_path [puppet] - 10https://gerrit.wikimedia.org/r/630589 [15:52:01] !log upgrading labweb* to ICU 63 - T264991 [15:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:09] T264991: Upgrade the 
MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 [15:52:54] RECOVERY - Memcached on an-tool1010 is OK: TCP OK - 0.000 second response time on 10.64.32.81 port 11211 https://wikitech.wikimedia.org/wiki/Memcached [15:54:14] (03CR) 10Effie Mouzeli: [C: 03+2] hiera: enable icu 63 component on labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/644248 (https://phabricator.wikimedia.org/T264991) (owner: 10Effie Mouzeli) [15:54:47] (03CR) 10Elukey: hive: force the hive server to use the local metastore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645057 (https://phabricator.wikimedia.org/T268028) (owner: 10Elukey) [15:56:22] (03CR) 10Elukey: Ensure /tmp/sqoop-jars/ is present (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) (owner: 10Razzi) [16:03:17] (03CR) 10Hashar: "https://puppet-compiler.wmflabs.org/compiler1003/645/" [puppet] - 10https://gerrit.wikimedia.org/r/645075 (https://phabricator.wikimedia.org/T269354) (owner: 10Hashar) [16:04:40] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:12:43] something weird happening on s7, reads increased x8 [16:13:03] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=8&orgId=1&from=1606990378452&to=1607011978452&var-site=eqiad&var-group=core&var-shard=All&var-role=All [16:14:45] (03PS1) 10Arturo Borrero Gonzalez: openstack: l3 agent: refresh conntrackd configuration [puppet] - 10https://gerrit.wikimedia.org/r/645107 (https://phabricator.wikimedia.org/T268335) [16:15:00] (03CR) 10Mstyles: Add new helm chart for rdf-streaming-updater (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: 10Mstyles) [16:15:30] 10Operations, 10SRE-Access-Requests: Requesting access to maps for mbsantos and jgiannelos - https://phabricator.wikimedia.org/T269357 (10Jgiannelos) Regarding the packaging part, here is some context about the state of imposm3 and debian: https://phabricator.wikimedia.org/T238753 Just to add to what Mateus sa... [16:15:36] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for WMDE-leszek - https://phabricator.wikimedia.org/T269284 (10WMDE-leszek) Thanks a lot @ssingh, all seems to work! 
[16:17:51] (03PS4) 10Bstorm: wmcs: set the add_wiki cookbook to only run meta_p on some hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312) [16:20:12] (03CR) 10Bstorm: [C: 03+2] wikireplicas: add cumin aliases that include multiinstance servers [puppet] - 10https://gerrit.wikimedia.org/r/644606 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [16:21:12] (03PS2) 10Arturo Borrero Gonzalez: openstack: l3 agent: refresh conntrackd configuration [puppet] - 10https://gerrit.wikimedia.org/r/645107 (https://phabricator.wikimedia.org/T268335) [16:22:18] (03CR) 10Bstorm: [C: 03+2] wmcs: set the add_wiki cookbook to only run meta_p on some hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [16:23:53] (03Merged) 10jenkins-bot: wmcs: set the add_wiki cookbook to only run meta_p on some hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [16:24:26] (03PS3) 10Arturo Borrero Gonzalez: openstack: l3 agent: refresh conntrackd configuration [puppet] - 10https://gerrit.wikimedia.org/r/645107 (https://phabricator.wikimedia.org/T268335) [16:25:36] 10Operations, 10LDAP-Access-Requests: Onboarding Genoveva, access request to ldap/wmf - https://phabricator.wikimedia.org/T269365 (10Jdforrester-WMF) Confirming as team Tech Lead that Geno is new staff on our team (Abstract Wikipedia). 
[16:26:19] (03PS4) 10Arturo Borrero Gonzalez: openstack: l3 agent: refresh conntrackd configuration [puppet] - 10https://gerrit.wikimedia.org/r/645107 (https://phabricator.wikimedia.org/T268335) [16:27:36] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/26900/" [puppet] - 10https://gerrit.wikimedia.org/r/645107 (https://phabricator.wikimedia.org/T268335) (owner: 10Arturo Borrero Gonzalez) [16:27:55] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] openstack: l3 agent: refresh conntrackd configuration [puppet] - 10https://gerrit.wikimedia.org/r/645107 (https://phabricator.wikimedia.org/T268335) (owner: 10Arturo Borrero Gonzalez) [16:28:00] 10Operations, 10LDAP-Access-Requests: Onboarding Genoveva, access request to ldap/wmf - https://phabricator.wikimedia.org/T269365 (10dr0ptp4kt) Approved. [16:28:53] 10Operations, 10LDAP-Access-Requests: Onboarding Genoveva, access request to ldap/wmf - https://phabricator.wikimedia.org/T269365 (10Jdforrester-WMF) a:05dr0ptp4kt→03None Over to SRE. 
[16:29:56] 10Operations, 10LDAP-Access-Requests: Onboarding Genoveva, access request to ldap/wmf - https://phabricator.wikimedia.org/T269365 (10Jdforrester-WMF) [16:32:16] (03CR) 10Bstorm: "This does what it set out to do https://puppet-compiler.wmflabs.org/compiler1001/26901/" [puppet] - 10https://gerrit.wikimedia.org/r/643578 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [16:33:34] (03CR) 10Bstorm: [C: 03+2] wikireplicas: add maintain-meta_p only to s7 and legacy replicas [puppet] - 10https://gerrit.wikimedia.org/r/643578 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [16:33:49] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:50] (03CR) 10CRusnov: "> Patch Set 15: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [16:38:42] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install logstash203[345] - https://phabricator.wikimedia.org/T267420 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by volans on cumin2001.codfw.wmnet for hosts: ` logstash2034.codfw.wmnet ` The log can be found in `/var/log/wmf-auto... [16:38:47] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install logstash203[345] - https://phabricator.wikimedia.org/T267420 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['logstash2034.codfw.wmnet'] ` Of which those **FAILED**: ` ['logstash2034.codfw.wmnet'] ` [16:39:23] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install logstash203[345] - https://phabricator.wikimedia.org/T267420 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by volans on cumin2001.codfw.wmnet for hosts: ` logstash2034.codfw.wmnet ` The log can be found in `/var/log/wmf-auto... 
[16:39:28] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install logstash203[345] - https://phabricator.wikimedia.org/T267420 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['logstash2034.codfw.wmnet'] ` Of which those **FAILED**: ` ['logstash2034.codfw.wmnet'] ` [16:39:43] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install logstash203[345] - https://phabricator.wikimedia.org/T267420 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by volans on cumin2001.codfw.wmnet for hosts: ` logstash2034.codfw.wmnet ` The log can be found in `/var/log/wmf-auto... [16:40:12] (03PS1) 10Ladsgroup: superset: Migrate hiera() to lookup() and setting datatype [puppet] - 10https://gerrit.wikimedia.org/r/645110 (https://phabricator.wikimedia.org/T209953) [16:40:29] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for WMDE-leszek - https://phabricator.wikimedia.org/T269284 (10ssingh) 05Open→03Resolved Thanks for confirming! [16:41:06] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/645110 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [16:41:45] (03CR) 10jerkins-bot: [V: 04-1] superset: Migrate hiera() to lookup() and setting datatype [puppet] - 10https://gerrit.wikimedia.org/r/645110 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [16:43:51] (03PS2) 10Ladsgroup: superset: Migrate hiera() to lookup() and setting datatype [puppet] - 10https://gerrit.wikimedia.org/r/645110 (https://phabricator.wikimedia.org/T209953) [16:45:40] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/645110 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [16:47:34] (03CR) 10Ladsgroup: "PCC is happy 😄" [puppet] - 10https://gerrit.wikimedia.org/r/645110 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [16:47:44] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/645100 
(owner: 10Jbond) [16:48:24] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:50:50] (03CR) 10Ottomata: hive: force the hive server to use the local metastore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645057 (https://phabricator.wikimedia.org/T268028) (owner: 10Elukey) [16:51:49] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645103 (owner: 10Jbond) [16:52:18] (03CR) 10Volans: [C: 03+1] "I agree with the new behaviour" [puppet] - 10https://gerrit.wikimedia.org/r/645104 (owner: 10Jbond) [16:54:42] (03PS1) 10Jbond: DO NOT MERGE: change to check the difference paralle_spec makes [puppet] - 10https://gerrit.wikimedia.org/r/645112 [16:54:43] (03PS1) 10Jbond: DO NOT MERGE: change to check the difference paralle_spec makes [puppet] - 10https://gerrit.wikimedia.org/r/645113 [16:54:46] (03PS1) 10Bstorm: wikireplicas: let clouddb1020 join the party [puppet] - 10https://gerrit.wikimedia.org/r/645114 (https://phabricator.wikimedia.org/T268312) [16:55:28] (03PS2) 10Jbond: enable-puppet: fix force support [puppet] - 10https://gerrit.wikimedia.org/r/645103 [16:55:37] !log volans@cumin2001 START - Cookbook sre.hosts.downtime [16:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:30] !log volans@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:47] (03CR) 10Jbond: [C: 03+2] enable-puppet: fix force support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645103 (owner: 10Jbond) [16:57:59] 10Operations, 10ops-codfw: RMA failed codfw C7 switch - WMF6114 - https://phabricator.wikimedia.org/T267950 (10Papaul) Switch dropped at shipping, waiting for pickup [16:58:39] 10Operations, 10LDAP-Access-Requests: Onboarding Genoveva, access request to ldap/wmf - 
https://phabricator.wikimedia.org/T269365 (10ssingh) a:03ssingh [17:00:04] jbond42 and cdanis: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201203T1700). [17:00:13] (03PS2) 10Jbond: disable-puppet: make message mandatory [puppet] - 10https://gerrit.wikimedia.org/r/645104 [17:00:19] (03PS1) 10Ssingh: admin: add Genoveva Galarza to ldap_only_users (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/645116 (https://phabricator.wikimedia.org/T269365) [17:00:36] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [17:00:36] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:31] (03CR) 10Elukey: hive: force the hive server to use the local metastore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645057 (https://phabricator.wikimedia.org/T268028) (owner: 10Elukey) [17:04:34] PROBLEM - Check systemd state on ms-be1043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:12] (03CR) 10Dzahn: [C: 03+1] disable-puppet: make message mandatory [puppet] - 10https://gerrit.wikimedia.org/r/645104 (owner: 10Jbond) [17:08:14] (03CR) 10Jcrespo: [C: 03+1] disable-puppet: make message mandatory [puppet] - 10https://gerrit.wikimedia.org/r/645104 (owner: 10Jbond) [17:08:25] that's a popular patch [17:08:43] (03CR) 10Jcrespo: [C: 03+1] disable-puppet: make message mandatory [puppet] - 10https://gerrit.wikimedia.org/r/645104 (owner: 10Jbond) [17:08:53] hehe [17:09:13] :) [17:10:14] (03CR) 10Herron: [C: 03+1] "good call!" 
[puppet] - 10https://gerrit.wikimedia.org/r/645104 (owner: 10Jbond) [17:11:04] (03PS1) 10Elukey: icinga: add users razzi/Razzi and elukey to run commands [puppet] - 10https://gerrit.wikimedia.org/r/645120 [17:11:59] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install logstash203[345] - https://phabricator.wikimedia.org/T267420 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['logstash2034.codfw.wmnet'] ` and were **ALL** successful. [17:13:56] (03CR) 10Elukey: [C: 03+1] disable-puppet: make message mandatory [puppet] - 10https://gerrit.wikimedia.org/r/645104 (owner: 10Jbond) [17:16:32] RECOVERY - Check systemd state on ms-be1043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:47] (03PS1) 10Elukey: profile::hive::client: rename parameter [puppet] - 10https://gerrit.wikimedia.org/r/645122 [17:20:14] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [17:20:23] (03PS1) 10Ahmon Dancy: Fix formatting issues in README.rst [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/645123 [17:20:53] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26902/console" [puppet] - 10https://gerrit.wikimedia.org/r/645122 (owner: 10Elukey) [17:21:20] (03CR) 10Elukey: profile::hive::client: rename parameter [puppet] - 10https://gerrit.wikimedia.org/r/645122 (owner: 10Elukey) [17:32:20] !log robh@cumin1001 START - Cookbook sre.dns.netbox [17:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:51] !log cmjohnson@cumin1001 START - Cookbook 
sre.dns.netbox [17:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:46] !log robh@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10Platform Team Workboards (Green): eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (10elukey) Note for conf1006 - this node is set as target for pybal in ulsfo, and after a chat with @akosiaris it is not clear what happens... [17:35:48] (03PS1) 10Seddon: Undeploy graphoid for phase 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645125 (https://phabricator.wikimedia.org/T259207) [17:37:41] (03CR) 10Jbond: early_command: configure static, mapped ipv6 address (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/645081 (owner: 10Jbond) [17:39:03] (03PS2) 10Jbond: DO NOT MERGE: change to check the difference paralle_spec makes [puppet] - 10https://gerrit.wikimedia.org/r/645113 [17:39:46] (03CR) 10Dzahn: "I wish we could solve this by just _not_ making the login work with the wrong capitalization in the first place. How hard would that be? 
" [puppet] - 10https://gerrit.wikimedia.org/r/645120 (owner: 10Elukey) [17:41:13] (03PS4) 10Jbond: early_command: configure static, mapped ipv6 address [puppet] - 10https://gerrit.wikimedia.org/r/645081 [17:41:15] (03CR) 10Dzahn: "just make sure it doesn't break Icinga config because it's looking for a contact that doesn't exist" [puppet] - 10https://gerrit.wikimedia.org/r/645120 (owner: 10Elukey) [17:41:35] (03CR) 10Jbond: early_command: configure static, mapped ipv6 address (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/645081 (owner: 10Jbond) [17:43:38] (03CR) 10Elukey: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/645120 (owner: 10Elukey) [17:48:52] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:53] !log robh@cumin1001 START - Cookbook sre.dns.netbox [17:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:31] (03CR) 10Dzahn: "On the one hand this (logged in with Dzahn instead of dzahn and then notice I don't have privileges) happens to me all the time as well an" [puppet] - 10https://gerrit.wikimedia.org/r/645120 (owner: 10Elukey) [17:56:37] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:32] 10Operations, 10Graphoid, 10serviceops, 10Patch-For-Review, 10Platform Engineering (Icebox): Undeploy graphoid for phase 3 wiki's - https://phabricator.wikimedia.org/T259207 (10Jseddon) [17:59:45] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:05] chrisalbon and accraze: Your horoscope predicts another unfortunate Services – Graphoid / ORES deploy. May Zuul be (nice) with you. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201203T1800). [18:01:09] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [18:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:20] (03CR) 10Elukey: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/645120 (owner: 10Elukey) [18:03:10] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:20] 10Operations, 10Parsoid, 10serviceops: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10Dzahn) >>! In T268524#6665888, @MoritzMuehlenhoff wrote: > let's maybe reimage also parse2002 to show whether it's a reproducible error and if it happens again, keep it in the broken state... [18:04:28] !log volans@cumin1001 START - Cookbook sre.dns.netbox [18:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:48] (03CR) 10Dzahn: [C: 03+1] icinga: add users razzi/Razzi and elukey to run commands [puppet] - 10https://gerrit.wikimedia.org/r/645120 (owner: 10Elukey) [18:05:22] (03PS1) 10Jforrester: Use 'default' as default group when reading filters from history [extensions/AbuseFilter] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644984 (https://phabricator.wikimedia.org/T269314) [18:06:39] (03CR) 10Dzahn: [C: 03+1] "yea, mostly I just wanted to add the part about "check icinga config after merge" and the rest of that is totally low prio. 
+1" [puppet] - 10https://gerrit.wikimedia.org/r/645120 (owner: 10Elukey) [18:06:44] (03CR) 10Elukey: [C: 03+2] superset: Migrate hiera() to lookup() and setting datatype [puppet] - 10https://gerrit.wikimedia.org/r/645110 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [18:07:15] 10Operations, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research volunteer Swagoel - https://phabricator.wikimedia.org/T267314 (10Swagoel) Thank you! Everything works now. I'm resolving the task. [18:09:27] 10Operations, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research volunteer Swagoel - https://phabricator.wikimedia.org/T267314 (10Swagoel) 05Open→03Resolved [18:09:32] 10Operations, 10Parsoid, 10serviceops: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10jijiki) >>! In T268524#6667353, @Dzahn wrote: >>>! In T268524#6665888, @MoritzMuehlenhoff wrote: >> let's maybe reimage also parse2002 to show whether it's a reproducible error and if it h... [18:09:58] (03CR) 10Thcipriani: [C: 03+1] Fix formatting issues in README.rst [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/645123 (owner: 10Ahmon Dancy) [18:10:41] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10jijiki) [18:10:43] 10Operations, 10Parsoid, 10serviceops: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10Dzahn) Alright, will do that and start in a couple minutes. [18:11:24] (03PS5) 10Jbond: early_command: configure static, mapped ipv6 address [puppet] - 10https://gerrit.wikimedia.org/r/645081 [18:11:51] 10Operations, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10Eevans) At the very least, getting rid of these names would create inconvenience. 
There are lots of examples of maintenance and admin commands that run against an instance, or resolve IP... [18:17:05] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:50] 10Operations, 10Parsoid, 10serviceops: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10MoritzMuehlenhoff) >>! In T268524#6667353, @Dzahn wrote: >>>! In T268524#6665888, @MoritzMuehlenhoff wrote: >> let's maybe reimage also parse2002 to show whether it's a reproducible error... [18:23:00] 10Operations, 10Platform Engineering, 10serviceops: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (10jijiki) [18:23:31] 10Operations, 10Platform Engineering, 10serviceops: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (10jijiki) [18:23:37] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10jijiki) [18:24:28] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) [18:29:06] (03CR) 10Jbond: "The difference in run time between this patch and tha last one is 78s (1st ci job) - 54s (2nd ci job) +14 (additional time to install par" [puppet] - 10https://gerrit.wikimedia.org/r/645113 (owner: 10Jbond) [18:31:24] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) [18:32:55] (03PS1) 10RobH: adding in db115[1-5] mac address entries [puppet] - 10https://gerrit.wikimedia.org/r/645130 (https://phabricator.wikimedia.org/T267043) [18:34:26] (03PS1) 10Ayounsi: Netbox scripts, set commit_default = False [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/645131 [18:39:42] !log 
andrew@cumin1001 START - Cookbook sre.hosts.downtime [18:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:46] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:41] (03CR) 10Volans: [C: 03+1] "LGTM but please communicate this widely and in particular to DCOps" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/645131 (owner: 10Ayounsi) [18:48:55] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:50:10] (03CR) 10Bstorm: [C: 03+2] wikireplicas: extend maintain_dbusers to multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/644952 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [19:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201203T1900) [19:00:04] Seddon: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:29] I am around [19:00:42] (03CR) 10RobH: [C: 03+2] adding in db115[1-5] mac address entries [puppet] - 10https://gerrit.wikimedia.org/r/645130 (https://phabricator.wikimedia.org/T267043) (owner: 10RobH) [19:01:07] I can deploy today! [19:01:09] hi Seddon [19:01:22] Hey Urbanecm! 
[19:01:46] (03CR) 10Urbanecm: [C: 03+2] Undeploy graphoid for phase 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645125 (https://phabricator.wikimedia.org/T259207) (owner: 10Seddon) [19:02:35] (03Merged) 10jenkins-bot: Undeploy graphoid for phase 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645125 (https://phabricator.wikimedia.org/T259207) (owner: 10Seddon) [19:03:39] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) [19:04:18] Seddon: can you please check at mwdebug1001? [19:05:23] (03PS1) 10Jbond: standard: remove invalid dependencies [puppet] - 10https://gerrit.wikimedia.org/r/645133 [19:07:01] (03CR) 10Jbond: [C: 03+2] standard: remove invalid dependencies [puppet] - 10https://gerrit.wikimedia.org/r/645133 (owner: 10Jbond) [19:07:53] Urbanecm: I've confirmed that the deployment has worked correctly on de.wikipedia.org [19:07:58] good, thanks! [19:08:00] syncing then [19:09:25] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 06d9e8d3081de457974e4e95fada0a502a634dd9: Undeploy graphoid for phase 3 wikis (T259207) (duration: 01m 08s) [19:09:31] Seddon: done [19:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:33] anything else? [19:09:33] T259207: Undeploy graphoid for phase 3 wiki's - https://phabricator.wikimedia.org/T259207 [19:10:07] Urbanecm: nope that should be it. 
If I've broken anything I'll let someone know [19:10:19] okay, good :) [19:10:28] (03PS2) 10Jbond: DO NOT MERGE: change to check the difference paralle_spec makes [puppet] - 10https://gerrit.wikimedia.org/r/645112 [19:10:31] (03PS3) 10Jbond: DO NOT MERGE: change to check the difference paralle_spec makes [puppet] - 10https://gerrit.wikimedia.org/r/645113 [19:10:37] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=parse2001.codfw.wmnet [19:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:12] !log depooling parse2001 and repeating auto-reimage to see if ferm issue is repeatable (T268524) [19:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:20] T268524: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 [19:12:11] (03PS1) 10Urbanecm: Enable NewUserMessage for ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645134 (https://phabricator.wikimedia.org/T269290) [19:12:27] (03CR) 10Urbanecm: [C: 03+2] Enable NewUserMessage for ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645134 (https://phabricator.wikimedia.org/T269290) (owner: 10Urbanecm) [19:13:20] (03Merged) 10jenkins-bot: Enable NewUserMessage for ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645134 (https://phabricator.wikimedia.org/T269290) (owner: 10Urbanecm) [19:13:36] (03PS1) 10Urbanecm: Kurdish Wiktionary: Add WF namespace alias to NS_PROJECT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645135 (https://phabricator.wikimedia.org/T269319) [19:14:26] (03PS2) 10Urbanecm: Kurdish Wiktionary: Add WF namespace alias to NS_PROJECT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645135 (https://phabricator.wikimedia.org/T269319) [19:14:29] (03CR) 10Urbanecm: [C: 03+2] Kurdish Wiktionary: Add WF namespace alias to NS_PROJECT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645135 (https://phabricator.wikimedia.org/T269319) (owner: 
10Urbanecm) [19:14:44] (03CR) 10Ottomata: [C: 03+1] icinga: add users razzi/Razzi and elukey to run commands [puppet] - 10https://gerrit.wikimedia.org/r/645120 (owner: 10Elukey) [19:14:49] (03PS1) 10Mforns: analytics::refinery::job::data_purge.pp: remove netflow from event automatic deletion [puppet] - 10https://gerrit.wikimedia.org/r/645137 [19:14:55] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: b7b946a64ba4dc0121732ca48699a897718f4584: Enable NewUserMessage for ptwiki (T269290) (duration: 01m 08s) [19:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:03] T269290: Enable NewUserMessage extension in ptwiki - https://phabricator.wikimedia.org/T269290 [19:15:11] stanglavine: ^^enjoy^^ [19:15:26] (03PS1) 10Jbond: profile: migrate to shared spec_test [puppet] - 10https://gerrit.wikimedia.org/r/645147 [19:15:34] (03PS2) 10Jbond: profile: migrate to shared spec_test [puppet] - 10https://gerrit.wikimedia.org/r/645147 [19:15:40] (03Merged) 10jenkins-bot: Kurdish Wiktionary: Add WF namespace alias to NS_PROJECT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645135 (https://phabricator.wikimedia.org/T269319) (owner: 10Urbanecm) [19:15:45] Urbanecm: yeaaaahh :P [19:16:19] (03CR) 10jerkins-bot: [V: 04-1] analytics::refinery::job::data_purge.pp: remove netflow from event automatic deletion [puppet] - 10https://gerrit.wikimedia.org/r/645137 (owner: 10Mforns) [19:17:53] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=parse2001.codfw.wmnet [19:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:05] jouncebot: now [19:18:06] For the next 0 hour(s) and 41 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201203T1900) [19:18:20] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 6be070c6fdc4a80954d91c2d62dab5368260c5aa: Kurdish Wiktionary: Add WF namespace alias to NS_PROJECT 
(T269319) (duration: 01m 08s) [19:18:24] eh.. i did not mean to depool a server during scap but I think it's ok [19:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:27] T269319: Kurdish Wiktionary: Add WF namespace alias to NS 4 - https://phabricator.wikimedia.org/T269319 [19:18:39] (03CR) 10Ottomata: [C: 03+1] disable-puppet: make message mandatory [puppet] - 10https://gerrit.wikimedia.org/r/645104 (owner: 10Jbond) [19:18:45] (03PS2) 10Mforns: analytics::refinery::data_purge.pp: remove netflow automatic deletion [puppet] - 10https://gerrit.wikimedia.org/r/645137 [19:18:46] mutante: hopefully I/you didn't break anything unintentionally :) [19:19:00] I'm done for now anyway, fwiw [19:19:01] Urbanecm: parse2001 if any warning [19:19:15] mutante: no, no warning on my side [19:19:17] but removing it should be fine [19:19:18] (03CR) 10Hashar: [C: 03+2] "Thank you!" [extensions/AbuseFilter] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644984 (https://phabricator.wikimedia.org/T269314) (owner: 10Jforrester) [19:19:19] ack,ty [19:19:54] hello [19:19:57] hi hashar [19:20:04] hi [19:20:15] I will take of deploying the AbuseFilter patch I have +2ed once the config window has completed [19:20:26] hashar: feel free to do it during the window, I'm done [19:20:26] (03PS2) 10Razzi: Add kafka-test1006 as start of test kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/644620 (https://phabricator.wikimedia.org/T268202) [19:20:32] ah perfect [19:20:33] (03PS3) 10Jbond: profile: migrate to shared spec_test [puppet] - 10https://gerrit.wikimedia.org/r/645147 [19:21:09] 10Operations, 10Parsoid, 10serviceops: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` parse2001.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020120319... 
[19:21:12] (03CR) 10Ottomata: [C: 03+2] analytics::refinery::data_purge.pp: remove netflow automatic deletion [puppet] - 10https://gerrit.wikimedia.org/r/645137 (owner: 10Mforns) [19:21:16] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:54] reinstalling parsoid canary server with buster to see if we can repeat an issue with ferm or not [19:23:22] !log mwscript namespaceDupes.php --wiki=kuwiktionary --fix (T269319) [19:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:31] T269319: Kurdish Wiktionary: Add WF namespace alias to NS 4 - https://phabricator.wikimedia.org/T269319 [19:24:22] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:30] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:33:31] (03CR) 10Razzi: [C: 03+1] disable-puppet: make message mandatory [puppet] - 10https://gerrit.wikimedia.org/r/645104 (owner: 10Jbond) [19:35:43] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install logstash203[345] - https://phabricator.wikimedia.org/T267420 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` logstash2035.codfw.wmnet ` The log can be found in `/var/log/wmf-auto... 
[19:35:55] (03PS1) 10Ahmon Dancy: Fix Step 0 reporting for update operation [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/645138 [19:36:24] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install logstash203[345] - https://phabricator.wikimedia.org/T267420 (10Papaul) [19:37:11] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:15] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:48] (03Merged) 10jenkins-bot: Use 'default' as default group when reading filters from history [extensions/AbuseFilter] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644984 (https://phabricator.wikimedia.org/T269314) (owner: 10Jforrester) [19:44:15] !log restart logstash kafka in eqiad - java updates [19:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:32] (03PS1) 10Milimetric: aqs: update mw history snapshot [puppet] - 10https://gerrit.wikimedia.org/r/645141 [19:45:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt1025 connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10Andrew) I just reimaged cloudvirt1026 and it's exhibiting this same behavior. 
[19:46:13] (03CR) 10Razzi: [V: 03+2] aqs: update mw history snapshot [puppet] - 10https://gerrit.wikimedia.org/r/645141 (owner: 10Milimetric) [19:46:15] (03CR) 10Razzi: [V: 03+2 C: 03+2] aqs: update mw history snapshot [puppet] - 10https://gerrit.wikimedia.org/r/645141 (owner: 10Milimetric) [19:46:47] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Hardware): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [19:47:36] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:47:57] (03PS4) 10Jbond: spec: add and use parallel sec which seems to give a boost to run time [puppet] - 10https://gerrit.wikimedia.org/r/645113 [19:51:05] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [19:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:24] I am going to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AbuseFilter/+/644984 ( poke Daimona DannyS712 if they are here ) [19:53:00] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:00] !log milimetric@deploy1001 Started restart [analytics/aqs/deploy@95d6432]: (no justification provided) [19:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:40] I am, for a couple of minutes more [19:55:41] !log milimetric@deploy1001 Started restart [analytics/aqs/deploy@95d6432]: (no justification provided) [19:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:04] !log milimetric@deploy1001 Started restart [analytics/aqs/deploy@95d6432]: (no justification provided) [19:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:26] Daimona: turns out the AbuseFilter issue had a repro case and 
I managed to check on mwdebug1001 :] all set! [19:56:33] Daimona: thank you for the patch / review etc! [19:56:41] Perfect :-) [19:56:46] Thank you [19:56:46] !log milimetric@deploy1001 Started restart [analytics/aqs/deploy@95d6432]: (no justification provided) [19:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:00] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` db1151.eqiad.wmnet ` The log can be found in `/var/log/wmf-a... [19:57:06] !log hashar@deploy1001 Synchronized php-1.36.0-wmf.20/extensions/AbuseFilter/includes/FilterLookup.php: Use 'default' as default group when reading filters from history - T269314 (duration: 01m 05s) [19:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:16] T269314: Argument 5 passed to MediaWiki\Extension\AbuseFilter\Filter\Specs::__construct() must be of the type string, null given, called in /srv/mediawiki/php-1.36.0-wmf.20/extensions/AbuseFilter/includes/FilterLookup.php on line 346 - https://phabricator.wikimedia.org/T269314 [19:57:33] !log milimetric@deploy1001 Started deploy [analytics/aqs/deploy@95d6432]: (no justification provided) [19:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:49] !log restart logstash kafka in codfw - java updates [19:57:52] !log milimetric@deploy1001 Finished deploy [analytics/aqs/deploy@95d6432]: (no justification provided) (duration: 00m 19s) [19:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:08] !log milimetric@deploy1001 Started deploy [analytics/aqs/deploy@95d6432]: (no justification provided) [19:58:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:30] !log milimetric@deploy1001 Finished deploy [analytics/aqs/deploy@95d6432]: (no justification provided) (duration: 00m 23s) [19:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:13] Daimona: all set :] [20:00:04] hashar and twentyafterfour: Your horoscope predicts another unfortunate Mediawiki train - European+American Version (secondary timeslot) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201203T2000). [20:01:13] (03PS1) 10Hashar: all wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645144 [20:01:15] (03CR) 10Hashar: [C: 03+2] all wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645144 (owner: 10Hashar) [20:02:00] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645144 (owner: 10Hashar) [20:02:11] (03PS3) 10Razzi: Add kafka-test1006 as start of test kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/644620 (https://phabricator.wikimedia.org/T268202) [20:02:35] (03PS4) 10Razzi: Add kafka-test1006 as start of test kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/644620 (https://phabricator.wikimedia.org/T268202) [20:03:10] (03CR) 10Ottomata: [C: 03+1] Add kafka-test1006 as start of test kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/644620 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [20:03:30] RECOVERY - Ensure local MW versions match expected deployment on deploy1002 is OK: OKAY: Not alerting due to fresh production wikiversions: 957 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [20:03:52] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.20 [20:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:20] (03CR) 10Kosta Harlan: [C: 03+1] Add 
tokens and users for 3 new k8s services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645064 (https://phabricator.wikimedia.org/T265893) (owner: 10Alexandros Kosiaris) [20:08:13] !log T269204 Re-imaging `wdqs2004` to upgrade it to buster: `sudo -i wmf-auto-reimage-host --conftool -p T269204 wdqs2004.codfw.wmnet` [20:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:21] T269204: Some wdqs metrics changed when switching to python3 - https://phabricator.wikimedia.org/T269204 [20:09:52] (03PS5) 10Jbond: spec: add and use parallel sec which seems to give a boost to run time [puppet] - 10https://gerrit.wikimedia.org/r/645113 [20:10:04] (03PS6) 10Jbond: spec: add and use parallel sec which seems to give a boost to run time [puppet] - 10https://gerrit.wikimedia.org/r/645113 [20:10:13] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [20:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:19] 10Operations, 10Parsoid, 10serviceops: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2001.codfw.wmnet'] ` and were **ALL** successful. [20:11:48] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install logstash203[345] - https://phabricator.wikimedia.org/T267420 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['logstash2035.codfw.wmnet'] ` and were **ALL** successful. 
[20:12:16] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:12] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install logstash203[345] - https://phabricator.wikimedia.org/T267420 (10Papaul) [20:14:55] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install logstash203[345] - https://phabricator.wikimedia.org/T267420 (10Papaul) 05Open→03Resolved @herron all yours [20:16:58] train looks fine \o/ [20:19:55] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1151.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1151.eqiad.wmnet'] ` [20:22:39] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) 20:19:50 | db1151.eqiad.wmnet | Unable to run wmf-auto-reimage-host: could not convert string to float: "Warning: Permanently added the ECDSA host key for IP... 
[20:24:20] (03CR) 10Razzi: [C: 03+2] Add kafka-test1006 as start of test kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/644620 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [20:24:28] (03PS5) 10Razzi: Add kafka-test1006 as start of test kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/644620 (https://phabricator.wikimedia.org/T268202) [20:26:08] !log ryankemper@cumin2001 START - Cookbook sre.hosts.downtime [20:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:20] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['db1152.eqiad.wmnet', 'db1153.eqiad.wmnet', 'db1154.eqiad.w... [20:26:22] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install logstash203[345] - https://phabricator.wikimedia.org/T267420 (10herron) Thanks @Papaul! [20:26:27] 10Puppet, 10Cloud-VPS: role::simplelamp takes ownership of all content in /etc/apache2/sites-enabled - https://phabricator.wikimedia.org/T169368 (10Dzahn) https://wikitech.wikimedia.org/wiki/Help:Cloud_VPS_project#How_to_automatically_setup_a_simple_LAMP_server_on_a_cloud_VPS_instance [20:28:09] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt1025 connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10ayounsi) The issue is related to ARP. And I suspect something is wrong with the NIC or the OS. Most of the ARP queries from hosts in t... 
[20:36:39] (03PS1) 10Volans: wmf-auto-reimage: fix hack for SSH warning [puppet] - 10https://gerrit.wikimedia.org/r/645168 (https://phabricator.wikimedia.org/T269187) [20:38:04] (03CR) 10jerkins-bot: [V: 04-1] wmf-auto-reimage: fix hack for SSH warning [puppet] - 10https://gerrit.wikimedia.org/r/645168 (https://phabricator.wikimedia.org/T269187) (owner: 10Volans) [20:38:29] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [20:38:29] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [20:38:31] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:41] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [20:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:18] (03PS2) 10Volans: wmf-auto-reimage: fix hack for SSH warning [puppet] - 10https://gerrit.wikimedia.org/r/645168 (https://phabricator.wikimedia.org/T269187) [20:39:44] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [20:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:32] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:53] (03PS1) 10Razzi: Add kafka_test and zookeeper_test clusters [puppet] - 10https://gerrit.wikimedia.org/r/645169 (https://phabricator.wikimedia.org/T268202) [20:42:21] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:03] !log kill slapd on serpens and restart it [20:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:13] RECOVERY - Labs LDAP 
on serpens is OK: LDAP OK - 0.101 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [20:44:17] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:55] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [20:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:27] (03CR) 10Dzahn: [C: 03+1] "lgmt, "employeeType: Full Time"" [puppet] - 10https://gerrit.wikimedia.org/r/645116 (https://phabricator.wikimedia.org/T269365) (owner: 10Ssingh) [20:51:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:51:58] (03CR) 10Ssingh: [C: 03+2] admin: add Genoveva Galarza to ldap_only_users (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/645116 (https://phabricator.wikimedia.org/T269365) (owner: 10Ssingh) [20:52:46] (03CR) 10Dzahn: [C: 03+1] "looks good to me. though I have to say I got the same impression from package descriptions before and they made me use s-nail instead ;P" [puppet] - 10https://gerrit.wikimedia.org/r/645071 (https://phabricator.wikimedia.org/T268725) (owner: 10Muehlenhoff) [20:53:37] 10Operations, 10ops-codfw: RMA failed codfw C7 switch - WMF6114 - https://phabricator.wikimedia.org/T267950 (10Papaul) Your scheduled delivery date has changed. Scheduled Delivery Date: Monday, 12/07/2020 [20:55:37] (03CR) 10Dzahn: "I'm really sorry but like Andre always says better to unassign than to hold on to it." 
[puppet] - 10https://gerrit.wikimedia.org/r/608163 (https://phabricator.wikimedia.org/T256536) (owner: 10Ladsgroup) [20:57:22] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Onboarding Genoveva, access request to ldap/wmf - https://phabricator.wikimedia.org/T269365 (10ssingh) @gengh: Access request merged; please let me know if there are any issues. Welcome! [20:59:20] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:59:52] (03PS1) 10Razzi: Add dummy files for kafka_test-eqiad_broker [labs/private] - 10https://gerrit.wikimedia.org/r/645171 [21:02:02] 10Operations: slapd fails to restart sometimes - https://phabricator.wikimedia.org/T269394 (10colewhite) [21:02:03] (03CR) 10RobH: [C: 03+1] "My +1 is really not worth much, since I only get the gist of what this is doing." [puppet] - 10https://gerrit.wikimedia.org/r/645168 (https://phabricator.wikimedia.org/T269187) (owner: 10Volans) [21:02:14] (03CR) 10Dzahn: "https://openstack-browser.toolforge.org/puppetclass/role::labs::ores::lb" [puppet] - 10https://gerrit.wikimedia.org/r/643117 (owner: 10Dzahn) [21:02:44] (03CR) 10Dzahn: [V: 04-1] "https://puppet-compiler.wmflabs.org/compiler1002/26910/ores-lb-03.ores.eqiad.wmflabs/change.ores-lb-03.ores.eqiad.wmflabs.err" [puppet] - 10https://gerrit.wikimedia.org/r/643117 (owner: 10Dzahn) [21:03:04] (03CR) 10RobH: [C: 03+2] "IRC sync, this should be a good change. I'm merging on behalf of volans as its quite late there." 
[puppet] - 10https://gerrit.wikimedia.org/r/645168 (https://phabricator.wikimedia.org/T269187) (owner: 10Volans) [21:05:20] (03CR) 10Cwhite: [C: 03+1] disable-puppet: make message mandatory [puppet] - 10https://gerrit.wikimedia.org/r/645104 (owner: 10Jbond) [21:05:35] !log running maintain-dbusers harvest-replicas to populate the user accounts on new wikireplicas servers [21:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:53] (03PS4) 10Dzahn: ores: move LB setup for cloud from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/643117 [21:06:25] (03CR) 10Ayounsi: [C: 03+1] "In case you need it." [puppet] - 10https://gerrit.wikimedia.org/r/645168 (https://phabricator.wikimedia.org/T269187) (owner: 10Volans) [21:07:36] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['db1151.eqiad.wmnet', 'db1152.eqiad.wmnet', 'db1153.eqiad.w... [21:08:03] (03CR) 10Dzahn: [C: 04-1] "still...Function lookup() did not find a value for the name 'role::labs::ores::lb::cache'" [puppet] - 10https://gerrit.wikimedia.org/r/643117 (owner: 10Dzahn) [21:08:20] (03CR) 10Razzi: [V: 03+2 C: 03+2] Add dummy files for kafka_test-eqiad_broker [labs/private] - 10https://gerrit.wikimedia.org/r/645171 (owner: 10Razzi) [21:10:11] (03PS5) 10Dzahn: ores: move LB setup for cloud from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/643117 [21:11:59] (03CR) 10jerkins-bot: [V: 04-1] ores: move LB setup for cloud from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/643117 (owner: 10Dzahn) [21:17:41] Where is it best to poke someone about a parser cache issue that may be related to this week's train? [21:18:28] https://phabricator.wikimedia.org/T269396 is the task that I created for it. 
[21:20:30] (03CR) 10Volans: [C: 04-1] "Much simpler, one block needs re-indentation, couple of questions/comments inline" (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/643979 (owner: 10Ayounsi) [21:20:58] JJMC89[m]: I think hashar is running the train this week. Adding T263186 as a parent of your bug will raise it as a train blocker [21:20:59] T263186: 1.36.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T263186 [21:21:35] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [21:21:38] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [21:21:38] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [21:21:39] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [21:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:42] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [21:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:54] Thanks. The train is already all out and that task is resolved. Add it anyway? [21:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:08] bd808: ^ [21:23:01] JJMC89[m]: sure. 
I'll go shout loudly in the releng channel that there may be a problem in parser cache too [21:23:38] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:23:40] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [21:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:27] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:45] 10Operations, 10Wikimedia-Mailing-lists: Please create testing-infrastructure mailing list - https://phabricator.wikimedia.org/T269327 (10ssingh) a:03ssingh [21:26:12] (03PS6) 10Dzahn: ores: move LB setup for cloud from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/643117 [21:26:39] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [21:26:43] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [21:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:46] (03PS1) 10Bstorm: wikireplicas: fix the harvest-replicas functionality [puppet] - 10https://gerrit.wikimedia.org/r/645173 (https://phabricator.wikimedia.org/T268312) [21:26:50] 10Operations, 10SRE-Access-Requests, 10Wikimedia-Mailing-lists: Please create testing-infrastructure mailing list - https://phabricator.wikimedia.org/T269327 (10ssingh) [21:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:30] (03PS2) 10Bstorm: wikireplicas: fix the harvest-replicas functionality [puppet] - 10https://gerrit.wikimedia.org/r/645173 (https://phabricator.wikimedia.org/T268312) [21:28:05] (03PS2) 10Razzi: Add kafka_test and zookeeper_test clusters [puppet] - 10https://gerrit.wikimedia.org/r/645169 (https://phabricator.wikimedia.org/T268202) [21:28:52] (03CR) 10Dzahn: [V: 03+1] "now it
works https://puppet-compiler.wmflabs.org/compiler1002/26913/ores-lb-03.ores.eqiad.wmflabs/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/643117 (owner: 10Dzahn) [21:29:17] (03CR) 10Dzahn: [V: 03+1 C: 03+2] ores: move LB setup for cloud from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/643117 (owner: 10Dzahn) [21:29:44] sees "+BATMAN" in puppet-merge :) *g* [21:30:51] (03CR) 10Dzahn: "confirmed noop on ores-lb-03" [puppet] - 10https://gerrit.wikimedia.org/r/643117 (owner: 10Dzahn) [21:32:37] (03CR) 10BryanDavis: [C: 03+1] wikireplicas: fix the harvest-replicas functionality [puppet] - 10https://gerrit.wikimedia.org/r/645173 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [21:33:18] (03CR) 10Dzahn: "Now I am thinking I should probably do an "ensure absent" first for the cron in this case. Otherwise all local puppetmasters in cloud migh" [puppet] - 10https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [21:35:26] (03PS8) 10Dzahn: puppetmaster: replace cron to remove old reports with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138) [21:36:15] (03CR) 10Bstorm: [C: 03+2] wikireplicas: fix the harvest-replicas functionality [puppet] - 10https://gerrit.wikimedia.org/r/645173 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [21:36:22] (03PS2) 10Tchanders: Add IPInfo extension config to InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644550 (https://phabricator.wikimedia.org/T260599) [21:36:24] (03PS2) 10Tchanders: Load IPInfo extension in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644551 (https://phabricator.wikimedia.org/T260599) [21:36:36] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` 
['db1155.eqiad.wmnet', 'db1151.eqiad.wmnet', 'db1153.eqiad.wmnet', 'db1152.eqiad.wmnet', 'db1154.eqiad.wmnet... [21:36:41] (03CR) 10jerkins-bot: [V: 04-1] Add IPInfo extension config to InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644550 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [21:36:45] (03CR) 10jerkins-bot: [V: 04-1] Load IPInfo extension in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644551 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [21:37:03] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=parse2001.codfw.wmnet [21:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:33] !log rolling back wmf.20 due to T269396 refs T263186 [21:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:42] T269396: Parser cache serving old results - https://phabricator.wikimedia.org/T269396 [21:39:42] T263186: 1.36.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T263186 [21:41:41] (03CR) 10ArielGlenn: [C: 03+1] "looks great to me!" [puppet] - 10https://gerrit.wikimedia.org/r/645056 (owner: 10Muehlenhoff) [21:43:55] (03PS1) 1020after4: All wikis (except testwikis) to 1.36.0-wmf.18 refs T269396 and T263186 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645178 [21:45:47] 10Operations, 10Parsoid, 10serviceops: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10Dzahn) What can I say.. it did not happen this time. Reimaged, ran puppet.. waited a bit, checked Icinga.. all green with the following exceptions: - Ensure local MW versions match expect... 
[21:45:52] (03PS3) 10Razzi: Add kafka_test and zookeeper_test clusters [puppet] - 10https://gerrit.wikimedia.org/r/645169 (https://phabricator.wikimedia.org/T268202) [21:46:04] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: rollback to wmf.18 [21:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:41] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26915/console" [puppet] - 10https://gerrit.wikimedia.org/r/645169 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [21:48:01] (03CR) 1020after4: "already deployed rollback" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645178 (owner: 1020after4) [21:48:22] (03CR) 1020after4: [C: 03+2] All wikis (except testwikis) to 1.36.0-wmf.18 refs T269396 and T263186 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645178 (owner: 1020after4) [21:48:29] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:49:44] (03CR) 10Dzahn: "Woohoo. .thanks for merging this:)" [puppet] - 10https://gerrit.wikimedia.org/r/622666 (owner: 10Dzahn) [21:50:10] (03Merged) 10jenkins-bot: All wikis (except testwikis) to 1.36.0-wmf.18 refs T269396 and T263186 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645178 (owner: 1020after4) [21:50:24] (03CR) 10Dzahn: [C: 03+2] Phabricator monthly email: [Hopefully] fix priority median calculation [puppet] - 10https://gerrit.wikimedia.org/r/644383 (https://phabricator.wikimedia.org/T269076) (owner: 10Aklapper) [21:50:59] razzi: you still have one change waiting on the master, by the way [21:51:12] mutante: ah whoops [21:51:28] no problem, in this case puppet-merge is really smart and lets me just skip it and still merge my own [21:51:46] mutante: ok cool. 
Merged mine [21:51:53] sometimes it does and sometimes it doesnt [21:51:54] thanks [21:58:17] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10Papaul) [22:01:06] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) [22:01:26] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) [22:02:43] (03CR) 10Razzi: [V: 03+1 C: 03+2] Add kafka_test and zookeeper_test clusters [puppet] - 10https://gerrit.wikimedia.org/r/645169 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [22:08:01] (03PS1) 10Cwhite: profile: identify network devices input [puppet] - 10https://gerrit.wikimedia.org/r/645181 (https://phabricator.wikimedia.org/T268806) [22:09:58] PROBLEM - Check systemd state on ms-be1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:15:24] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [22:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:48] !log restart elasticsearch on logstash1010 - gc issues [22:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:26] 10Operations, 10SRE-Access-Requests, 10Wikimedia-Mailing-lists: Please create testing-infrastructure mailing list - https://phabricator.wikimedia.org/T269327 (10ssingh) Hi @Jrbranaa: The list `testing-infrastructure` has been created and you should have received an email. The list info page is at https:/... 
[22:18:22] (03PS2) 10Cwhite: profile: identify network devices logging input [puppet] - 10https://gerrit.wikimedia.org/r/645181 (https://phabricator.wikimedia.org/T268806) [22:25:33] (03PS1) 10Jbond: spec test: various spec test fixes [puppet] - 10https://gerrit.wikimedia.org/r/645186 [22:28:13] (03PS1) 10Jbond: (WIP) spec test fixes [puppet] - 10https://gerrit.wikimedia.org/r/645187 [22:28:38] (03CR) 10Jbond: [C: 03+2] spec test: various spec test fixes [puppet] - 10https://gerrit.wikimedia.org/r/645186 (owner: 10Jbond) [22:28:41] (03CR) 10jerkins-bot: [V: 04-1] (WIP) spec test fixes [puppet] - 10https://gerrit.wikimedia.org/r/645187 (owner: 10Jbond) [22:33:54] (03Abandoned) 10Dzahn: cumin: remove stretch support and move python_version to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/636101 (owner: 10Dzahn) [22:34:34] (03PS1) 10Razzi: Add back test-eqiad zookeeper cluster [puppet] - 10https://gerrit.wikimedia.org/r/645188 (https://phabricator.wikimedia.org/T268202) [22:35:17] (03PS7) 10Jbond: spec: add and use parallel sec which seems to give a boost to run time [puppet] - 10https://gerrit.wikimedia.org/r/645113 [22:36:33] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "this is a duplicate of other changes meanwhile and would need heavy manual rebasing. 
it seems easier to just make new patches for anything" [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [22:37:18] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [22:37:55] (03Abandoned) 10Dzahn: mediawiki: replace hiera with lookup, add data types in all profiles [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [22:37:57] (03CR) 10RLazarus: [C: 03+1] "Looks good! Just small stuff, feel free to merge." (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [22:40:17] (03Restored) 10Dzahn: mediawiki: replace hiera with lookup, add data types in all profiles [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [22:41:07] PROBLEM - Kafka Broker Server #page on kafka-test1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [22:41:19] 👀 [22:41:25] that paged, but it's test? [22:41:30] seems bad to page for test server [22:41:41] shdubsh: related to what you were working on? ^ [22:41:42] yea, agree [22:41:52] rzl: nope [22:41:57] ack thanks [22:42:02] Can this be adjusted to remove paging if it is a test host? If it is not a test host, it needs to be renamed. [22:42:15] it's conflicting. [22:42:44] Ok, who do we need to track down for kafka info? [22:42:47] could be done via a regex in hiera [22:43:02] hey, maybe it's just coming up?
not sure about this host [22:43:02] like for labtest* [22:43:05] razzi: kafka-test something you're working on? [22:43:10] cuz test hostname but page should likely be treated as real until proven otherwise =] [22:43:31] it looks like it's a brand new host, yeah [22:43:32] shdubsh: indeed, testing stuff on it [22:43:55] https://phabricator.wikimedia.org/T268202 [22:44:22] ok, i sent the ACK code [22:45:22] razzi: could you use the downtime cookbook to set it to scheduled maintenance for a while [22:45:40] mutante: yeah [22:45:51] ideally we should do something in hieradata/regex.yaml like for labtest [22:46:01] then it gets excluded from paging if it has "test" in the name [22:46:05] razzi: thanks! [22:46:25] (03PS2) 10Jbond: (WIP) spec test fixes [puppet] - 10https://gerrit.wikimedia.org/r/645187 [22:46:34] razzi: also if you want we have a flag that can be set in hiera to disable monitoring on a host indefinitely, if it's just for testing right now. as a random example `hieradata/hosts/db1077.yaml:profile::base::notifications: disabled` [22:47:01] s/monitoring/alerting [22:47:21] __regex: !ruby/regexp /^labtest/ ... 
do_paging: false [22:48:07] (03CR) 10jerkins-bot: [V: 04-1] (WIP) spec test fixes [puppet] - 10https://gerrit.wikimedia.org/r/645187 (owner: 10Jbond) [22:49:36] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime [22:49:37] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:49:42] 10Operations, 10vm-requests, 10Patch-For-Review: Eq: 5 VM request for kafka-test-eqiad cluster - https://phabricator.wikimedia.org/T268202 (10ops-monitoring-bot) Icinga downtime for 40 days, 0:00:00 set by razzi@cumin1001 on 1 host(s) and their services with reason: new_install ` kafka-test1006.eqiad.wmnet ` [22:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:06] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) >>! In T267043#6651222, @LSobanski wrote: > @Cmjohnson Would it be possible to plan for racking 5 instead of 3 of the new hosts in one go? It would help us p... [23:00:05] 10Operations, 10conftool, 10serviceops, 10Datacenter-Switchover: Disable maintenance scripts via conftool - https://phabricator.wikimedia.org/T266717 (10RLazarus) Oh, that's a good idea! We'd set `maintenance_host` to the empty string before the switchover, so that no new jobs would start anywhere, then se... [23:01:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt1025 connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10faidon) Arzhel nerd-sniped me with this. It seems that all broadcast traffic destined for eno1np0 arrives, untagged, on eno2np1(!), h... 
[23:02:14] (03PS7) 10Dzahn: mediawiki: replace hiera with lookup, add data types in all profiles [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) [23:02:35] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: replace hiera with lookup, add data types in all profiles [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [23:04:23] RECOVERY - Check systemd state on ms-be1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:07:56] (03PS1) 10Jbond: wmflib: update rspec test [puppet] - 10https://gerrit.wikimedia.org/r/645198 [23:08:46] (03PS8) 10Dzahn: mediawiki: replace hiera with lookup, add data types in all profiles [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) [23:10:50] (03PS1) 10Cwhite: profile: make a logstash templates directory and relocate existing templates [puppet] - 10https://gerrit.wikimedia.org/r/645200 (https://phabricator.wikimedia.org/T234565) [23:10:55] (03CR) 10Jbond: [C: 03+2] wmflib: update rspec test [puppet] - 10https://gerrit.wikimedia.org/r/645198 (owner: 10Jbond) [23:12:27] (03PS8) 10Jbond: spec: add and use parallel sec which seems to give a boost to run time [puppet] - 10https://gerrit.wikimedia.org/r/645113 [23:12:44] (03PS4) 10Jbond: profile: migrate to shared spec_test [puppet] - 10https://gerrit.wikimedia.org/r/645147 [23:14:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt1025 connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10Andrew) Thank you @faidon! This is extremely strange. 
[23:16:17] (03PS3) 10Jbond: (WIP) spec test fixes [puppet] - 10https://gerrit.wikimedia.org/r/645187 [23:17:02] (03CR) 10Andrew Bogott: [C: 03+1] "Definite improvement" [puppet] - 10https://gerrit.wikimedia.org/r/645104 (owner: 10Jbond) [23:18:04] (03CR) 10jerkins-bot: [V: 04-1] (WIP) spec test fixes [puppet] - 10https://gerrit.wikimedia.org/r/645187 (owner: 10Jbond) [23:21:19] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/26916/" [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [23:25:11] (03PS1) 10Dzahn: realm: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/645202 [23:25:37] (03CR) 10jerkins-bot: [V: 04-1] realm: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/645202 (owner: 10Dzahn) [23:25:41] (03PS2) 10Dzahn: realm: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/645202 (https://phabricator.wikimedia.org/T209953) [23:26:05] (03CR) 10jerkins-bot: [V: 04-1] realm: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/645202 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [23:30:54] (03PS1) 10Dzahn: query_service/updater: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/645203 (https://phabricator.wikimedia.org/T209953) [23:32:25] (03CR) 10jerkins-bot: [V: 04-1] query_service/updater: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/645203 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [23:37:49] (03CR) 10Dzahn: "what..can't use the {} in default_value inside an if-block? 
by the way, i think there shouldn't even be a host name comparison in here, th" [puppet] - 10https://gerrit.wikimedia.org/r/645202 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [23:39:18] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [23:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:06] (03PS2) 10Dzahn: query_service/updater: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/645203 (https://phabricator.wikimedia.org/T209953) [23:44:42] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [23:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:05] PROBLEM - SSH on ms-be1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:52:59] RECOVERY - SSH on ms-be1030 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:53:23] (03PS1) 10Dzahn: ntp: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/645206 [23:54:49] (03CR) 10jerkins-bot: [V: 04-1] ntp: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/645206 (owner: 10Dzahn)