[00:03:44] (CR) Effie Mouzeli: Added new role::redis::misc for general purposes redis servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/467734 (https://phabricator.wikimedia.org/T206450) (owner: Effie Mouzeli)
[00:05:09] (CR) Effie Mouzeli: Added new role::redis::misc for general purposes redis servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/467734 (https://phabricator.wikimedia.org/T206450) (owner: Effie Mouzeli)
[00:21:27] RECOVERY - High lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 1082 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[00:22:36] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[00:22:46] RECOVERY - MariaDB Slave Lag: s7 on db2095 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[00:22:47] RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[00:22:56] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[00:23:06] RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 0.08 seconds
[00:23:16] RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[00:23:16] RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[00:23:17] RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 0.34 seconds
[00:31:27] PROBLEM - Check systemd state on ms-be1019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:20:32] * Krinkle staging on mwdebug1001
[01:38:11] (PS1) Herron: confluent::kafka::common: dont use enable => 'mask' on jessie hosts [puppet] - https://gerrit.wikimedia.org/r/468498 (https://phabricator.wikimedia.org/T206454)
[01:38:53] (CR) jerkins-bot: [V: -1] confluent::kafka::common: dont use enable => 'mask' on jessie hosts [puppet] - https://gerrit.wikimedia.org/r/468498 (https://phabricator.wikimedia.org/T206454) (owner: Herron)
[01:40:38] (PS2) Herron: confluent::kafka::common: dont use enable => 'mask' on jessie hosts [puppet] - https://gerrit.wikimedia.org/r/468498 (https://phabricator.wikimedia.org/T206454)
[01:44:31] (CR) Krinkle: "@Tim, @Joe: This is also meant to set a precedent for other early and/or non-mw stuff, such as profiler.php. There's lots of different way" [mediawiki-config] - https://gerrit.wikimedia.org/r/467239 (https://phabricator.wikimedia.org/T206963) (owner: Krinkle)
[01:49:32] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.26/extensions/WikimediaEvents/includes/WikimediaEventsHooks.php: Ic74a9d5601b8c (duration: 00m 55s)
[01:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:01:21] Operations, Traffic, HTTPS, Upstream: Enable ESNI support on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (Shizhao) Firefox Nightly now supports ESNI: https://blog.mozilla.org/security/2018/10/18/encrypted-sni-comes-to-firefox-nightly/
[03:34:16] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 867.16 seconds
[03:43:50] Operations, Traffic, HTTPS, Upstream: Enable ESNI support on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (Bawolff) >>! In T205378#4613684, @Aklapper wrote: > @Shizhao: Is this a [[ https://www.mediawiki.org/wiki/How_to_report_a_bug | feature request ]]? Currently it looks like a...
[03:54:16] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 255.77 seconds
[03:57:08] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.200 second response time
[03:59:39] ACKNOWLEDGEMENT - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.200 second response time andrew bogott Im moving worker nodes around and didnt realize that would cause a page. Should recover on its own.
[04:37:58] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.163 second response time
[05:11:07] Operations, ops-codfw, DBA: Degraded RAID on db2051 - https://phabricator.wikimedia.org/T207212 (Marostegui) Open>Resolved So this is what I meant and why I re-opened the task: ``` root@db2051:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337F5E...
[05:25:28] !log Deploy schema change on s1 codfw host by host without replication - T204006
[05:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:32] T204006: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006
[05:43:53] (PS1) Elukey: profile::hadoop::master: fix threshold for alarm [puppet] - https://gerrit.wikimedia.org/r/468502
[05:45:08] (CR) Elukey: [C: 2] profile::hadoop::master: fix threshold for alarm [puppet] - https://gerrit.wikimedia.org/r/468502 (owner: Elukey)
[05:58:19] !log Deploy schema change on s2 codfw host by host without replication - T204006
[05:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:22] T204006: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006
[06:13:46] !log Deploy schema change on s7 codfw host by host without replication - T204006
[06:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:50] T204006: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006
[06:22:10] (PS1) Elukey: hadoop::master|stanby: avoid hardcoded thresholds for jvm alarms [puppet] - https://gerrit.wikimedia.org/r/468503
[06:23:13] (CR) Elukey: [C: 2] hadoop::master|stanby: avoid hardcoded thresholds for jvm alarms [puppet] - https://gerrit.wikimedia.org/r/468503 (owner: Elukey)
[06:29:04] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apparmor.d/abstractions/ssl_certs]
[06:41:03] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.6517 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[06:41:32] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.3919 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[06:44:23] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[06:44:43] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[06:59:43] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:03:19] Operations, Analytics, Analytics-Kanban, netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (ayounsi) a:ayounsi
[07:05:58] !log bump /proc/sys/net/core/rmem_default temporarily to 1MB and bounce statsd-proxy statsite-instances on graphite1004 - T196484
[07:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:02] T196484: rack/setup/install graphite1004 - https://phabricator.wikimedia.org/T196484
[07:07:23] (PS1) Elukey: Move Analytics JVM's alarms away from fixed thresholds [puppet] - https://gerrit.wikimedia.org/r/468506
[07:08:20] (CR) Elukey: [C: 2] Move Analytics JVM's alarms away from fixed thresholds [puppet] - https://gerrit.wikimedia.org/r/468506 (owner: Elukey)
[07:11:38] PROBLEM - pdfrender on scb1001 is CRITICAL: HTTP CRITICAL - No data received from host
[07:12:39] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[07:13:33] (PS2) Muehlenhoff: Use auto_ferm to properly restrict to rsyncd on stat hosts [puppet] - https://gerrit.wikimedia.org/r/467973
[07:14:42] (CR) Muehlenhoff: [C: 2] Use auto_ferm to properly restrict to rsyncd on stat hosts [puppet] - https://gerrit.wikimedia.org/r/467973 (owner: Muehlenhoff)
[07:14:49] RECOVERY - Check systemd state on ms-be1019 is OK: OK - running: The system is fully operational
[07:15:18] !log powercycle ms-be1021, [19601329.556259] sd 0:1:0:1: rejecting I/O to offline device
[07:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:39] (PS2) Marostegui: mariadb: remove mwmaint1001 from prod-m5 SQL grants [puppet] - https://gerrit.wikimedia.org/r/465685 (https://phabricator.wikimedia.org/T201343) (owner: Dzahn)
[07:19:22] (CR) Marostegui: [C: 2] mariadb: remove mwmaint1001 from prod-m5 SQL grants [puppet] - https://gerrit.wikimedia.org/r/465685 (https://phabricator.wikimedia.org/T201343) (owner: Dzahn)
[07:19:49] PROBLEM - Host ms-be1021 is DOWN: PING CRITICAL - Packet loss = 100%
[07:19:50] (PS2) Muehlenhoff: Use auto_ferm for eventlogging rsync module [puppet] - https://gerrit.wikimedia.org/r/467974
[07:20:25] !og Remove mwmaint1001 grants from m5 - https://phabricator.wikimedia.org/T201343 https://phabricator.wikimedia.org/T192457
[07:20:35] !log Remove mwmaint1001 grants from m5 - https://phabricator.wikimedia.org/T201343 https://phabricator.wikimedia.org/T192457
[07:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:15] (CR) Muehlenhoff: [C: 2] Use auto_ferm for eventlogging rsync module [puppet] - https://gerrit.wikimedia.org/r/467974 (owner: Muehlenhoff)
[07:21:39] RECOVERY - swift-container-updater on ms-be1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[07:21:48] RECOVERY - Host ms-be1021 is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms
[07:21:49] RECOVERY - Check systemd state on ms-be1021 is OK: OK - running: The system is fully operational
[07:22:19] RECOVERY - Disk space on ms-be1021 is OK: DISK OK
[07:22:19] RECOVERY - MD RAID on ms-be1021 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[07:24:33] (PS2) Muehlenhoff: Use auto_ferm for hdfs-archive rsyncd module [puppet] - https://gerrit.wikimedia.org/r/467977
[07:25:59] RECOVERY - puppet last run on ms-be1021 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:27:29] (CR) Muehlenhoff: [C: 2] Use auto_ferm for hdfs-archive rsyncd module [puppet] - https://gerrit.wikimedia.org/r/467977 (owner: Muehlenhoff)
[07:31:21] Operations, ops-eqiad: Degraded RAID on ms-be1021 - https://phabricator.wikimedia.org/T207399 (fgiunchedi) Open>Invalid In this case the controller freaked out, after a reboot the raids are clean: ``` ms-be1021:~$ cat /proc/mdstat Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [ra...
[07:44:29] (PS3) Elukey: role::eventlogging::analytics::files: lower down retention [puppet] - https://gerrit.wikimedia.org/r/467648 (https://phabricator.wikimedia.org/T206542)
[07:48:21] (PS4) Elukey: role::eventlogging::analytics::files: lower down retention [puppet] - https://gerrit.wikimedia.org/r/467648 (https://phabricator.wikimedia.org/T206542)
[07:48:47] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:49:02] (CR) Elukey: [C: 2] role::eventlogging::analytics::files: lower down retention [puppet] - https://gerrit.wikimedia.org/r/467648 (https://phabricator.wikimedia.org/T206542) (owner: Elukey)
[07:50:21] !log bump /proc/sys/net/core/rmem_default temporarily to 2MB and bounce statsd-proxy statsite-instances on graphite1004 - T196484
[07:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:26] T196484: rack/setup/install graphite1004 - https://phabricator.wikimedia.org/T196484
[07:53:59] Operations, ops-eqiad, Datacenter-Switchover-2018, Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (Marostegui) >>! In T201343#4659214, @Dzahn wrote: > Yes. > > mariadb: remove mwmaint1001 from prod-m5 SQL grants - https://gerrit.wikimedia...
[08:14:05] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[08:15:56] (PS1) Elukey: Smooth Analytics' JVM alarms over the past hour to avoid icinga noise [puppet] - https://gerrit.wikimedia.org/r/468528
[08:16:59] (CR) Elukey: [C: 2] Smooth Analytics' JVM alarms over the past hour to avoid icinga noise [puppet] - https://gerrit.wikimedia.org/r/468528 (owner: Elukey)
[08:19:18] (PS7) Mathew.onipe: scap::target: added additional_services_names param [puppet] - https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314)
[08:23:46] (CR) Gehel: [C: -1] scap::target: added additional_services_names param (1 comment) [puppet] - https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: Mathew.onipe)
[08:28:52] !log stopping db1092 and db1087 in sync
[08:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:13] (PS8) Mathew.onipe: scap::target: added additional_services_names param [puppet] - https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314)
[08:37:02] (CR) Banyek: [C: 2] wmf-pt-kill: logrotate feature added [debs/wmf-pt-kill] - https://gerrit.wikimedia.org/r/466886 (https://phabricator.wikimedia.org/T206521) (owner: Banyek)
[08:37:13] (CR) Banyek: [V: 2 C: 2] wmf-pt-kill: logrotate feature added [debs/wmf-pt-kill] - https://gerrit.wikimedia.org/r/466886 (https://phabricator.wikimedia.org/T206521) (owner: Banyek)
[08:38:32] Operations, monitoring, Patch-For-Review, User-fgiunchedi: rack/setup/install graphite1004 - https://phabricator.wikimedia.org/T196484 (fgiunchedi) Setting a 2MB socket receive buffer has helped getting errors down to ~0, unfortunately `statsd-proxy` nor `statsite` support setting `SO_RCVBUF` soc...
[08:40:27] (PS1) Hashar: admin: add ssh key for Antoine Musso [puppet] - https://gerrit.wikimedia.org/r/468532
[08:40:40] (PS1) Elukey: Fix analytics JVM alarms (missing left parenthesis) [puppet] - https://gerrit.wikimedia.org/r/468533
[08:41:03] (CR) Hashar: "I am not sure how to confirm I am who I claim to be :-)" [puppet] - https://gerrit.wikimedia.org/r/468532 (owner: Hashar)
[08:41:27] (CR) Elukey: [C: 2] Fix analytics JVM alarms (missing left parenthesis) [puppet] - https://gerrit.wikimedia.org/r/468533 (owner: Elukey)
[08:51:42] (CR) Mathew.onipe: "puppet compiler output:" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: Mathew.onipe)
[08:53:47] !log adding wmf-pt-kill_2.2.20-1+wmf4 package for stretch (T206521)
[08:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:51] T206521: Solve logrotating on wmf-pt-kill - https://phabricator.wikimedia.org/T206521
[09:10:31] (PS1) Elukey: profile::hadoop::worker: fix yarn prometheus alert [puppet] - https://gerrit.wikimedia.org/r/468535
[09:14:26] (CR) Elukey: [C: 2] mcrouter: allow to tune server timeout and timeouts until tko [puppet] - https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) (owner: Elukey)
[09:14:37] (CR) Elukey: "wrong one!" [puppet] - https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) (owner: Elukey)
[09:15:23] (PS3) Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786)
[09:15:25] (CR) Elukey: [C: 2] profile::hadoop::worker: fix yarn prometheus alert [puppet] - https://gerrit.wikimedia.org/r/468535 (owner: Elukey)
[09:20:27] (CR) Alex Monk: "could log in somewhere with the current key and write a file saying this is you?" [puppet] - https://gerrit.wikimedia.org/r/468532 (owner: Hashar)
[09:37:05] !log bump /proc/sys/net/core/rmem_default temporarily to 6MB and bounce statsd-proxy statsite-instances on graphite1004 - T196484
[09:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:09] T196484: rack/setup/install graphite1004 - https://phabricator.wikimedia.org/T196484
[09:38:42] (PS1) Giuseppe Lavagetto: pybal::web: convert to puppet 4.x [puppet] - https://gerrit.wikimedia.org/r/468536
[09:39:33] (CR) jerkins-bot: [V: -1] pybal::web: convert to puppet 4.x [puppet] - https://gerrit.wikimedia.org/r/468536 (owner: Giuseppe Lavagetto)
[09:41:47] Operations, monitoring, Patch-For-Review, User-fgiunchedi: rack/setup/install graphite1004 - https://phabricator.wikimedia.org/T196484 (fgiunchedi) Tried 6MB per thread now: we're ingesting about 30MB/s of udp traffic, with 4 statsd-proxy threads each should be able to buffer its share of bandwid...
[09:47:15] (PS5) Effie Mouzeli: Added new role::redis::misc for general purposes redis servers [puppet] - https://gerrit.wikimedia.org/r/467734 (https://phabricator.wikimedia.org/T206450)
[09:48:14] (CR) Vgutierrez: [C: 2] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/13084/" [puppet] - https://gerrit.wikimedia.org/r/468315 (https://phabricator.wikimedia.org/T199711) (owner: Alex Monk)
[09:48:23] (PS3) Vgutierrez: certcentral: Add first domain for testing in prod [puppet] - https://gerrit.wikimedia.org/r/468315 (https://phabricator.wikimedia.org/T199711) (owner: Alex Monk)
[09:48:34] (CR) Effie Mouzeli: Added new role::redis::misc for general purposes redis servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/467734 (https://phabricator.wikimedia.org/T206450) (owner: Effie Mouzeli)
[09:48:44] (Abandoned) Fdans: Add change_tag to the list of tables to sqoop in cron [puppet] - https://gerrit.wikimedia.org/r/467320 (https://phabricator.wikimedia.org/T205940) (owner: Fdans)
[09:55:55] (PS1) Alex Monk: certcentral: Notify certcentral backend service on config change [puppet] - https://gerrit.wikimedia.org/r/468537
[09:56:52] PROBLEM - Check systemd state on certcentral1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:57:09] ^^ that's me
[09:59:03] RECOVERY - Check systemd state on certcentral1001 is OK: OK - running: The system is fully operational
[10:02:10] (PS2) Giuseppe Lavagetto: pybal::web: convert to puppet 4.x [puppet] - https://gerrit.wikimedia.org/r/468536
[10:04:24] (PS1) Alex Monk: certcentral: don't muck with whitespace after SNI entries [puppet] - https://gerrit.wikimedia.org/r/468538
[10:09:52] (PS2) Alex Monk: certcentral: Notify certcentral backend service on config change [puppet] - https://gerrit.wikimedia.org/r/468537
[10:09:54] (PS2) Alex Monk: certcentral: don't muck with whitespace after SNI entries [puppet] - https://gerrit.wikimedia.org/r/468538
[10:14:41] (PS3) Alex Monk: certcentral: don't muck with whitespace after SNI entries [puppet] - https://gerrit.wikimedia.org/r/468538
[10:14:43] (PS1) Alex Monk: certcentral hiera style: quote all cert details [puppet] - https://gerrit.wikimedia.org/r/468539
[10:16:09] (PS3) Giuseppe Lavagetto: pybal::web: convert to puppet 4.x [puppet] - https://gerrit.wikimedia.org/r/468536
[10:19:57] (PS4) Giuseppe Lavagetto: pybal::web: convert to puppet 4.x [puppet] - https://gerrit.wikimedia.org/r/468536
[10:19:59] (PS6) Effie Mouzeli: Added new role::redis::misc for general purposes redis servers [puppet] - https://gerrit.wikimedia.org/r/467734 (https://phabricator.wikimedia.org/T206450)
[10:20:50] (PS2) Alex Monk: certcentral hiera style: quote all cert details [puppet] - https://gerrit.wikimedia.org/r/468539
[10:22:48] (CR) Mobrovac: [C: 1] "LGTM, PCC ok - https://puppet-compiler.wmflabs.org/compiler1002/13092/" [puppet] - https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: Mathew.onipe)
[10:24:22] (PS7) Effie Mouzeli: Added new role::redis::misc for general purposes redis servers [puppet] - https://gerrit.wikimedia.org/r/467734 (https://phabricator.wikimedia.org/T206450)
[10:28:54] (PS1) Elukey: Guard $net_topology_script_path since it can be undef [puppet/cdh] - https://gerrit.wikimedia.org/r/468541
[10:29:25] (CR) Vgutierrez: [C: 2] certcentral: Notify certcentral backend service on config change [puppet] - https://gerrit.wikimedia.org/r/468537 (owner: Alex Monk)
[10:29:36] (CR) Vgutierrez: [C: 2] certcentral: don't muck with whitespace after SNI entries [puppet] - https://gerrit.wikimedia.org/r/468538 (owner: Alex Monk)
[10:29:45] (CR) Vgutierrez: [C: 2] certcentral hiera style: quote all cert details [puppet] - https://gerrit.wikimedia.org/r/468539 (owner: Alex Monk)
[10:30:15] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/13097/" [puppet/cdh] - https://gerrit.wikimedia.org/r/468541 (owner: Elukey)
[10:30:22] (CR) Elukey: [C: 2] Guard $net_topology_script_path since it can be undef [puppet/cdh] - https://gerrit.wikimedia.org/r/468541 (owner: Elukey)
[10:32:14] (PS1) Elukey: Update cdh submodule to the lastest sha [puppet] - https://gerrit.wikimedia.org/r/468542
[10:33:38] Operations, Cloud-Services, Mail, User-herron, cloud-services-team (Kanban): Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261 (aborrero)
[10:33:47] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1002/13098/" [puppet] - https://gerrit.wikimedia.org/r/468542 (owner: Elukey)
[10:37:03] (PS1) Alex Monk: certcentral: Replace service unit file [puppet] - https://gerrit.wikimedia.org/r/468543
[10:37:51] (CR) Giuseppe Lavagetto: [C: 1] "A comment on the redis configuration, but the manifest LGTM." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/467734 (https://phabricator.wikimedia.org/T206450) (owner: Effie Mouzeli)
[10:38:21] (PS1) Alex Monk: certcentral: Set http_proxy environment when running backend service [puppet] - https://gerrit.wikimedia.org/r/468544
[10:41:03] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[10:41:28] * elukey shouts against mcrouter and memcached
[10:42:07] this is again the mc1035 issue
[10:42:12] going to recover soon
[10:42:50] there is a patch from Aaron that may fix/reduce this problem, will be rolled out next week
[10:44:23] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[10:48:49] (PS1) Mobrovac: Proton: Configure the fonts [puppet] - https://gerrit.wikimedia.org/r/468545 (https://phabricator.wikimedia.org/T199264)
[10:49:27] (PS1) Arturo Borrero Gonzalez: cloudvps: eqiad1: exclude SNAT between VMs [puppet] - https://gerrit.wikimedia.org/r/468546 (https://phabricator.wikimedia.org/T206261)
[10:49:37] (CR) jerkins-bot: [V: -1] Proton: Configure the fonts [puppet] - https://gerrit.wikimedia.org/r/468545 (https://phabricator.wikimedia.org/T199264) (owner: Mobrovac)
[10:50:27] (CR) Arturo Borrero Gonzalez: [C: 2] cloudvps: eqiad1: exclude SNAT between VMs [puppet] - https://gerrit.wikimedia.org/r/468546 (https://phabricator.wikimedia.org/T206261) (owner: Arturo Borrero Gonzalez)
[10:51:48] (PS2) Mobrovac: Proton: Configure the fonts [puppet] - https://gerrit.wikimedia.org/r/468545 (https://phabricator.wikimedia.org/T199264)
[10:52:51] (PS2) Alex Monk: certcentral: Set HTTP_PROXY/HTTPS_PROXY environment when running backend service [puppet] - https://gerrit.wikimedia.org/r/468544
[10:53:07] !log icinga downtime for 2h for clounet1003/1004 to deploy patch related to T206261
[10:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:11] T206261: Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261
[10:56:43] (CR) Mobrovac: "PCC OK - https://puppet-compiler.wmflabs.org/compiler1002/13101/" [puppet] - https://gerrit.wikimedia.org/r/468545 (https://phabricator.wikimedia.org/T199264) (owner: Mobrovac)
[10:56:52] Operations, Cloud-Services, Mail, Patch-For-Review, User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (aborrero)
[10:56:58] Operations, Cloud-Services, Mail, Patch-For-Review, and 2 others: Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261 (aborrero) Open>Resolved hey @herron it should be working now. This was my test: ``` aborrero@cloudinfra-puppetmas...
[10:57:46] (PS3) Alex Monk: certcentral: Set HTTP_PROXY/HTTPS_PROXY environment when running backend service [puppet] - https://gerrit.wikimedia.org/r/468544
[10:58:07] (PS1) Ladsgroup: Revert back wikidata for change_tag backend [mediawiki-config] - https://gerrit.wikimedia.org/r/468547 (https://phabricator.wikimedia.org/T207313)
[10:58:38] (CR) jerkins-bot: [V: -1] certcentral: Set HTTP_PROXY/HTTPS_PROXY environment when running backend service [puppet] - https://gerrit.wikimedia.org/r/468544 (owner: Alex Monk)
[10:59:49] jouncebot: now
[10:59:50] No deployments scheduled for the next 71 hour(s) and 30 minute(s)
[10:59:58] oh, of course, it is friday
[11:00:27] * addshore needs to and is gonna deploy https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/468547/ to fix an UBN on wikidata
[11:01:00] (CR) Mobrovac: [C: 1] "Tested in beta, works as advertised." [puppet] - https://gerrit.wikimedia.org/r/468545 (https://phabricator.wikimedia.org/T199264) (owner: Mobrovac)
[11:03:07] (PS4) Alex Monk: certcentral: Set HTTP_PROXY/HTTPS_PROXY environment when running backend service [puppet] - https://gerrit.wikimedia.org/r/468544
[11:04:01] (CR) jerkins-bot: [V: -1] certcentral: Set HTTP_PROXY/HTTPS_PROXY environment when running backend service [puppet] - https://gerrit.wikimedia.org/r/468544 (owner: Alex Monk)
[11:08:06] (PS5) Alex Monk: certcentral: Set HTTP_PROXY/HTTPS_PROXY environment when running backend service [puppet] - https://gerrit.wikimedia.org/r/468544
[11:09:58] (PS6) Alex Monk: certcentral: Set HTTP_PROXY/HTTPS_PROXY environment when running backend service [puppet] - https://gerrit.wikimedia.org/r/468544
[11:16:51] (PS2) Alex Monk: certcentral: Replace service unit file [puppet] - https://gerrit.wikimedia.org/r/468543
[11:16:53] (PS7) Alex Monk: certcentral: Set HTTP_PROXY/HTTPS_PROXY environment when running backend service [puppet] - https://gerrit.wikimedia.org/r/468544
[11:22:01] (CR) Vgutierrez: [C: 2] "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1002/13103/certcentral1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/468544 (owner: Alex Monk)
[11:22:09] (CR) Vgutierrez: [C: 2] certcentral: Replace service unit file [puppet] - https://gerrit.wikimedia.org/r/468543 (owner: Alex Monk)
[11:22:32] (PS3) Vgutierrez: certcentral: Replace service unit file [puppet] - https://gerrit.wikimedia.org/r/468543 (owner: Alex Monk)
[11:22:41] (PS8) Vgutierrez: certcentral: Set HTTP_PROXY/HTTPS_PROXY environment when running backend service [puppet] - https://gerrit.wikimedia.org/r/468544 (owner: Alex Monk)
[11:22:46] Operations, Cloud-VPS, netops, cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (aborrero) Does this still affect us? If so, which concrete subnets are affected?
[11:26:10] Operations, Cloud-VPS, netops, cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (Krenair) I can't seem to access eqiad1.bastion.wmflabs.org right now but: ```krenair@bastion-01:...
[11:27:27] (CR) Giuseppe Lavagetto: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1001/13105/puppetmaster1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/468536 (owner: Giuseppe Lavagetto)
[11:27:37] (PS5) Giuseppe Lavagetto: pybal::web: convert to puppet 4.x [puppet] - https://gerrit.wikimedia.org/r/468536
[11:29:07] * addshore is going to do the change now
[11:29:13] (CR) Addshore: [C: 2] Revert back wikidata for change_tag backend [mediawiki-config] - https://gerrit.wikimedia.org/r/468547 (https://phabricator.wikimedia.org/T207313) (owner: Ladsgroup)
[11:30:33] (Merged) jenkins-bot: Revert back wikidata for change_tag backend [mediawiki-config] - https://gerrit.wikimedia.org/r/468547 (https://phabricator.wikimedia.org/T207313) (owner: Ladsgroup)
[11:31:30] (PS1) Alex Monk: certcentral: List validation DNS servers as IPs instead of hostnames [puppet] - https://gerrit.wikimedia.org/r/468551
[11:31:56] (CR) jerkins-bot: [V: -1] certcentral: List validation DNS servers as IPs instead of hostnames [puppet] - https://gerrit.wikimedia.org/r/468551 (owner: Alex Monk)
[11:32:12] syncing
[11:32:42] (PS2) Alex Monk: certcentral: List validation DNS servers as IPs instead of hostnames [puppet] - https://gerrit.wikimedia.org/r/468551
[11:33:09] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T207313 UBN - Revert back wikidata for change_tag backend (duration: 00m 59s)
[11:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:12] T207313: Some administrative and log actions on Wikidata take longer than 60 seconds and time out - https://phabricator.wikimedia.org/T207313
[11:35:27] Operations, ops-eqiad, Operations-Software-Development: rack/setup/install clustermgmt1001.eqiad.wmnet (new cumin master) - https://phabricator.wikimedia.org/T201346 (Volans) Open>Resolved
[11:36:04] (PS8) Effie Mouzeli: Added new role::redis::misc for general purposes redis servers [puppet] - https://gerrit.wikimedia.org/r/467734 (https://phabricator.wikimedia.org/T206450)
[11:45:43] (CR) jenkins-bot: Revert back wikidata for change_tag backend [mediawiki-config] - https://gerrit.wikimedia.org/r/468547 (https://phabricator.wikimedia.org/T207313) (owner: Ladsgroup)
[11:46:37] !log starting compression of s4 tables @dbstore2002 (T204930)
[11:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:40] T204930: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930
[11:52:24] (PS1) Vgutierrez: dns_validation: Allow hostnames as DNS validation sync_dns_servers [software/certcentral] - https://gerrit.wikimedia.org/r/468554 (https://phabricator.wikimedia.org/T207457)
[11:53:04] (PS2) Vgutierrez: dns_validation: Allow hostnames as DNS validation servers [software/certcentral] - https://gerrit.wikimedia.org/r/468554 (https://phabricator.wikimedia.org/T207457)
[11:53:47] (CR) Gehel: [C: -1] scap::target: added additional_services_names param (1 comment) [puppet] - https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: Mathew.onipe)
[11:54:37] (PS3) Gehel: enable rspec tests [puppet] - https://gerrit.wikimedia.org/r/467692 (https://phabricator.wikimedia.org/T204240) (owner: Mathew.onipe)
[11:54:43] (CR) jerkins-bot: [V: -1] dns_validation: Allow hostnames as DNS validation servers [software/certcentral] - https://gerrit.wikimedia.org/r/468554 (https://phabricator.wikimedia.org/T207457) (owner: Vgutierrez)
[11:55:33] (CR) Gehel: [C: 2] enable rspec tests [puppet] - https://gerrit.wikimedia.org/r/467692 (https://phabricator.wikimedia.org/T204240) (owner: Mathew.onipe)
[11:58:21] (PS9) Effie Mouzeli: Added new role::redis::misc for general purposes redis servers [puppet] - https://gerrit.wikimedia.org/r/467734 (https://phabricator.wikimedia.org/T206450)
[11:59:17] Operations, Cloud-Services, Mail, Patch-For-Review, User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (aborrero)
[11:59:20] Operations, Cloud-Services, Mail, Patch-For-Review, and 2 others: Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261 (aborrero) Resolved>Open Reopening and reverting patch. I can confirm is causing at least 2 issues: 1) ssh issue to...
[11:59:36] (PS1) Arturo Borrero Gonzalez: Revert "cloudvps: eqiad1: exclude SNAT between VMs" [puppet] - https://gerrit.wikimedia.org/r/468555
[11:59:46] (PS2) Arturo Borrero Gonzalez: Revert "cloudvps: eqiad1: exclude SNAT between VMs" [puppet] - https://gerrit.wikimedia.org/r/468555
[12:00:46] (CR) Arturo Borrero Gonzalez: [C: 2] Revert "cloudvps: eqiad1: exclude SNAT between VMs" [puppet] - https://gerrit.wikimedia.org/r/468555 (owner: Arturo Borrero Gonzalez)
[12:07:53] (PS10) Effie Mouzeli: Added new role::redis::misc for general purposes redis servers [puppet] - https://gerrit.wikimedia.org/r/467734 (https://phabricator.wikimedia.org/T206450)
[12:08:04] (PS1) Elukey: profile::beta::autoupdater: move declaration of scap::users [puppet] - https://gerrit.wikimedia.org/r/468557
[12:08:48] (CR) jerkins-bot: [V: -1] profile::beta::autoupdater: move declaration of scap::users [puppet] - https://gerrit.wikimedia.org/r/468557 (owner: Elukey)
[12:09:08]
(03PS1) 10Mathew.onipe: elasticsearch_cluster: multi-cluster support [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T202885) [12:10:14] (03CR) 10Elukey: "of course this leads to:" [puppet] - 10https://gerrit.wikimedia.org/r/468557 (owner: 10Elukey) [12:11:23] (03PS1) 10Jgreen: add check_impression_logs to nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/468559 (https://phabricator.wikimedia.org/T125687) [12:11:44] (03Abandoned) 10Elukey: profile::beta::autoupdater: move declaration of scap::users [puppet] - 10https://gerrit.wikimedia.org/r/468557 (owner: 10Elukey) [12:12:15] (03CR) 10Jgreen: [C: 032] add check_impression_logs to nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/468559 (https://phabricator.wikimedia.org/T125687) (owner: 10Jgreen) [12:12:50] (03PS11) 10Effie Mouzeli: Added new role::redis::misc for general purposes redis servers [puppet] - 10https://gerrit.wikimedia.org/r/467734 (https://phabricator.wikimedia.org/T206450) [12:17:38] (03PS1) 10Elukey: profile::beta::autoupdater: remove declaration of scap::users [puppet] - 10https://gerrit.wikimedia.org/r/468563 [12:20:02] (03PS2) 10Elukey: profile::beta::autoupdater: remove declaration of scap::users [puppet] - 10https://gerrit.wikimedia.org/r/468563 [12:20:48] (03PS3) 10Elukey: profile::beta::autoupdater: remove declaration of scap::scripts [puppet] - 10https://gerrit.wikimedia.org/r/468563 [12:23:00] (03CR) 10Elukey: "Related to https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/385461/" [puppet] - 10https://gerrit.wikimedia.org/r/468563 (owner: 10Elukey) [12:23:24] (03PS4) 10Elukey: profile::beta::autoupdater: remove declaration of scap::scripts [puppet] - 10https://gerrit.wikimedia.org/r/468563 [12:23:26] (03PS3) 10Vgutierrez: dns_validation: Allow hostnames as DNS validation servers [software/certcentral] - 10https://gerrit.wikimedia.org/r/468554 (https://phabricator.wikimedia.org/T207457) [12:25:33] (03CR) 
10jerkins-bot: [V: 04-1] dns_validation: Allow hostnames as DNS validation servers [software/certcentral] - 10https://gerrit.wikimedia.org/r/468554 (https://phabricator.wikimedia.org/T207457) (owner: 10Vgutierrez) [12:27:34] (03PS4) 10Vgutierrez: dns_validation: Allow hostnames as DNS validation servers [software/certcentral] - 10https://gerrit.wikimedia.org/r/468554 (https://phabricator.wikimedia.org/T207457) [12:29:52] (03PS5) 10Elukey: scap::scripts: conditionally require mediawiki::users [puppet] - 10https://gerrit.wikimedia.org/r/468563 [12:36:18] 10Operations, 10ops-codfw: es2017 and es2019 have an idrac ethernet interface in Linux - https://phabricator.wikimedia.org/T207328 (10faidon) 05Open>03Resolved a:03faidon OK, you were right about the cause. I addressed the symptom, which was to go into iDRAC's web interface, and under Overview > iDRAC Se... [12:36:45] (03CR) 10Alex Monk: [C: 032] dns_validation: Allow hostnames as DNS validation servers [software/certcentral] - 10https://gerrit.wikimedia.org/r/468554 (https://phabricator.wikimedia.org/T207457) (owner: 10Vgutierrez) [12:37:04] (03PS1) 10Alex Monk: dns_validation: Allow hostnames as DNS validation servers [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/468564 (https://phabricator.wikimedia.org/T207457) [12:38:41] (03CR) 10Vgutierrez: [V: 032 C: 032] dns_validation: Allow hostnames as DNS validation servers [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/468564 (https://phabricator.wikimedia.org/T207457) (owner: 10Alex Monk) [12:39:43] (03PS1) 10Alex Monk: Package new version with latest commit [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/468565 [12:40:32] (03CR) 10jenkins-bot: dns_validation: Allow hostnames as DNS validation servers [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/468564 (https://phabricator.wikimedia.org/T207457) (owner: 10Alex Monk) [12:40:57] (03PS2) 10Alex Monk: Package new version with latest commit 
[software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/468565 [12:41:33] (03CR) 10Vgutierrez: [V: 032 C: 032] Package new version with latest commit [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/468565 (owner: 10Alex Monk) [12:43:40] (03CR) 10jenkins-bot: Package new version with latest commit [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/468565 (owner: 10Alex Monk) [12:45:01] (03PS1) 10Alex Monk: bump setup.py version [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/468567 [12:45:05] vgutierrez, ^ [12:45:36] Krenair: we need that in master as well [12:45:41] and cherry-picked to debian [12:45:45] not the other way around [12:46:41] (03PS1) 10Elukey: deployment-prep: add turnilo to scap repos on the deploy host [puppet] - 10https://gerrit.wikimedia.org/r/468568 [12:46:47] bah [12:46:48] Can cherry pick in the other direction... Hardly makes much difference :) [12:46:53] yeah [12:47:01] (03PS1) 10Alex Monk: bump setup.py version [software/certcentral] - 10https://gerrit.wikimedia.org/r/468569 [12:47:05] OCD matters :) [12:47:06] vgutierrez, ^ [12:47:08] <3 [12:47:09] thx [12:47:22] sigh.. my OCD is not happy with this one :) [12:47:44] (03CR) 10Vgutierrez: [C: 032] bump setup.py version [software/certcentral] - 10https://gerrit.wikimedia.org/r/468569 (owner: 10Alex Monk) [12:48:18] Can't you get gerrit to change the branch on the ps? [12:49:00] I could've pushed the same commit to refs/for/master [12:49:05] yeah.. 
the "Move change" button [12:50:27] (03Merged) 10jenkins-bot: bump setup.py version [software/certcentral] - 10https://gerrit.wikimedia.org/r/468569 (owner: 10Alex Monk) [12:52:33] (03CR) 10Vgutierrez: [C: 032] bump setup.py version [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/468567 (owner: 10Alex Monk) [12:52:40] You need to also rebase after moving branch [12:53:13] (03CR) 10jenkins-bot: bump setup.py version [software/certcentral] - 10https://gerrit.wikimedia.org/r/468569 (owner: 10Alex Monk) [12:53:19] (03PS1) 10BBlack: add reflect.wikimedia.org for debug [dns] - 10https://gerrit.wikimedia.org/r/468570 [12:53:33] (03CR) 10jerkins-bot: [V: 04-1] add reflect.wikimedia.org for debug [dns] - 10https://gerrit.wikimedia.org/r/468570 (owner: 10BBlack) [12:53:35] Per the red text I added [12:54:07] (03PS9) 10Mathew.onipe: scap::target: added additional_services_names param [puppet] - 10https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) [12:54:35] (03CR) 10jenkins-bot: bump setup.py version [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/468567 (owner: 10Alex Monk) [12:57:07] (03PS1) 10BBlack: authdns: add the reflection plugin [puppet] - 10https://gerrit.wikimedia.org/r/468571 [12:59:23] (03CR) 10BBlack: [C: 032] authdns: add the reflection plugin [puppet] - 10https://gerrit.wikimedia.org/r/468571 (owner: 10BBlack) [13:01:06] (03CR) 10jenkins-bot: dns_validation: Allow hostnames as DNS validation servers [software/certcentral] - 10https://gerrit.wikimedia.org/r/468554 (https://phabricator.wikimedia.org/T207457) (owner: 10Vgutierrez) [13:07:19] (03PS1) 10BBlack: GeoDNS: Add explicit entry for Wikia [dns] - 10https://gerrit.wikimedia.org/r/468572 [13:07:32] (03CR) 10jerkins-bot: [V: 04-1] GeoDNS: Add explicit entry for Wikia [dns] - 10https://gerrit.wikimedia.org/r/468572 (owner: 10BBlack) [13:07:57] (03CR) 10BBlack: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/468570 (owner: 10BBlack) 
[13:08:09] (03CR) 10jerkins-bot: [V: 04-1] add reflect.wikimedia.org for debug [dns] - 10https://gerrit.wikimedia.org/r/468570 (owner: 10BBlack) [13:09:18] (03PS2) 10BBlack: GeoDNS: Add explicit entry for Wikia [dns] - 10https://gerrit.wikimedia.org/r/468572 [13:09:21] (03PS2) 10BBlack: add reflect.wikimedia.org for debug [dns] - 10https://gerrit.wikimedia.org/r/468570 [13:09:32] (03CR) 10jerkins-bot: [V: 04-1] add reflect.wikimedia.org for debug [dns] - 10https://gerrit.wikimedia.org/r/468570 (owner: 10BBlack) [13:15:23] (03PS9) 10Gehel: wdqs: cleanup logback configuration [puppet] - 10https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) [13:21:18] (03CR) 10BBlack: [C: 032] GeoDNS: Add explicit entry for Wikia [dns] - 10https://gerrit.wikimedia.org/r/468572 (owner: 10BBlack) [13:21:53] (03CR) 10Hashar: "Good idea! I connected with the other keys and created a file on deploy1001.eqiad.wmnet : /home/hashar/proof.txt" [puppet] - 10https://gerrit.wikimedia.org/r/468532 (owner: 10Hashar) [13:24:10] (03CR) 10Ottomata: "Ah sorry I took out the conditional that was there before because I was using the ensure, but I was doing it wrong! 
I dont' need to condi" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/468541 (owner: 10Elukey) [13:28:37] (03CR) 10Filippo Giunchedi: [C: 04-1] "AFAIK we can't deprecate this yet as there isn't a replacement/equivalent for Prometheus (yet)" [puppet] - 10https://gerrit.wikimedia.org/r/468480 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [13:28:59] (03PS1) 10BBlack: authdns::lint: install latest package [puppet] - 10https://gerrit.wikimedia.org/r/468576 [13:29:06] 10Operations, 10SRE-Access-Requests: Additional ssh key for Antoine "hashar" Musso - https://phabricator.wikimedia.org/T207470 (10hashar) [13:29:40] (03PS2) 10Hashar: admin: add ssh key for Antoine Musso [puppet] - 10https://gerrit.wikimedia.org/r/468532 (https://phabricator.wikimedia.org/T207470) [13:30:10] (03CR) 10Hashar: "Per IRC conversation with SRE, that is handled via an access request task. I have filled T207470" [puppet] - 10https://gerrit.wikimedia.org/r/468532 (https://phabricator.wikimedia.org/T207470) (owner: 10Hashar) [13:30:34] (03CR) 10Ottomata: [C: 031] confluent::kafka::common: dont use enable => 'mask' on jessie hosts [puppet] - 10https://gerrit.wikimedia.org/r/468498 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [13:30:36] (03CR) 10BBlack: [C: 032] authdns::lint: install latest package [puppet] - 10https://gerrit.wikimedia.org/r/468576 (owner: 10BBlack) [13:32:43] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for nginx on debug proxies [puppet] - 10https://gerrit.wikimedia.org/r/466852 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:33:13] (03CR) 10Ottomata: "Hm, we likely only have about 7 days of logs from camus (if it has run yet). Should we wait until Camus has a good backup before we do th" [puppet] - 10https://gerrit.wikimedia.org/r/467648 (https://phabricator.wikimedia.org/T206542) (owner: 10Elukey) [13:34:20] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline, unrelated change?" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [13:36:34] (03CR) 10KartikMistry: "This should be deploy first and then apertium. hfst, hfst-ospell, cg3 after that and then apertium-separable and apertium-lex-tools which " [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/465932 (https://phabricator.wikimedia.org/T206439) (owner: 10KartikMistry) [13:36:48] (03CR) 10Filippo Giunchedi: [C: 031] Remove Diamond from DB roles [puppet] - 10https://gerrit.wikimedia.org/r/467264 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [13:39:13] (03PS2) 10Banyek: admin: Change banyek's .bash_profile [puppet] - 10https://gerrit.wikimedia.org/r/466901 [13:40:11] 10Operations, 10Traffic, 10Continuous-Integration-Config: CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10hashar) Here is the full context **Current** The `operations-dns-lint` job runs on **Jessie** WMCS instances, they are provisioned by puppet and eventu... [13:40:42] (03PS3) 10Herron: confluent::kafka::common: force provider => 'systemd' for services [puppet] - 10https://gerrit.wikimedia.org/r/468498 (https://phabricator.wikimedia.org/T206454) [13:41:07] (03CR) 10Filippo Giunchedi: [C: 031] "FYI memcached-exporter has been recently added by Elukey in Prometheus:" [puppet] - 10https://gerrit.wikimedia.org/r/466907 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [13:41:51] (03PS4) 10Herron: confluent::kafka::common: force provider => 'systemd' for services [puppet] - 10https://gerrit.wikimedia.org/r/468498 (https://phabricator.wikimedia.org/T206454) [13:43:45] herron: that's weird, it shouldn't happen [13:44:03] provider => systemd shouldn't be needed, even on jessie [13:44:03] the mask error? 
[13:44:52] it's the default [13:45:16] (03PS1) 10BBlack: Fix authdns-lint for 3.x [puppet] - 10https://gerrit.wikimedia.org/r/468578 [13:45:43] what happens without it is this: Error: /Stage[main]/Confluent::Kafka::Common/Service[confluent-kafka]/enable: change from false to mask failed: Could not set 'mask' on enable: undefined method `mask' for Service[confluent-kafka](provider=debian):Puppet::Type::Service::ProviderDebian at /etc/puppet/modules/confluent/manifests/kafka/common.pp:80 [13:46:26] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, I'll merge next week just in case" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/454644 (owner: 10EBernhardson) [13:46:26] hmmm [13:46:39] thrown for confluent-kafka confluent-kafka-connect and confluent-zookeeper [13:46:46] services [13:46:48] puppet (4.8.2-5~bpo8+1) jessie-backports; urgency=medium [13:46:49] * Rebuild for jessie-backports. [13:46:49] * Switch the default service provider to `debian' and fix SysV service [13:46:52] enable/disable under systemd. [13:47:01] so apparently the provider isn't systemd in jessie, interesting [13:47:38] 10Operations, 10Traffic, 10Continuous-Integration-Config: CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10BBlack) Shouldn't the container be able to puppetize from `authdns::lint` directly, which would provide all the pathways for updating the package/config/... [13:47:47] 10Operations, 10Analytics, 10Traffic: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10mforns) a:05elukey>03mforns [13:48:00] (03CR) 10DCausse: elasticsearch_cluster: multi-cluster support (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [13:48:09] herron: ok, I take it back then... but do leave a comment that the provider line is only needed for jessie! 
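[editor's note] The exchange above comes down to one resource attribute, so here is a minimal sketch of what the discussed fix might look like. The real resource is in modules/confluent/manifests/kafka/common.pp and is not shown in this log, so the layout below is an assumption. Only the resource title, the `enable => mask` value, and the `provider => systemd` override come from the conversation and the quoted error message.

```puppet
# Hypothetical sketch, not the actual contents of common.pp.
service { 'confluent-kafka':
  # Masking a unit is only supported by the systemd provider.
  enable   => 'mask',
  # Only needed on jessie: per the changelog quoted above, the
  # jessie-backports puppet package (4.8.2-5~bpo8+1) defaults to the
  # 'debian' service provider, which fails with "undefined method `mask'".
  provider => 'systemd',
}
```

According to the log, the same error was raised for confluent-kafka-connect and confluent-zookeeper, so those services would presumably need the same override.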
[13:48:30] (03PS2) 10BBlack: Fix authdns-lint for 3.x [puppet] - 10https://gerrit.wikimedia.org/r/468578 (https://phabricator.wikimedia.org/T205439) [13:49:05] also, any stretch plans for logstash? ;) [13:49:18] (03PS1) 10Alex Monk: Fix debian control Maintainer field [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/468580 [13:50:04] hehe will do [13:50:28] (03CR) 10Giuseppe Lavagetto: [C: 031] "Let's merge this!" [puppet] - 10https://gerrit.wikimedia.org/r/467734 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [13:50:43] <_joe_> jijiki: let's merge your change? [13:51:37] to me makes sense to handle stretch upgrade in conjunction with new hardware [13:51:43] (03CR) 10Banyek: [C: 031] "As I understand what is happening here LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/467264 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [13:51:50] but open to other ideas/timing of course [13:54:13] (03CR) 10Vgutierrez: [C: 032] Fix debian control Maintainer field [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/468580 (owner: 10Alex Monk) [13:55:40] (03PS3) 10BBlack: add reflect.wikimedia.org for debug [dns] - 10https://gerrit.wikimedia.org/r/468570 [13:56:39] (03CR) 10jenkins-bot: Fix debian control Maintainer field [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/468580 (owner: 10Alex Monk) [13:56:56] (03CR) 10DCausse: elasticsearch_cluster: multi-cluster support (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [13:56:58] (03CR) 10BBlack: [C: 032] add reflect.wikimedia.org for debug [dns] - 10https://gerrit.wikimedia.org/r/468570 (owner: 10BBlack) [13:57:10] bblack: oooh fancy! 
[13:57:33] I've been using o-o.myaddr.l.google.com [13:57:40] and before that Akamai's [13:57:52] we've had the capability forever, I guess we just never bothered to add the record here [13:58:06] probably I got lost on how CI config works at some past point and gave up too quickly :) [13:58:18] !log Uploaded certcentral 0.2 to apt.wikimedia.org (stretch) - T207457 [13:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:23] T207457: Allow validation_dns_servers to be a list of hostnames - https://phabricator.wikimedia.org/T207457 [13:58:34] I think Akamai's gave you both your recursor's IP, and edns-client-subnet or something [13:59:08] yeah, we have those options too [13:59:19] (03CR) 10Filippo Giunchedi: "LGTM, this will help for sure in general since graphite receives many packets/s, though I believe the fix for UDP drops will be to bump th" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/468388 (https://phabricator.wikimedia.org/T196484) (owner: 10Cwhite) [13:59:20] hmmm so this is an A, not a TXT? [13:59:31] that's a bit scary isn't it? [13:59:51] _joe_: yep [13:59:53] I wonder if you can exploit it and issue a cert for reflect.wikimedia.org or something :P [14:00:02] hah [14:00:03] I guess only if you're sitting in LE's network :) [14:00:36] paravoid: by default, ours gives ednsc info if available, or the cache's IP otherwise [14:00:51] (it can be configured to do just one or the other, but this matches what our GeoDNS will match on) [14:02:12] akamai's was whoami.akamai.net [14:02:20] apparently they switched it to a separate domain: https://developer.akamai.com/blog/2018/05/10/introducing-new-whoami-tool-dns-resolver-information [14:02:39] paravoid: probably the most attacky pathway would be to own a 3rd party recursor that LE was using for DV checks. You could probably do it then. 
[14:02:45] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, added a couple of folks too for signoff" [puppet] - 10https://gerrit.wikimedia.org/r/466905 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [14:02:52] but then, they shouldn't be doing DV or CAA checks through 3rd party recursor caches [14:03:08] well yeah, if you do that, then it doesn't matter if this record exists or not [14:03:24] you can just manipulate en.wikipedia.org :P [14:04:21] (03CR) 10Effie Mouzeli: Added new role::redis::misc for general purposes redis servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/467734 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [14:04:37] this (= reflect.wm.org A) may be more of a vuln with DNSSEC I suppose [14:05:08] if our NSes authoritatively sign that reflect.wikimedia.org resolves to e.g. your hotspot's NAT IP, stuff like that/ [14:05:35] ? [14:05:55] what? [14:05:56] I don't see how DNSSEC would make it worse, it's still local to cache, or an ednsc split of a cache [14:06:26] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::beta_sites: convert to using mediawiki::web::vhost [puppet] - 10https://gerrit.wikimedia.org/r/468583 [14:06:44] <_joe_> Krenair: ^^ [14:06:52] <_joe_> as promised, it's coming :P [14:07:21] I'm trying to think of attack scenarios :) [14:07:29] (03CR) 10Gehel: [C: 04-1] "minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe) [14:07:32] (03CR) 10Effie Mouzeli: [C: 032] Added new role::redis::misc for general purposes redis servers [puppet] - 10https://gerrit.wikimedia.org/r/467734 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [14:07:44] I think they all involve some sec-failure independent of this [14:07:44] (03PS1) 10Marostegui: db-codfw.php: Remove useless comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468584 [14:08:04] (03PS12) 10Effie Mouzeli: Added new 
role::redis::misc for general purposes redis servers [puppet] - 10https://gerrit.wikimedia.org/r/467734 (https://phabricator.wikimedia.org/T206450) [14:08:33] paravoid: also, in case it wasn't clearly implied by the above: in the case that ednsc is used, the response is also edns-scoped. [14:08:33] _joe_, yay [14:08:41] yeah, got it [14:09:00] so that goes through DYN plugins, so I suppose TXT is not even possible right now, right? [14:09:09] given there's no DYNTXT IIRC? [14:09:53] (03CR) 10Marostegui: [C: 032] db-codfw.php: Remove useless comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468584 (owner: 10Marostegui) [14:10:35] paravoid: yeah [14:10:45] oh well [14:11:24] (03Merged) 10jenkins-bot: db-codfw.php: Remove useless comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468584 (owner: 10Marostegui) [14:12:42] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove useless comments (duration: 00m 54s) [14:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:09] !log disconnecting s4 replication on dbstore2002 (T204930) [14:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:12] T204930: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930 [14:18:26] PROBLEM - HHVM jobrunner on mw1301 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [14:19:36] RECOVERY - HHVM jobrunner on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.004 second response time [14:20:11] (03CR) 10jenkins-bot: db-codfw.php: Remove useless comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468584 (owner: 10Marostegui) [14:21:50] (03PS10) 10Gehel: scap::target: added additional_services_names param [puppet] - 10https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: 10Mathew.onipe) [14:25:52] (03CR) 10Banyek: "I did not checked across all the 
roles, but here's some sample output" [puppet] - 10https://gerrit.wikimedia.org/r/467264 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [14:30:53] PROBLEM - Check systemd state on rdb1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:32:23] (03PS1) 10Effie Mouzeli: redis::misc Fixed typos [puppet] - 10https://gerrit.wikimedia.org/r/468586 (https://phabricator.wikimedia.org/T206450) [14:32:44] PROBLEM - puppet last run on rdb1009 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 12 minutes ago with 5 failures. Failed resources (up to 3 shown): Service[redis-instance-tcp_6378],Service[redis-instance-tcp_6379],Service[redis-instance-tcp_6380],Service[redis-instance-tcp_6381] [14:33:41] (03CR) 10Giuseppe Lavagetto: [C: 031] "remove the whitespace in master.yaml, but LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/468586 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [14:33:55] rdb1009 is me, please ignore [14:36:01] 10Operations, 10Traffic, 10Continuous-Integration-Config, 10Patch-For-Review: CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10hashar) None of the containers get provisioned via puppet. For CI puppet was used mostly to provide a list of packages. The variou... 
[14:37:07] (03CR) 10Gehel: "puppet compiler is mostly happy: https://puppet-compiler.wmflabs.org/compiler1001/13110/" [puppet] - 10https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: 10Mathew.onipe) [14:38:23] (03PS2) 10Effie Mouzeli: redis::misc Fixed typos [puppet] - 10https://gerrit.wikimedia.org/r/468586 (https://phabricator.wikimedia.org/T206450) [14:39:40] (03CR) 10Effie Mouzeli: redis::misc Fixed typos (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/468586 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [14:40:00] (03CR) 10Effie Mouzeli: [C: 032] redis::misc Fixed typos [puppet] - 10https://gerrit.wikimedia.org/r/468586 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [14:40:04] PROBLEM - Check health of redis instance on 6378 on rdb1009 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6378 [14:41:12] (03PS1) 10Mforns: Fine tune eventlogging_to_druid_job spark and druid parameters [puppet] - 10https://gerrit.wikimedia.org/r/468588 (https://phabricator.wikimedia.org/T206342) [14:41:54] PROBLEM - Check health of redis instance on 6379 on rdb1009 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [14:43:22] 10Operations, 10DBA, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (10Paladox) Gerrit is now 100% NoteDB from 2.16 see https://twitter.com/GerritReview/stat... 
[14:43:43] PROBLEM - Check health of redis instance on 6380 on rdb1009 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6380 [14:45:34] PROBLEM - Check health of redis instance on 6381 on rdb1009 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6381 [14:46:00] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10monitoring: Releases Jenkins Icinga check failing after restricting access - https://phabricator.wikimedia.org/T206579 (10hashar) it is not a firewall issue, the server is reacheable and responds with a `403`. Locally whe... [14:46:04] 10Operations, 10DBA, 10Gerrit, 10Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (10Paladox) [14:46:06] 10Operations, 10Elasticsearch, 10Icinga, 10Discovery-Search (Current work), 10Patch-For-Review: reconfigure Icinga alert for elasticsearch_shard_size to reduce false positive alerts - https://phabricator.wikimedia.org/T206187 (10debt) 05Open>03Resolved [14:46:18] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Refactor 'use_git_deploy' in wdqs puppet module to cater for scap3 and autodeployment modes - https://phabricator.wikimedia.org/T206597 (10debt) 05Open>03Resolved [14:46:34] RECOVERY - Check health of redis instance on 6381 on rdb1009 is OK: OK: REDIS 3.2.6 on 127.0.0.1:6381 has 0 databases (), up 14 seconds [14:46:42] (03CR) 10Dzahn: "blocked per linked ticket, right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/467525 (https://phabricator.wikimedia.org/T207052) (owner: 10Urbanecm) [14:46:53] RECOVERY - Check health of redis instance on 6380 on rdb1009 is OK: OK: REDIS 3.2.6 on 127.0.0.1:6380 has 0 databases (), up 29 seconds [14:47:04] RECOVERY - Check health of redis instance on 6379 on rdb1009 is OK: OK: REDIS 3.2.6 on 127.0.0.1:6379 has 0 databases (), up 45 seconds [14:47:13] RECOVERY - Check systemd state on rdb1009 is OK: OK - running: The system is fully operational [14:47:14] RECOVERY - Check health of redis instance on 6378 on rdb1009 is OK: OK: REDIS 3.2.6 on 127.0.0.1:6378 has 0 databases (), up 56 seconds [14:47:53] RECOVERY - puppet last run on rdb1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:55:45] (03PS1) 10Dzahn: releases/icinga: fix releases-jenkins URL check by adding /login/ path [puppet] - 10https://gerrit.wikimedia.org/r/468594 (https://phabricator.wikimedia.org/T206579) [14:56:43] 10Operations, 10ops-eqiad, 10Analytics: analytics1068 doesn't boot - https://phabricator.wikimedia.org/T203244 (10Cmjohnson) @elukey Finally, they agreed to replace the mother board. This should happen Monday or Tues next week. [14:57:04] (03PS2) 10Mforns: Fine tune eventlogging_to_druid_job spark and druid parameters [puppet] - 10https://gerrit.wikimedia.org/r/468588 (https://phabricator.wikimedia.org/T206342) [14:57:37] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) The update, I received an email from HP last night, they are sending 4 new disks. 
[15:01:52] (03PS3) 10Mforns: Fine tune eventlogging_to_druid_job spark and druid parameters [puppet] - 10https://gerrit.wikimedia.org/r/468588 (https://phabricator.wikimedia.org/T206342) [15:02:15] 10Operations, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10jijiki) [15:03:23] (03PS1) 10Alex Monk: certcentral: rm pinkunicorn wildcard [puppet] - 10https://gerrit.wikimedia.org/r/468595 [15:05:38] 10Operations, 10Certcentral, 10Traffic: Create production LE account - https://phabricator.wikimedia.org/T207476 (10Krenair) p:05Triage>03Normal [15:13:18] (03Abandoned) 10Vgutierrez: certcentral: List validation DNS servers as IPs instead of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/468551 (owner: 10Alex Monk) [15:14:40] (03CR) 10Dzahn: [C: 032] releases/icinga: fix releases-jenkins URL check by adding /login/ path [puppet] - 10https://gerrit.wikimedia.org/r/468594 (https://phabricator.wikimedia.org/T206579) (owner: 10Dzahn) [15:15:18] cmjohnson1: \o/ [15:15:19] finally [15:15:20] :) [15:16:19] (03CR) 10Mforns: Fine tune eventlogging_to_druid_job spark and druid parameters (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/468588 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [15:17:57] (03PS2) 10Alex Monk: certcentral: rm pinkunicorn wildcard [puppet] - 10https://gerrit.wikimedia.org/r/468595 [15:19:00] (03CR) 10Vgutierrez: [C: 032] certcentral: rm pinkunicorn wildcard [puppet] - 10https://gerrit.wikimedia.org/r/468595 (owner: 10Alex Monk) [15:19:12] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10mforns) [15:20:44] (03CR) 10Ottomata: [C: 031] Fine tune eventlogging_to_druid_job spark and druid parameters [puppet] - 10https://gerrit.wikimedia.org/r/468588 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) 
[15:21:28] (03PS3) 10Alex Monk: certcentral: rm pinkunicorn wildcard [puppet] - 10https://gerrit.wikimedia.org/r/468595 [15:22:26] (03PS1) 10Filippo Giunchedi: Rebuild for jessie-wikimedia [debs/rsyslog] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/468600 [15:22:28] (03PS1) 10Filippo Giunchedi: Drop mongodb/relp/czmq integrations, not used at WMF and missing/old from jessie(-backports) [debs/rsyslog] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/468601 [15:22:30] (03PS1) 10Filippo Giunchedi: Build-depend on newer librdkafka 0.11 [debs/rsyslog] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/468602 [15:22:32] (03PS1) 10Filippo Giunchedi: Enable mmkubernetes (build depends on libcurl and liblognorm) [debs/rsyslog] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/468603 [15:23:50] (03PS1) 10Alex Monk: [WIP] Ditch certcentral config template, configure in puppet [puppet] - 10https://gerrit.wikimedia.org/r/468604 [15:24:44] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Ditch certcentral config template, configure in puppet [puppet] - 10https://gerrit.wikimedia.org/r/468604 (owner: 10Alex Monk) [15:26:39] (03CR) 10Cwhite: icinga: add puppet types for parameters (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/468468 (owner: 10Dzahn) [15:27:13] (03PS2) 10Alex Monk: Ditch certcentral config template, configure in puppet [puppet] - 10https://gerrit.wikimedia.org/r/468604 [15:28:28] vgutierrez, ^ mind running PCC on that? [15:29:47] Krenair: https://puppet-compiler.wmflabs.org/compiler1002/13112/ [15:30:11] heh ok [15:31:01] (03PS3) 10Alex Monk: Ditch certcentral config template, configure in puppet [puppet] - 10https://gerrit.wikimedia.org/r/468604 [15:31:04] vgutierrez, ^ [15:31:48] last till Monday :) [15:32:50] https://puppet-compiler.wmflabs.org/compiler1002/13113/ [15:33:48] okay it looks equivalent [15:34:01] different formatting, indentation and ordering etc. 
[15:34:03] but still [15:35:07] (03CR) 10Alex Monk: "Valentin ran PCC, it looks equivalent, just with different YAML formatting (indentation, ordering etc.) - https://puppet-compiler.wmflabs." [puppet] - 10https://gerrit.wikimedia.org/r/468604 (owner: 10Alex Monk) [15:37:17] RECOVERY - Check systemd state on kafkamon1001 is OK: OK - running: The system is fully operational [15:37:44] (03PS2) 10Alex Monk: mediawiki::web::beta_sites: convert to using mediawiki::web::vhost [puppet] - 10https://gerrit.wikimedia.org/r/468583 (https://phabricator.wikimedia.org/T1256) (owner: 10Giuseppe Lavagetto) [15:40:27] PROBLEM - Check systemd state on kafkamon1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:43:01] (03CR) 10Volans: [C: 04-1] "Some nitpicks on the data types inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/468468 (owner: 10Dzahn) [15:48:23] (03PS6) 10Cwhite: nagios_common: set flag -2 on check_nrpe for nrpe on stretch [puppet] - 10https://gerrit.wikimedia.org/r/466935 (https://phabricator.wikimedia.org/T202782) [15:48:56] (03CR) 10Volans: [C: 04-1] "Forgot to mention one thing" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/468468 (owner: 10Dzahn) [15:54:15] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10monitoring, 10Patch-For-Review: Releases Jenkins Icinga check failing after restricting access - https://phabricator.wikimedia.org/T206579 (10Dzahn) adding /login/ to the URL changes it from "403 Forbidden" to "404 Not F... 
[15:54:50] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10monitoring, 10Patch-For-Review: Releases Jenkins Icinga check failing after restricting access - https://phabricator.wikimedia.org/T206579 (10Dzahn) https://releases-jenkins.wikimedia.org/login/ -> HTTP ERROR 404 [15:55:38] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10monitoring, 10Patch-For-Review: Releases Jenkins Icinga check failing after restricting access - https://phabricator.wikimedia.org/T206579 (10Dzahn) Needs to be `/login` but not `/login/` [15:56:35] (03CR) 10Urbanecm: [C: 04-1] "Yep" [puppet] - 10https://gerrit.wikimedia.org/r/467525 (https://phabricator.wikimedia.org/T207052) (owner: 10Urbanecm) [15:57:02] (03PS1) 10Dzahn: releases/icinga: remove trailing slash in URL to fix Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/468610 (https://phabricator.wikimedia.org/T206579) [15:57:25] (03CR) 10Cwhite: [C: 032] "Changes look correct. 
https://puppet-compiler.wmflabs.org/compiler1002/13114/" [puppet] - 10https://gerrit.wikimedia.org/r/466935 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [15:57:33] (03PS2) 10Dzahn: releases/icinga: remove trailing slash in URL to fix Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/468610 (https://phabricator.wikimedia.org/T206579) [15:58:03] (03CR) 10Dzahn: [C: 032] releases/icinga: remove trailing slash in URL to fix Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/468610 (https://phabricator.wikimedia.org/T206579) (owner: 10Dzahn) [16:00:33] (03PS7) 10Dzahn: nagios_common: set flag -2 on check_nrpe for nrpe on stretch [puppet] - 10https://gerrit.wikimedia.org/r/466935 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [16:09:22] RECOVERY - HTTP releases-jenkins.wikimedia.org on releases2001 is OK: HTTP OK: HTTP/1.1 200 OK - 6611 bytes in 0.701 second response time [16:09:53] RECOVERY - HTTP releases-jenkins.wikimedia.org on releases1001 is OK: HTTP OK: HTTP/1.1 200 OK - 2829 bytes in 0.020 second response time [16:12:25] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10monitoring, 10Patch-For-Review: Releases Jenkins Icinga check failing after restricting access - https://phabricator.wikimedia.org/T206579 (10Dzahn) 05Open>03Resolved a:03Dzahn alright, fixed ! https://icinga.wiki... [16:13:01] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-deploy* - https://phabricator.wikimedia.org/T207487 (10Krenair) [16:16:33] (03CR) 10Alex Monk: "T207487 ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/467642 (owner: 10Giuseppe Lavagetto) [16:18:13] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-deploy* - https://phabricator.wikimedia.org/T207487 (10Dzahn) looks like @elukey made this as the fix: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/468563/ [16:18:29] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T207487" [puppet] - 10https://gerrit.wikimedia.org/r/468563 (owner: 10Elukey) [16:24:05] 10Operations, 10ops-eqiad, 10netops: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10RobH) So I just gave @papaul full rights to access CH2, identical to his rights in DA1 (so he should have no issues with access.) [16:27:49] (03PS1) 10CRusnov: nagios_common: add crusnov to contact groups; icinga: add crusnov to icinga authorizations [puppet] - 10https://gerrit.wikimedia.org/r/468612 (https://phabricator.wikimedia.org/T207009) [16:28:32] (03CR) 10jerkins-bot: [V: 04-1] nagios_common: add crusnov to contact groups; icinga: add crusnov to icinga authorizations [puppet] - 10https://gerrit.wikimedia.org/r/468612 (https://phabricator.wikimedia.org/T207009) (owner: 10CRusnov) [16:29:04] (03PS2) 10CRusnov: nagios_common: add crusnov to contact groups; icinga: add crusnov to icinga authorizations [puppet] - 10https://gerrit.wikimedia.org/r/468612 (https://phabricator.wikimedia.org/T207009) [16:29:51] (03CR) 10jerkins-bot: [V: 04-1] nagios_common: add crusnov to contact groups; icinga: add crusnov to icinga authorizations [puppet] - 10https://gerrit.wikimedia.org/r/468612 (https://phabricator.wikimedia.org/T207009) (owner: 10CRusnov) [16:29:54] PROBLEM - High load average on labstore1006 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [16:30:04] chaomodus: it doesn't like that the commit message is over 80 chars [16:30:16] (the first line of it) [16:32:08] ahh [16:32:12] yah [16:32:45] chaomodus: see 
the link in the CR added by jenkins and scroll a bit, it's not the most readable one, but tells you what's wrong ;) [16:32:58] yes i see [16:33:26] ^ i usually do "Ctrl + F" and search for "error" inside that link because it's hard to see [16:35:01] i think you need to use "CRusnov" here [16:35:09] that's the "cn" field in LDAP [16:35:16] it'd be the login name yes? [16:35:18] ah [16:35:42] not the shellname or whatever? that's what i use to login [16:36:07] LDAP username [16:36:16] the LDAP user needs to match the Icinga contact [16:36:22] is the one you login with and is the one icinga recognize you with [16:36:25] and check the permissions [16:36:38] !log deactivate BGP to 15426 in ams-ix (down and no reply to emails) - T207428 [16:36:40] i always got it wrong at first attempt .. then you get a login but no permissions [16:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:42] T207428: cr2-esams - BGP WARNING - AS15426/IPv4 - https://phabricator.wikimedia.org/T207428 [16:36:55] this is how i checked LDAP fields btw: [16:36:55] mutante: if the contact is wrong is my fault ;) [16:37:00] and might be [16:37:00] [mwmaint1002:~] $ ldaplist -l passwd crusn [16:37:02] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 407, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:37:18] (03CR) 10Alex Monk: "T207495" [puppet] - 10https://gerrit.wikimedia.org/r/467661 (https://phabricator.wikimedia.org/T205452) (owner: 10Mobrovac) [16:37:21] eh. with the full username of course [16:38:14] 10Operations, 10Recommendation-API, 10Research, 10Core Platform Team Kanban (Doing), and 3 others: Setup access from service to mysql - https://phabricator.wikimedia.org/T205452 (10Krenair) >>! In T205452#4669764, @mobrovac wrote: > @jcrespo has the same DB been set up in Beta as well? Can you share the cr... 
[16:38:22] 10Operations, 10netops: cr2-esams - BGP WARNING - AS15426/IPv4 - https://phabricator.wikimedia.org/T207428 (10ayounsi) 05Open>03Resolved a:03ayounsi Thanks for following the doc! That BGP peer has been deactivated as it never replied to our notification. [16:39:12] yea, you might have to rename the contact in private repo [16:39:22] (03CR) 10Alex Monk: "T207488" [puppet] - 10https://gerrit.wikimedia.org/r/465608 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [16:40:30] and the extra caveat is that simple auth doesn't care about capitalization but Icinga internally does.. for the contact name [16:40:40] mutante: my contact_name in icinga is lowercase, but my LDAP is ucfirst [16:40:48] so you can be logged in but still get no privs because it's the wrong capitalization [16:40:53] yeah so you can keep the contact name [16:40:58] and just use the right one in icinga config [16:40:59] :D [16:41:44] heh, ok, confusing each time we add one :) [16:42:06] indeed [16:42:16] i think we had at least 2 changes for almost all of them [16:42:25] so dont worry [16:46:13] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 59.4 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:48:22] (03PS3) 10Dzahn: mediawiki/apache: add za.wikimedia.org as prod_site [puppet] - 10https://gerrit.wikimedia.org/r/468074 (https://phabricator.wikimedia.org/T195926) [16:48:32] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 70.15 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:48:44] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-deploy* - https://phabricator.wikimedia.org/T207487 (10Krenair) a:03elukey great! 
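The CI failure explained at 16:30 (the first line of the commit message was over 80 characters) can be caught locally before uploading a patch set. A minimal sketch, assuming a plain POSIX shell; the function name `check_subject_length` is hypothetical and is not part of the jenkins job mentioned in the chat:

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the first line of a commit message
# is at most 80 characters. The 80-character limit comes from the chat;
# everything else here is an assumption for illustration.
check_subject_length() {
    # Keep only the first line (the subject) of the message.
    subject=$(printf '%s\n' "$1" | head -n 1)
    if [ "${#subject}" -gt 80 ]; then
        echo "subject line is ${#subject} chars (max 80)" >&2
        return 1
    fi
    return 0
}
```

One possible use is `check_subject_length "$(git log -1 --format=%B)"` on the commit about to be pushed. Only the subject line is measured, so a long commit body does not trigger it.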
[16:48:54] (03PS6) 10Alex Monk: scap::scripts: conditionally require mediawiki::users [puppet] - 10https://gerrit.wikimedia.org/r/468563 (https://phabricator.wikimedia.org/T207487) (owner: 10Elukey) [16:52:49] (03PS3) 10CRusnov: icinga: add crusnov to icinga stuff [puppet] - 10https://gerrit.wikimedia.org/r/468612 (https://phabricator.wikimedia.org/T207009) [16:53:35] (03CR) 10jerkins-bot: [V: 04-1] icinga: add crusnov to icinga stuff [puppet] - 10https://gerrit.wikimedia.org/r/468612 (https://phabricator.wikimedia.org/T207009) (owner: 10CRusnov) [16:55:24] RECOVERY - High load average on labstore1006 is OK: OK: Less than 85.00% above the threshold [16.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [16:58:09] 10Operations, 10fundraising-tech-ops, 10netops: add icinga1001 to send_nsca and pfw rules in FRACK - https://phabricator.wikimedia.org/T207175 (10ayounsi) a:03ayounsi From IRC, the Juniper seems to include more than what's mentioned in the description, at least: saiph renaming and betelgeuse removal I can... [17:01:28] 10Operations, 10Citoid, 10Services, 10Patch-For-Review, and 3 others: Deploy translation-server-v2 - https://phabricator.wikimedia.org/T201611 (10thcipriani) I was able to generate the image `docker-registry.wikimedia.org/wikimedia/mediawiki-services-zotero:20181019165254-production` which runs, and listen... 
[17:05:52] 10Operations, 10Traffic: Document eqsin power connections in Netbox - https://phabricator.wikimedia.org/T207138 (10ayounsi) Opened https://github.com/digitalocean/netbox/issues/2531 [17:08:30] (03Abandoned) 10Cwhite: diamond: remove nagios collector [puppet] - 10https://gerrit.wikimedia.org/r/468480 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [17:08:37] (03Abandoned) 10Cwhite: keyholder, diamond: remove nagios collector and diamond [puppet] - 10https://gerrit.wikimedia.org/r/468481 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [17:11:27] (03CR) 10Cwhite: hiera: diamond::remove on openstack control role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [17:13:42] (03CR) 10Dzahn: [C: 032] mediawiki/apache: add za.wikimedia.org as prod_site [puppet] - 10https://gerrit.wikimedia.org/r/468074 (https://phabricator.wikimedia.org/T195926) (owner: 10Dzahn) [17:13:48] 10Operations, 10ops-eqiad, 10Dumps-Generation: Move dumpsdata1001 - https://phabricator.wikimedia.org/T207278 (10ayounsi) Thank you! I created a calendar event. Feel free to move it around if any issue arise. [17:25:44] 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review: Wikimedia Foundation to host Wikimedia South Africa sites - https://phabricator.wikimedia.org/T195926 (10Dzahn) https://za.wikimedia.org https://za.m.wikimedia.org have been created in DNS and added as a server alias to Apache cluster config... 
[17:28:00] 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review: Wikimedia Foundation to host Wikimedia South Africa sites - https://phabricator.wikimedia.org/T195926 (10Dzahn) p:05Triage>03Normal [17:28:41] 10Operations, 10Wiki-Setup, 10Wikimedia-General-or-Unknown: Wikimedia Foundation to host Wikimedia South Africa sites - https://phabricator.wikimedia.org/T195926 (10Dzahn) [17:29:01] 10Operations, 10Wikimedia-General-or-Unknown, 10User-Urbanecm, 10Wiki-Setup (Create): Wikimedia Foundation to host Wikimedia South Africa sites - https://phabricator.wikimedia.org/T195926 (10Dzahn) [17:40:57] (03PS3) 10Cwhite: graphite: add interface::rps settings to graphite hosts [puppet] - 10https://gerrit.wikimedia.org/r/468388 (https://phabricator.wikimedia.org/T196484) [17:43:01] (03PS3) 10Dzahn: Add shn to langs.tmpl [dns] - 10https://gerrit.wikimedia.org/r/467080 (https://phabricator.wikimedia.org/T206777) (owner: 10Urbanecm) [17:43:47] (03CR) 10Dzahn: [C: 031] "https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Shan" [dns] - 10https://gerrit.wikimedia.org/r/467080 (https://phabricator.wikimedia.org/T206777) (owner: 10Urbanecm) [17:44:10] 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (10ayounsi) This is also the reason we have to have the following route on cr1/2-eqiad `static rout... 
[17:44:49] (03CR) 10Dzahn: [C: 032] "https://en.wikipedia.org/wiki/Shan_language" [dns] - 10https://gerrit.wikimedia.org/r/467080 (https://phabricator.wikimedia.org/T206777) (owner: 10Urbanecm) [17:47:12] !log DNS - 'authdns-gen-zones -f /srv/authdns/git/templates /etc/gdnsd/zones && gdnsd checkconf && gdnsd reload-zones' - needed when adding new languages to langs.tmpl - adding "shn" (Shan language) T206777 [17:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:16] T206777: Create Wikipedia Shan - https://phabricator.wikimedia.org/T206777 [17:50:58] bblack: is "gdnsd reload-zones" outdated by any chance? [17:51:32] (03CR) 10GTirloni: [C: 032] sonofgridengine: Correct and expand things enough to deploy a grid master [puppet] - 10https://gerrit.wikimedia.org/r/468462 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [17:51:35] i did the usual commands needed when adding a new language to langs.tmpl, but only on ns2 so far [17:52:18] but i got gdns's Usage info for the last part of it [18:04:33] 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10jkim_wikimedia) Hey @cwdent, am i doing something wrong here? {F26636825} [18:07:55] mutante, that might be under gdnsdctl now? [18:07:58] one sec [18:09:31] yeah looks like that might be it [18:09:59] oh, thanks, i had no idea it changed. we need to update docs [18:10:15] i will have this confirmed though before just running things.. 
a bit too scary [18:10:26] (to recreate all the DNS zones on Friday) [18:10:45] https://github.com/blblack/gdnsd/commit/23f237f0ed26f9f0cb1994a2d23b94250d706bfa [18:10:50] thx [18:11:00] gdnsdctl reload-zones [18:11:52] ok [18:22:07] (03PS3) 10Bstorm: sonofgridengine: Correct and expand things enough to deploy a grid master [puppet] - 10https://gerrit.wikimedia.org/r/468462 (https://phabricator.wikimedia.org/T200557) [18:50:18] (03Abandoned) 10Dduvall: Project clone URLs based on access control [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/467843 (owner: 10Dduvall) [18:58:33] (03PS3) 10Andrew Bogott: nova: update scheduling pools for main and eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/468377 [18:58:35] (03PS1) 10Andrew Bogott: openstack clients: use OS_PROJECT_ID instead of OS_TENANT_NAME [puppet] - 10https://gerrit.wikimedia.org/r/468627 [18:58:37] (03PS1) 10Andrew Bogott: cold-migrate: make region aware [puppet] - 10https://gerrit.wikimedia.org/r/468628 [19:00:04] (03CR) 10Andrew Bogott: [C: 032] openstack clients: use OS_PROJECT_ID instead of OS_TENANT_NAME [puppet] - 10https://gerrit.wikimedia.org/r/468627 (owner: 10Andrew Bogott) [19:00:15] (03CR) 10Andrew Bogott: [C: 032] cold-migrate: make region aware [puppet] - 10https://gerrit.wikimedia.org/r/468628 (owner: 10Andrew Bogott) [19:01:43] mutante: yeah, sorry, I hadn't updated those docs. arguably you could skip the checkconf step too, and s/gdnsd reload-zones/gdnsdctl reload-zones/ [19:02:22] (checkconf happens implicitly anyways as part of the reload, and the reload will fail (error msg + nonzero exit code) if things aren't right, and leave all the existing runtime data as-is [19:02:26] ) [19:04:13] (03CR) 10Dzahn: "hmm.. seeing this in compiler https://puppet-compiler.wmflabs.org/compiler1002/13117/install2002.wikimedia.org/change.install2002.wikimedi" [puppet] - 10https://gerrit.wikimedia.org/r/467982 (owner: 10Muehlenhoff) [19:04:26] bblack: alright! 
thanks, let me change the wikitech page [19:05:41] fail, enabled 2fa but dont have the factor :) [19:05:55] fixing that first, heh [19:10:12] (03PS4) 10Andrew Bogott: nova: update scheduling pools for main and eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/468377 [19:10:14] (03PS1) 10Andrew Bogott: horizon: move 'planet' and 'general-k8s' to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/468629 (https://phabricator.wikimedia.org/T204745) [19:11:45] (03CR) 10Andrew Bogott: [C: 032] horizon: move 'planet' and 'general-k8s' to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/468629 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott) [19:12:44] !log labweb1001 / wikitech - disabling 2fa for myself, logging in, re-enabling it again [19:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:11] !log ns2/multatuli - gdnsdctl reload-zones [19:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:52] bblack: updated docs and running it on ns0 and ns1 then https://wikitech.wikimedia.org/w/index.php?title=Add_a_wiki&diff=prev&oldid=1806491 [19:18:00] we are adding "shn" which is "Shan language" [19:18:14] (03PS1) 10Mathew.onipe: maps: remove osmupdater and osmimporter hiera passwords [labs/private] - 10https://gerrit.wikimedia.org/r/468631 (https://phabricator.wikimedia.org/T206639) [19:20:57] !log ns0 / ns1 - authdns-gen-zones -f /srv/authdns/git/templates /etc/gdnsd/zones && gdnsdctl reload-zones - to add new language shn (T206777) [19:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:03] T206777: Create Wikipedia Shan - https://phabricator.wikimedia.org/T206777 [19:28:53] (03PS2) 10Bstorm: labstore: correct the service name for stretch [puppet] - 10https://gerrit.wikimedia.org/r/467969 (https://phabricator.wikimedia.org/T203254) [19:46:27] (03PS5) 10Andrew Bogott: nova: update scheduling pools for main and eqiad1 [puppet] -
10https://gerrit.wikimedia.org/r/468377 [19:46:29] (03PS1) 10Andrew Bogott: horizon: move striker to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/468634 (https://phabricator.wikimedia.org/T204745) [19:52:58] (03CR) 10Andrew Bogott: [C: 032] horizon: move striker to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/468634 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott) [19:56:05] (03Restored) 10Cwhite: keyholder, diamond: remove nagios collector and diamond [puppet] - 10https://gerrit.wikimedia.org/r/468481 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [19:56:30] (03Restored) 10Cwhite: diamond: remove nagios collector [puppet] - 10https://gerrit.wikimedia.org/r/468480 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [20:04:55] twentyafterfour: Thanks for cherry-picking the watchlist fix, I was getting lunch [20:05:05] Here now to test it once you deploy it [20:05:16] RoanKattouw: thanks, waiting for CI now [20:13:37] 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10cwdent) @jkim_wikimedia it looks like the file isn't there. I am not personally knowledgeable about Apple computer... [20:24:28] RoanKattouw: should we test on mwdebug first or just sync it? [20:24:57] (the patch is merged and on deployment host) [20:26:15] Let's test on 1002 [20:26:40] ok [20:27:07] I have wikiversions setting enwiki to wmf.24 on 1001 so I'd like to test alongside that too [20:27:17] RoanKattouw: ok it's on 1002 [20:28:09] !log deployed https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/468636/ to mwdebug1002 [20:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:01] OK looking [20:29:57] twentyafterfour: 1001 and 1002 look identical, which is what I was hoping for. Let's roll [20:31:10] RoanKattouw: cool thanks for testing! 
[20:31:29] !log deploying https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/468636/ to the full cluster. [20:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:49] !log twentyafterfour@deploy1001 Synchronized php-1.32.0-wmf.26/resources/src/mediawiki.rcfilters/styles/: sync https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/468636/ (duration: 00m 54s) [20:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:37] !log deployed RCFilters: Fix completely broken highlight circles refs T207472 [20:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:40] T207472: Improved watchlist is no longer showing seen indicators and point types have become small - https://phabricator.wikimedia.org/T207472 [20:37:26] OK I found a bug but it's minor and I think it's pre-existing [20:37:31] Otherwise everything works great! Thanks! [20:39:03] nice! :) [20:53:10] Turns out it wasn't pre-existing, but it's too minor to do a Friday deployment for. I'll SWAT it on Monday [21:24:02] PROBLEM - HHVM jobrunner on mw1306 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.006 second response time [21:25:12] RECOVERY - HHVM jobrunner on mw1306 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [21:32:24] (03PS1) 10Cwhite: prometheus: add scrapes for apache on phabricator instances [puppet] - 10https://gerrit.wikimedia.org/r/468677 (https://phabricator.wikimedia.org/T183454) [23:51:55] (03PS1) 10Bstorm: sonofgridengine: making some adjustments to get puppet to pass [puppet] - 10https://gerrit.wikimedia.org/r/468685 [23:54:09] (03Abandoned) 10Bstorm: sonofgridengine: making some adjustments to get puppet to pass [puppet] - 10https://gerrit.wikimedia.org/r/468685 (owner: 10Bstorm) [23:55:16] (03PS1) 10Bstorm: sonofgridengine: remove non-working hiera file [puppet] - 10https://gerrit.wikimedia.org/r/468686
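The DNS zone reload procedure changed during this session (18:07 to 19:21): `gdnsd reload-zones` only printed usage info, and the reload now goes through `gdnsdctl`. A minimal sketch of the updated sequence as it was logged for ns0/ns1 at 19:20:57; the `run` wrapper, the `DRY_RUN` variable and the function name `reload_dns_zones` are assumptions added for illustration and are not part of any Wikimedia tooling:

```shell
#!/bin/sh
# Sketch of the zone reload sequence from the 19:20:57 !log entry.
# run() and DRY_RUN are hypothetical conveniences: with DRY_RUN=1 the
# commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

reload_dns_zones() {
    # Regenerate the zone files from the templates, then reload them.
    # Per the chat, a separate "gdnsd checkconf" step can be skipped:
    # the reload checks the config itself, fails with an error message
    # and a nonzero exit code, and leaves the runtime data as-is.
    run authdns-gen-zones -f /srv/authdns/git/templates /etc/gdnsd/zones &&
        run gdnsdctl reload-zones
}
```

Running `DRY_RUN=1 reload_dns_zones` prints the two commands without touching the name server, which fits the caution voiced at 18:10 about recreating all DNS zones on a Friday.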