[01:24:33] Operations, Maps: Wikipedia Maps sync lag with OSM exceeded 5 days - https://phabricator.wikimedia.org/T237209 (Arjunaraoc)
[01:25:18] Operations, Maps: Wikipedia Maps replication failed - https://phabricator.wikimedia.org/T237209 (Arjunaraoc)
[01:26:31] RECOVERY - snapshot of s4 in eqiad on db1115 is OK: snapshot for s4 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2019-11-03 23:20:49 from db1102.eqiad.wmnet:3314 (1080 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[01:40:15] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:35] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:39:23] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:51:23] Operations, Traffic: Enforce POST size limit on ats-tls - https://phabricator.wikimedia.org/T236755 (Vgutierrez) Open→Resolved p:Triage→Normal
[03:51:28] Operations, Traffic, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[03:56:32] Operations, Traffic: Provide an easy way of picking the traffic serving TLS certificate used by ATS - https://phabricator.wikimedia.org/T234803 (Vgutierrez) Open→Stalled I'm marking this task as stalled, it will be resolved as soon as T231627 is completed
[03:56:35] Operations, Acme-chief, Traffic: Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users - https://phabricator.wikimedia.org/T230687 (Vgutierrez)
[04:05:52] (CR) Vgutierrez: 8.0.5-1wm10: fix #4635 with upstream patch (1 comment) [debs/trafficserver] - https://gerrit.wikimedia.org/r/547736 (owner: Ema)
[04:06:45] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:19:34] (PS1) Andrew Bogott: Move cloudbackup2002 from 10.192.16 to 10.192.32 [dns] - https://gerrit.wikimedia.org/r/548004 (https://phabricator.wikimedia.org/T224528)
[04:20:13] (CR) Andrew Bogott: [C: +2] Move cloudbackup2002 from 10.192.16 to 10.192.32 [dns] - https://gerrit.wikimedia.org/r/548004 (https://phabricator.wikimedia.org/T224528) (owner: Andrew Bogott)
[04:33:39] Operations, ops-codfw, Cloud-Services, Patch-For-Review: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (Andrew) Attached patch doesn't seem to make a difference, but also IP address doesn't matter until after the debian i...
[04:37:35] (PS1) Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp5008 [puppet] - https://gerrit.wikimedia.org/r/548007 (https://phabricator.wikimedia.org/T231627)
[04:37:37] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp5008 [puppet] - https://gerrit.wikimedia.org/r/548008 (https://phabricator.wikimedia.org/T231627)
[04:39:50] !log Switch from nginx to ats-tls on cp5008 - T231627
[04:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:39:56] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627
[04:40:26] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to port 4443 on cp5008 [puppet] - https://gerrit.wikimedia.org/r/548007 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[04:40:35] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:42:54] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp5008 [puppet] - https://gerrit.wikimedia.org/r/548008 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[04:48:21] PROBLEM - Maps - OSM synchronization lag - eqiad on icinga1001 is CRITICAL: 6.221e+05 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1
[04:50:17] Operations, Traffic, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[04:53:47] (PS1) Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp5009 [puppet] - https://gerrit.wikimedia.org/r/548009 (https://phabricator.wikimedia.org/T231627)
[04:53:49] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp5009 [puppet] - https://gerrit.wikimedia.org/r/548010 (https://phabricator.wikimedia.org/T231627)
[04:53:54] !log Switch from nginx to ats-tls on cp5009 - T231627
[04:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:53:58] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627
[04:54:45] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to port 4443 on cp5009 [puppet] - https://gerrit.wikimedia.org/r/548009 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[04:57:10] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp5009 [puppet] - https://gerrit.wikimedia.org/r/548010 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[05:05:32] Operations, Traffic, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[05:19:27] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:01:18] Operations, ops-codfw, Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (Papaul) I can see that the installation is in progress...
[06:03:22] Operations, ops-codfw, Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (Papaul) Note that cloudbackup2002.codfw.wmnet is still using the old mgmt password . Please update it to the new mgmt password. Thanks
[06:23:37] Operations, ops-codfw, Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (Papaul) OS install complete, first puppet run complete.
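Note on the 04:48:21 Icinga alert above: it reports `6.221e+05 ge 2.592e+05`, an observed value compared against a critical threshold, both in scientific notation. The sketch below converts the two numbers into days. It assumes the values are in seconds and that `ge` means "greater than or equal"; the log states neither, and the function names are hypothetical helpers, not part of Icinga or the Maps tooling.

```python
# Minimal sketch: interpret the two numbers from the OSM sync lag alert.
# ASSUMPTION: both values are seconds; the log does not state the unit.
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_to_days(seconds: float) -> float:
    """Convert a duration in seconds to days."""
    return seconds / SECONDS_PER_DAY


def is_critical(lag_seconds: float, threshold_seconds: float) -> bool:
    """Mirror the alert's 'ge' comparison (ASSUMED to mean >=)."""
    return lag_seconds >= threshold_seconds


observed_lag = 6.221e+05        # left-hand value in the alert text
critical_threshold = 2.592e+05  # right-hand value in the alert text

if __name__ == "__main__":
    print(f"observed lag: {seconds_to_days(observed_lag):.1f} days")
    print(f"threshold:    {seconds_to_days(critical_threshold):.1f} days")
    print(f"critical:     {is_critical(observed_lag, critical_threshold)}")
```

Under that assumption the threshold is exactly 3 days and the observed lag is about 7.2 days, which is in line with the "exceeded 5 days" task filed at 01:24:33.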
[06:24:28] Operations, ops-codfw, Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (Papaul)
[07:23:07] (PS1) Elukey: role::analytics_test_cluster::client: add git proxy configuration [puppet] - https://gerrit.wikimedia.org/r/548013
[07:23:19] (Abandoned) Elukey: Revert TLS MapReduce shuffle configuration for Hadoop Analytics [puppet] - https://gerrit.wikimedia.org/r/547590 (owner: Elukey)
[07:27:04] (CR) Elukey: [C: +2] role::analytics_test_cluster::client: add git proxy configuration [puppet] - https://gerrit.wikimedia.org/r/548013 (owner: Elukey)
[07:46:20] Operations, Patch-For-Review: Build cergen for buster - https://phabricator.wikimedia.org/T235405 (MoritzMuehlenhoff) >>! In T235405#5622193, @elukey wrote: > On puppetmaster2001 I cannot see /etc/apt/sources.list.d/buster-cergen.list, hence the new package version seems not available.. expected? Yes, i...
[08:09:24] (CR) Muehlenhoff: Add Icinga check for correct application of microcode mitigations (1 comment) [puppet] - https://gerrit.wikimedia.org/r/547747 (https://phabricator.wikimedia.org/T235250) (owner: Muehlenhoff)
[08:10:22] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:13] (CR) Vgutierrez: [V: +2 C: +2] fifo-log-tailer: Retry on errors [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/539312 (owner: Vgutierrez)
[08:23:14] (CR) Filippo Giunchedi: [C: +1] Add Icinga check for correct application of microcode mitigations [puppet] - https://gerrit.wikimedia.org/r/547747 (https://phabricator.wikimedia.org/T235250) (owner: Muehlenhoff)
[08:23:16] (CR) Effie Mouzeli: [C: +2] logging: remove hhvm references [puppet] - https://gerrit.wikimedia.org/r/547489 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[08:23:33] (PS6) Effie Mouzeli: logging: remove hhvm references [puppet] - https://gerrit.wikimedia.org/r/547489 (https://phabricator.wikimedia.org/T229792)
[08:27:44] (PS1) Muehlenhoff: Add bawolff to ldap users [puppet] - https://gerrit.wikimedia.org/r/548062 (https://phabricator.wikimedia.org/T236636)
[08:27:53] (CR) Filippo Giunchedi: [C: +1] netops: add host monitoring for scs systems (serial console servers) [puppet] - https://gerrit.wikimedia.org/r/547752 (https://phabricator.wikimedia.org/T233318) (owner: Herron)
[08:28:29] (CR) Jcrespo: "> Patch Set 2: Code-Review-1" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/547569 (https://phabricator.wikimedia.org/T221083) (owner: Jbond)
[08:30:29] (CR) Muehlenhoff: [C: +2] Add bawolff to ldap users [puppet] - https://gerrit.wikimedia.org/r/548062 (https://phabricator.wikimedia.org/T236636) (owner: Muehlenhoff)
[08:30:39] (PS1) Vgutierrez: Release fifo-log-demux 0.6 [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/548137
[08:31:32] (CR) Filippo Giunchedi: [C: +1] netops: add host monitoring for scs systems (serial console servers) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/547752 (https://phabricator.wikimedia.org/T233318) (owner: Herron)
[08:32:12] (CR) Filippo Giunchedi: Icinga: add parents to mgmt devices (1 comment) [puppet] - https://gerrit.wikimedia.org/r/547767 (owner: Ayounsi)
[08:32:48] (PS2) Muehlenhoff: Add Icinga check for correct application of microcode mitigations [puppet] - https://gerrit.wikimedia.org/r/547747 (https://phabricator.wikimedia.org/T235250)
[08:33:36] (CR) Filippo Giunchedi: [C: +1] motd: add the config version to the MOTD [puppet] - https://gerrit.wikimedia.org/r/547506 (https://phabricator.wikimedia.org/T228854) (owner: Jbond)
[08:34:48] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/546290 (https://phabricator.wikimedia.org/T236505) (owner: Cwhite)
[08:34:51] Operations, LDAP-Access-Requests, Patch-For-Review: Add bawolff to either NDA or WMF ldap group - https://phabricator.wikimedia.org/T236636 (MoritzMuehlenhoff) Open→Resolved I fixed the data.yaml entry
[08:35:24] Operations, ops-esams: wipe backup-array1 - https://phabricator.wikimedia.org/T237041 (MoritzMuehlenhoff) p:Triage→Normal
[08:37:17] (CR) Filippo Giunchedi: [C: -1] "See inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/546195 (owner: Jbond)
[08:37:30] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:39:42] Operations, Maps: Alert SRE if Wikipedia Maps replication sync lag exceeds 1 day - https://phabricator.wikimedia.org/T237209 (Mathew.onipe) p:Triage→High
[08:40:03] Operations, Maps: Alert SRE if Wikipedia Maps replication sync lag exceeds 1 day - https://phabricator.wikimedia.org/T237209 (Mathew.onipe) p:High→Normal
[08:41:24] Operations, Maps: Alert SRE if Wikipedia Maps replication sync lag exceeds 1 day - https://phabricator.wikimedia.org/T237209 (Mathew.onipe) We have an alert for this. https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=icinga1001&service=Maps+-+OSM+synchronization+lag+-+eqiad
[08:41:39] Operations, Maps: Alert SRE if Wikipedia Maps replication sync lag exceeds 1 day - https://phabricator.wikimedia.org/T237209 (Mathew.onipe) p:Normal→High
[08:43:39] (CR) Filippo Giunchedi: "LGTM! See inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/547527 (owner: CDanis)
[08:44:05] Operations, serviceops, HHVM, MW-1.35-notes (1.35.0-wmf.3; 2019-10-22), and 2 others: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (jijiki)
[08:44:41] Operations, Maps: Alert SRE if Wikipedia Maps replication sync lag exceeds 1 day - https://phabricator.wikimedia.org/T237209 (Mathew.onipe) I'm closing this task as there are icinga alerts for osm sync
[08:45:05] Operations, Maps: Alert SRE if Wikipedia Maps replication sync lag exceeds 1 day - https://phabricator.wikimedia.org/T237209 (Mathew.onipe) Open→Invalid
[08:49:04] (PS1) Muehlenhoff: Add Raz Shuty to absented LDAP users [puppet] - https://gerrit.wikimedia.org/r/548140 (https://phabricator.wikimedia.org/T237118)
[08:49:14] (CR) Filippo Giunchedi: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/547519 (https://phabricator.wikimedia.org/T215904) (owner: Filippo Giunchedi)
[08:52:21] Operations, Maps: OSM Replication failed at eqiad and codfw - https://phabricator.wikimedia.org/T237228 (Mathew.onipe)
[08:52:23] (CR) Muehlenhoff: [C: +2] Add Raz Shuty to absented LDAP users [puppet] - https://gerrit.wikimedia.org/r/548140 (https://phabricator.wikimedia.org/T237118) (owner: Muehlenhoff)
[08:52:29] Operations, Maps: OSM Replication failed at eqiad and codfw - https://phabricator.wikimedia.org/T237228 (Mathew.onipe) p:Triage→High
[08:53:14] Operations, LDAP-Access-Requests, Security-Team, Patch-For-Review: Offboard Raz Shuty from various Wikimedia systems - https://phabricator.wikimedia.org/T237118 (MoritzMuehlenhoff)
[08:55:07] Operations, LDAP-Access-Requests, Security-Team, Patch-For-Review: Offboard Raz Shuty from various Wikimedia systems - https://phabricator.wikimedia.org/T237118 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff Ticked off the relevant bits, closing. Two remarks: * "Disable all OI...
[08:56:19] Operations, SRE-Access-Requests, Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (MoritzMuehlenhoff) Ack, just ping the task with the new SSH key when you received your new computer.
[09:00:35] (PS1) Elukey: eventlogging: allow sanitization script to run on all db records [puppet] - https://gerrit.wikimedia.org/r/548142 (https://phabricator.wikimedia.org/T236818)
[09:05:02] Operations, serviceops: Kubernetes workers frequent oom-killer in action - https://phabricator.wikimedia.org/T237198 (Joe) Open→Invalid a:Joe So: - kubernetes{1,2}00{5,6} are specialized nodes that only run kask for sessions, that's why you don't see ooms there. - The OOM killer doesn't only...
[09:11:16] (PS2) Ema: 8.0.5-1wm10: fix #4635 with upstream patch [debs/trafficserver] - https://gerrit.wikimedia.org/r/547736
[09:13:11] Operations: Add docker-engine to buster-wikimedia distribution - https://phabricator.wikimedia.org/T236947 (MoritzMuehlenhoff) You can mimic the existing thirdparty/kubeadm-k8s-docker.com component for wikimedia-stretch. And the binary name changed, it's docker-ce now.
[09:13:16] Operations: Add docker-engine to buster-wikimedia distribution - https://phabricator.wikimedia.org/T236947 (MoritzMuehlenhoff) p:Triage→Normal
[09:13:29] Operations, Traffic: ats-be on the text cluster is experiencing broken connections - https://phabricator.wikimedia.org/T236988 (MoritzMuehlenhoff) p:Triage→Normal
[09:13:53] (CR) Vgutierrez: [C: +1] 8.0.5-1wm10: fix #4635 with upstream patch [debs/trafficserver] - https://gerrit.wikimedia.org/r/547736 (owner: Ema)
[09:13:54] Operations: Ferm should log errors when failing to create all configured rules - https://phabricator.wikimedia.org/T237020 (MoritzMuehlenhoff) p:Triage→Normal
[09:14:17] Operations, serviceops: Kubernetes hosts raid check make facter fail - https://phabricator.wikimedia.org/T237197 (MoritzMuehlenhoff) p:Triage→Normal
[09:19:42] (CR) Volans: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/547714 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[09:20:33] (CR) Ema: [C: +1] "Ship it!" [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/548137 (owner: Vgutierrez)
[09:22:03] (CR) Vgutierrez: [C: +2] Release fifo-log-demux 0.6 [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/548137 (owner: Vgutierrez)
[09:24:25] (CR) Effie Mouzeli: [C: +2] mediawiki: remove hhvm from stop_cronjobs() [software/spicerack] - https://gerrit.wikimedia.org/r/547714 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[09:29:52] (Merged) jenkins-bot: mediawiki: remove hhvm from stop_cronjobs() [software/spicerack] - https://gerrit.wikimedia.org/r/547714 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[09:30:38] Operations, serviceops, HHVM, MW-1.35-notes (1.35.0-wmf.3; 2019-10-22), and 2 others: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (jijiki)
[09:31:55] Operations, Prod-Kubernetes, Release Pipeline, Release-Engineering-Team, serviceops: Allow testing of feature-flag-protected features in deployment-charts CI - https://phabricator.wikimedia.org/T236899 (Joe) Open→Resolved The CI is far from perfect, but it catches the most mundane iss...
[09:36:51] (CR) Gehel: [C: +1] "@elukey / @Ottomata: are you OK with this initial implementation? It looks good enough to me and given the various discussion, I think we'" [puppet] - https://gerrit.wikimedia.org/r/544989 (https://phabricator.wikimedia.org/T236180) (owner: EBernhardson)
[09:38:05] (CR) Elukey: "> @elukey / @Ottomata: are you OK with this initial implementation?" [puppet] - https://gerrit.wikimedia.org/r/544989 (https://phabricator.wikimedia.org/T236180) (owner: EBernhardson)
[09:41:16] elukey: thanks ^
[09:48:34] (CR) Jbond: [C: +1] "lgtmc" [puppet] - https://gerrit.wikimedia.org/r/547767 (owner: Ayounsi)
[09:52:07] (CR) Jbond: [C: +2] puppet_compiler: ensure all working dirs have correct owner [puppet] - https://gerrit.wikimedia.org/r/547493 (https://phabricator.wikimedia.org/T236986) (owner: Jbond)
[09:56:12] (CR) Arturo Borrero Gonzalez: "Nice work! thanks!" (7 comments) [software/tools-webservice] - https://gerrit.wikimedia.org/r/547676 (https://phabricator.wikimedia.org/T236202) (owner: Bstorm)
[09:57:36] Operations, DBA, serviceops, Goal, Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (jcrespo) Status at the moment: ` == jobs_with_all_failures (6) == an-master1002.eqiad.wmnet-Monthly-1st-Mon-production-hadoop-namen...
[09:58:00] Operations, Maps: Alert SRE if Wikipedia Maps replication sync lag exceeds 1 day - https://phabricator.wikimedia.org/T237209 (Mathew.onipe) see https://phabricator.wikimedia.org/T237228 for current OSM replication issues
[10:00:35] (CR) Elukey: "Created user and database search_airflow on an-coord1001, and updated the puppet private repo accordingly. No more blockers!" [puppet] - https://gerrit.wikimedia.org/r/544989 (https://phabricator.wikimedia.org/T236180) (owner: EBernhardson)
[10:01:25] gehel: done!
[10:04:07] (CR) Jbond: puppet git: add a descriptive config version (2 comments) [puppet] - https://gerrit.wikimedia.org/r/547505 (https://phabricator.wikimedia.org/T228854) (owner: Jbond)
[10:04:18] (PS9) Jbond: puppet git: add a descriptive config version [puppet] - https://gerrit.wikimedia.org/r/547505 (https://phabricator.wikimedia.org/T228854)
[10:05:23] Operations: Vega (bugzilla-static) is trying to backup a directory that doesn't exist - https://phabricator.wikimedia.org/T237233 (jcrespo)
[10:06:51] elukey: thanks ! I'll merge that when Erik is around
[10:07:30] ack!
[10:08:05] it would be nice to get some real info about airflow before meeting at all hands, to share thoughts/plans/etc..
[10:13:57] Operations, serviceops, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (Joe)
[10:15:35] Operations: vega, bromine (bugzilla-static) is trying to backup a directory that doesn't exist - https://phabricator.wikimedia.org/T237233 (jcrespo)
[10:21:34] (CR) Arturo Borrero Gonzalez: ceph: add k8s manifests for ceph deployment using rook (4 comments) [puppet] - https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) (owner: Jhedden)
[10:23:06] (PS1) Jbond: puppet_compiler: update permissions [puppet] - https://gerrit.wikimedia.org/r/548223
[10:23:08] elukey: what kind of "real info" are you looking for?
[10:23:10] Operations, hardware-requests, Performance-Team (Radar): eqiad: (1) misc single cpu server allocation for performance browser testing - https://phabricator.wikimedia.org/T204589 (faidon) Open→Stalled Update per IRC conversation with @Gilles: this is still needed, but is stalled and currently...
[10:24:17] (CR) Jbond: [C: +2] puppet_compiler: update permissions [puppet] - https://gerrit.wikimedia.org/r/548223 (owner: Jbond)
[10:24:29] Operations, serviceops, Kubernetes: Collect metrics from envoy where it is enabled on k8s - https://phabricator.wikimedia.org/T237234 (Joe)
[10:25:46] gehel: any, the tool looks really awesome but I am pretty sure that we'll find bottlenecks etc..
[10:26:17] elukey: no! it is a perfect tool, without any issue whatsoever!
[10:26:46] We'll get some experience by then, we can definitely have a chat and share what we'll learn
[10:27:27] gehel: all shiny!
[10:29:26] Operations, Packaging, serviceops: Build and upload envoy 1.12.0 package. - https://phabricator.wikimedia.org/T237235 (Joe)
[10:30:04] jan_drewniak: My dear minions, it's time we take the moon! Just kidding. Time for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191104T1030).
[10:32:52] (CR) Jbond: "updated thanks" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/547568 (https://phabricator.wikimedia.org/T221083) (owner: Jbond)
[10:33:30] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/547735 (owner: Muehlenhoff)
[10:34:11] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/547738 (https://phabricator.wikimedia.org/T235162) (owner: Muehlenhoff)
[10:36:11] (PS1) Kosta Harlan: [beta] Working configuration for newcomer tasks on cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/548224 (https://phabricator.wikimedia.org/T236823)
[10:42:10] Operations: Important nagios-nrpe-server errors not showing up in unit journal - https://phabricator.wikimedia.org/T237236 (ema)
[10:48:47] (CR) Kosta Harlan: [C: -1] Initial GrowthExperiments labs configuration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/547897 (https://phabricator.wikimedia.org/T237167) (owner: Urbanecm)
[10:48:51] (Abandoned) Kosta Harlan: [beta] Working configuration for newcomer tasks on cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/548224 (https://phabricator.wikimedia.org/T236823) (owner: Kosta Harlan)
[10:53:23] (PS3) Urbanecm: Initial GrowthExperiments labs configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/547897 (https://phabricator.wikimedia.org/T237167)
[10:53:39] (CR) Urbanecm: Initial GrowthExperiments labs configuration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/547897 (https://phabricator.wikimedia.org/T237167) (owner: Urbanecm)
[10:55:21] (PS3) Jbond: backup::host: use fqdn_rand_string for password generation [puppet] - https://gerrit.wikimedia.org/r/547569 (https://phabricator.wikimedia.org/T221083)
[10:55:34] (CR) Kosta Harlan: [C: +1] Initial GrowthExperiments labs configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/547897 (https://phabricator.wikimedia.org/T237167) (owner: Urbanecm)
[10:55:58] Operations, DBA, serviceops, Goal, Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (jcrespo) * `an-master1002.eqiad.wmnet-Monthly-1st-Mon-production-hadoop-namenode-backup`: connectivity issue bacula client: ` Nov 04 0...
[10:57:07] (CR) Jbond: "> Patch Set 2: Code-Review-1" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/547569 (https://phabricator.wikimedia.org/T221083) (owner: Jbond)
[10:59:56] Operations, netops: Update router ACLs for newer bacula hosts - https://phabricator.wikimedia.org/T237016 (jcrespo) This is currently affecting backups from analytics1029 and an-master1002 FYI T236406#5630631 CC #Analytics @Ottomata @elukey .
[11:01:16] kostajh: do you think I can merge the labs config for Growth now, or should it wait on sth?
[11:01:22] Urbanecm: please go ahead
[11:01:35] (CR) Urbanecm: [C: +2] "Per Kosta, labs only" [mediawiki-config] - https://gerrit.wikimedia.org/r/547897 (https://phabricator.wikimedia.org/T237167) (owner: Urbanecm)
[11:01:38] kostajh: sure!
[11:02:28] (Merged) jenkins-bot: Initial GrowthExperiments labs configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/547897 (https://phabricator.wikimedia.org/T237167) (owner: Urbanecm)
[11:04:38] Operations, serviceops: Upgrade to PHP 7.2.24 - https://phabricator.wikimedia.org/T237239 (MoritzMuehlenhoff)
[11:06:37] Operations, DBA, serviceops, Goal, Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (jcrespo) @elukey @Ottomata Re: matomo1001, is there a reason not to have daily incrementals? If the reason is that it generates a full...
[11:07:34] Operations, DBA, serviceops, Goal, Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (jcrespo)
[11:07:57] Operations, DBA, serviceops, Goal, Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (jcrespo)
[11:08:38] !log uploaded PHP 7.2.24 to apt.wikimedia.org stretch-wikimedia/component/php72 (T237239)
[11:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:43] T237239: Upgrade to PHP 7.2.24 - https://phabricator.wikimedia.org/T237239
[11:12:19] Urbanecm: looks good, thx
[11:12:56] yw kostajh. I'm wondering why tasks types aren't checkable through...
[11:12:59] https://usercontent.irccloud-cdn.com/file/ZvM0Kpin/image.png
[11:13:45] Urbanecm: your browser may have cached the response from the API before the config was updated. Try a hard refresh in your browser
[11:14:26] kostajh: ah, now it works. Thanks!
[11:15:03] np
[11:18:13] Operations, Traffic: Network unreachable after network-online.target is brought up - https://phabricator.wikimedia.org/T237243 (ema)
[11:18:23] Operations, User-jbond: Manage apt sources via puppet? - https://phabricator.wikimedia.org/T158562 (jbond)
[11:23:33] (CR) Jcrespo: "Ignore comment." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/547568 (https://phabricator.wikimedia.org/T221083) (owner: Jbond)
[11:26:16] (PS1) Elukey: Update Bacula configs for analytics-in filters [homer/public] - https://gerrit.wikimedia.org/r/548228 (https://phabricator.wikimedia.org/T237016)
[11:27:10] (PS1) Muehlenhoff: Add comment to clarify how contrib is enabled [puppet] - https://gerrit.wikimedia.org/r/548229
[11:29:36] Operations, netops, Patch-For-Review: Update router ACLs for newer bacula hosts - https://phabricator.wikimedia.org/T237016 (elukey) Thanks a lot Jaime! @akosiaris if the change looks good I can update cr1/cr2 manually (or I can use homer if already available!)
[11:30:42] (CR) Awight: Install a cron job to produce Reference Previews metrics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/547715 (https://phabricator.wikimedia.org/T233108) (owner: Awight)
[11:32:16] (PS1) Jbond: d-i: add contrib component to d-i configuration [puppet] - https://gerrit.wikimedia.org/r/548230 (https://phabricator.wikimedia.org/T158562)
[11:32:33] (PS3) Awight: Install a cron job to produce Reference Previews metrics [puppet] - https://gerrit.wikimedia.org/r/547715 (https://phabricator.wikimedia.org/T233108)
[11:33:45] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/548231 (https://phabricator.wikimedia.org/T128546)
[11:34:20] If nobody minds, I'm gonna do a quick portal update now since my calendar was messed up (because daylight saving time)
[11:34:42] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/548231 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[11:35:07] (PS4) Filippo Giunchedi: Introduce Elastic 7 support [puppet] - https://gerrit.wikimedia.org/r/545867 (https://phabricator.wikimedia.org/T234854)
[11:35:24] (PS2) Jbond: d-i: add contrib component to d-i configuration [puppet] - https://gerrit.wikimedia.org/r/548230 (https://phabricator.wikimedia.org/T158562)
[11:35:27] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/548231 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[11:36:29] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/548229 (owner: Muehlenhoff)
[11:36:41] (CR) Filippo Giunchedi: Introduce Elastic 7 support (2 comments) [puppet] - https://gerrit.wikimedia.org/r/545867 (https://phabricator.wikimedia.org/T234854) (owner: Filippo Giunchedi)
[11:38:00] (CR) Muehlenhoff: [C: +2] Add comment to clarify how contrib is enabled [puppet] - https://gerrit.wikimedia.org/r/548229 (owner: Muehlenhoff)
[11:38:11] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:548231| Bumping portals to master (T128546)]] (duration: 01m 03s)
[11:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:16] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[11:39:04] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:548231| Bumping portals to master (T128546)]] (duration: 00m 52s)
[11:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:18] (PS4) Jbond: motd: add the config version to the MOTD [puppet] - https://gerrit.wikimedia.org/r/547506 (https://phabricator.wikimedia.org/T228854)
[11:40:44] (CR) Elukey: "Nit about the commit msg: let's use report updater job instead of cron, since behind the scenes we use a systemd timer :)" [puppet] - https://gerrit.wikimedia.org/r/547715 (https://phabricator.wikimedia.org/T233108) (owner: Awight)
[11:42:03] Operations, DBA, serviceops, Goal, Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (elukey) >>! In T236406#5630677, @jcrespo wrote: > @elukey @Ottomata Re: matomo1001, is there a reason not to have daily incrementals?...
[11:42:10] (03PS1) 10Muehlenhoff: Remove buster-test d-i config [puppet] - 10https://gerrit.wikimedia.org/r/548232 [11:44:01] (03CR) 10Jcrespo: [C: 03+1] "From a logical point of view, this looks correct to me, but I am unaware of what is the workflow/approval process of it (netops side of th" [homer/public] - 10https://gerrit.wikimedia.org/r/548228 (https://phabricator.wikimedia.org/T237016) (owner: 10Elukey) [11:46:02] (03CR) 10Jbond: "> Patch Set 8:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/547505 (https://phabricator.wikimedia.org/T228854) (owner: 10Jbond) [11:47:16] (03PS4) 10Awight: report updater job: produce Reference Previews metrics [puppet] - 10https://gerrit.wikimedia.org/r/547715 (https://phabricator.wikimedia.org/T233108) [11:47:27] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) @elukey If this helps, I can try generating manually an incremental, for a better informed decision about storage size (it sh... 
[11:47:39] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/547747 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [11:48:07] (03CR) 10Muehlenhoff: [C: 03+2] Remove buster-test d-i config [puppet] - 10https://gerrit.wikimedia.org/r/548232 (owner: 10Muehlenhoff) [11:49:36] (03PS7) 10Jbond: check_puppetrun: alert critical after 24 hours [puppet] - 10https://gerrit.wikimedia.org/r/546195 [11:49:49] (03CR) 10Jbond: check_puppetrun: alert critical after 24 hours (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/546195 (owner: 10Jbond) [11:50:59] (03CR) 10Jbond: [C: 03+2] puppetmaster::frontend serve volatile uri from the locale site frontend [puppet] - 10https://gerrit.wikimedia.org/r/542922 (https://phabricator.wikimedia.org/T235427) (owner: 10Jbond) [11:53:25] (03PS3) 10Revi: Enable partial blocks on kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546913 (https://phabricator.wikimedia.org/T236752) [11:54:25] 10Operations, 10Puppet, 10Traffic, 10Patch-For-Review, 10User-jbond: Serve volatile uri from local site - https://phabricator.wikimedia.org/T235427 (10jbond) The volatile URI has been moved both puppetmasters [11:55:27] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10elukey) I am all for simplifying and standardizing confs, so no opposition about incremental. Only one question - what would it change... [11:56:43] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) ` # check_bacula.py matomo1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups 2019-10-30 02:05:43: type: F, status: T, b... 
[11:58:03] 10Operations, 10Traffic: Network unreachable after network-online.target is brought up - https://phabricator.wikimedia.org/T237243 (10MoritzMuehlenhoff) The lldpd unit only depends on network.target, but network-online.target, per systemd-special(7) lldpd.service only the latter will postpone startup until the... [11:58:51] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) >>! In T236406#5630885, @elukey wrote: > I am all for simplifying and standardizing confs, so no opposition about incremental... [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191104T1200). [12:00:04] revi, Ammarpad, Ammarpad, and awight: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:07] hihihihi [12:00:12] I can SWAT today! [12:00:13] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) Let me find where they are configured and I will send you a patch- later feel free to ping me on IRC and I will show you how... 
[12:00:46] (03CR) 10Urbanecm: [C: 03+2] Enable partial blocks on kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546913 (https://phabricator.wikimedia.org/T236752) (owner: 10Revi) [12:00:54] !log upgrading mw1261 to PHP 7.2.24 (T237239) [12:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:58] T237239: Upgrade to PHP 7.2.24 - https://phabricator.wikimedia.org/T237239 [12:01:16] > we take the moon [12:01:22] You can take the moon @ my home [12:01:59] (03Merged) 10jenkins-bot: Enable partial blocks on kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546913 (https://phabricator.wikimedia.org/T236752) (owner: 10Revi) [12:02:36] revi: please test your patch at mwdebug1001 [12:03:17] ㅁ차 [12:03:18] ack* [12:04:12] Urbanecm: +LGTM [12:04:18] revi: syncing [12:04:29] (03CR) 10Urbanecm: [C: 03+2] Update logo for zh-classical Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547845 (https://phabricator.wikimedia.org/T236905) (owner: 10Ammarpad) [12:04:40] I personally don't like it while I am deploying this lol [12:05:00] but that's what community wants so /shrug [12:05:11] revi: you don't like partial blocks? :-) [12:05:22] (03PS3) 10Urbanecm: Add localized Minerva wordmark for Sindhi Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547061 (https://phabricator.wikimedia.org/T200870) (owner: 10Ammarpad) [12:05:27] (03Merged) 10jenkins-bot: Update logo for zh-classical Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547845 (https://phabricator.wikimedia.org/T236905) (owner: 10Ammarpad) [12:05:28] let me cite my userpage.... [12:05:33] > Partial block is a nonsense. You can't be civil in one place and act like bullshit and be 'partial blocked' on the other side of a single project. You get a sitewide block from me or no block at all. 
[12:05:49] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: c92a13c: Enable partial blocks on kowiki (T236752) (duration: 00m 54s) [12:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:54] T236752: Deploy partial block for kowiki - https://phabricator.wikimedia.org/T236752 [12:08:05] (03CR) 10Urbanecm: [C: 03+2] Add localized Minerva wordmark for Sindhi Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547061 (https://phabricator.wikimedia.org/T200870) (owner: 10Ammarpad) [12:08:22] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: SWAT: a6d64b1: Update logo for zh-classical Wikipedia (T236905) (duration: 00m 53s) [12:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:26] T236905: Change the lzh Wikipedia logo - https://phabricator.wikimedia.org/T236905 [12:09:06] (03Merged) 10jenkins-bot: Add localized Minerva wordmark for Sindhi Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547061 (https://phabricator.wikimedia.org/T200870) (owner: 10Ammarpad) [12:11:00] (03PS2) 10Urbanecm: Enable Book Referencing on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547484 (https://phabricator.wikimedia.org/T236894) (owner: 10Awight) [12:11:09] (03CR) 10Urbanecm: [C: 03+2] "noop for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547484 (https://phabricator.wikimedia.org/T236894) (owner: 10Awight) [12:11:20] !log urbanecm@deploy1001 Synchronized static/images/mobile/copyright/: SWAT: 7c1c64c: Add localized Minerva wordmark for Sindhi Wikipedia (T200870; 1/2) (duration: 00m 53s) [12:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:25] T200870: Deploy localized wordmark of Wikipedia on mobile site of sdwiki - https://phabricator.wikimedia.org/T200870 [12:11:27] Urbanecm: Thanks! 
[12:11:30] awight: yw [12:11:53] (03Merged) 10jenkins-bot: Enable Book Referencing on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547484 (https://phabricator.wikimedia.org/T236894) (owner: 10Awight) [12:12:54] !log Purge https://en.wikipedia.org/static/images/project-logos/zh_classicalwiki* (T236905) [12:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:00] (03PS1) 10Jcrespo: WIP: Move matomo to a Montly full schedule, like production dbs [puppet] - 10https://gerrit.wikimedia.org/r/548236 (https://phabricator.wikimedia.org/T236406) [12:13:41] (03CR) 10Jcrespo: "This is just a draft for now for me so I don't have to find the file again." [puppet] - 10https://gerrit.wikimedia.org/r/548236 (https://phabricator.wikimedia.org/T236406) (owner: 10Jcrespo) [12:14:39] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 7c1c64c: Add localized Minerva wordmark for Sindhi Wikipedia (T200870; 2/2) (duration: 00m 52s) [12:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:08] (03CR) 10Alexandros Kosiaris: [C: 04-1] Update Bacula configs for analytics-in filters (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/548228 (https://phabricator.wikimedia.org/T237016) (owner: 10Elukey) [12:19:43] 10Operations, 10netops, 10Patch-For-Review: Update router ACLs for newer bacula hosts - https://phabricator.wikimedia.org/T237016 (10akosiaris) Let a minor comment, namely let's keep helium around for a bit more. > (or I can use homer if already available!) I don't know tbh. I filed the task explicitly bec... 
[12:21:18] !log EU SWAT done [12:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:41] (03PS2) 10Muehlenhoff: Enable idp2001 as second identity provider [puppet] - 10https://gerrit.wikimedia.org/r/547735 [12:28:48] (03PS4) 10Jbond: puppetmnasters: use localcacert setting for CA file in apache [puppet] - 10https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332) [12:28:50] (03PS1) 10Jbond: puppet_ca: update puppet ca with a new certificate valid for 10 years [puppet] - 10https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T236277) [12:30:05] (03CR) 10Muehlenhoff: [C: 03+2] Enable idp2001 as second identity provider [puppet] - 10https://gerrit.wikimedia.org/r/547735 (owner: 10Muehlenhoff) [12:31:21] (03CR) 10Jbond: "Please scrutinise heavily and add any additional reviewers you think useful" [puppet] - 10https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond) [12:33:35] 10Operations, 10netops, 10Patch-For-Review: Update router ACLs for newer bacula hosts - https://phabricator.wikimedia.org/T237016 (10Volans) While you wait for @ayounsi I can maybe fill some gap. Homer is already a thing and Arzhel is using and testing it, but it doesn't have yet proper documentation for a w... [12:39:36] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:41:35] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) @elukey I am sorry, after looking closely to the policies, I mistakenly assumed the schedule was wrong. I will abandon patchi... 
[12:41:56] (03Abandoned) 10Jcrespo: WIP: Move matomo to a Montly full schedule, like production dbs [puppet] - 10https://gerrit.wikimedia.org/r/548236 (https://phabricator.wikimedia.org/T236406) (owner: 10Jcrespo) [12:46:48] (03PS1) 10Ladsgroup: Rotate Amir's SSH key [puppet] - 10https://gerrit.wikimedia.org/r/548242 [12:46:54] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10JAllemandou) Thanks for the fast loop over format @Nuria and @BBlack. Indeed having a single field named `TLS` formatted as describe... [12:51:05] (03PS1) 10Jcrespo: bacula: Fix hardcoded thresholds based on configured schedules [puppet] - 10https://gerrit.wikimedia.org/r/548244 (https://phabricator.wikimedia.org/T236406) [12:53:10] (03PS2) 10Jcrespo: bacula: Fix hardcoded thresholds based on configured schedules [puppet] - 10https://gerrit.wikimedia.org/r/548244 (https://phabricator.wikimedia.org/T236406) [12:55:56] (03PS2) 10Muehlenhoff: Add library hint for libpcap [puppet] - 10https://gerrit.wikimedia.org/r/547744 [12:59:13] (03CR) 10Alexandros Kosiaris: [C: 03+1] bacula: Fix hardcoded thresholds based on configured schedules [puppet] - 10https://gerrit.wikimedia.org/r/548244 (https://phabricator.wikimedia.org/T236406) (owner: 10Jcrespo) [13:00:05] Amir1: Time to snap out of that daydream and deploy Creating Mon Wikipedia. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191104T1300). 
[13:01:09] o/ [13:02:12] 10Operations, 10Traffic, 10netops: Network unreachable after network-online.target is brought up - https://phabricator.wikimedia.org/T237243 (10ema) [13:02:29] (03PS5) 10Ladsgroup: Initial configuration for mnwwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543842 (https://phabricator.wikimedia.org/T235739) (owner: 10Jon Harald Søby) [13:03:49] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for libpcap [puppet] - 10https://gerrit.wikimedia.org/r/547744 (owner: 10Muehlenhoff) [13:03:51] (03CR) 10Ladsgroup: [C: 03+2] Initial configuration for mnwwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543842 (https://phabricator.wikimedia.org/T235739) (owner: 10Jon Harald Søby) [13:04:29] (03PS1) 10Ema: cache: reimage cp5012 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/548249 (https://phabricator.wikimedia.org/T227432) [13:04:38] (03Merged) 10jenkins-bot: Initial configuration for mnwwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543842 (https://phabricator.wikimedia.org/T235739) (owner: 10Jon Harald Søby) [13:05:12] (03PS2) 10Muehlenhoff: Enable new adduser base class on spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/547738 (https://phabricator.wikimedia.org/T235162) [13:06:03] !log depool cp5012 and reimage as text_ats T227432 [13:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:08] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [13:06:18] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10jbond) We also need to consider `/usr/local/share/ca-certificates/Puppet_Internal_CA.crt` which is linked to `/etc/ssl/certs/Puppet_Internal_CA.pem`. This cert is used for... 
[13:06:20] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:13] !log ladsgroup@deploy1001 Synchronized dblists: (no justification provided) (duration: 00m 53s) [13:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:26] (03CR) 10Alexandros Kosiaris: "> Alex, but the intention was in the right direction, AFAIU? -it needs a private seed or other private part, somehow, so it is not public-" [puppet] - 10https://gerrit.wikimedia.org/r/547569 (https://phabricator.wikimedia.org/T221083) (owner: 10Jbond) [13:07:44] (03CR) 10Jcrespo: [C: 03+2] bacula: Fix hardcoded thresholds based on configured schedules [puppet] - 10https://gerrit.wikimedia.org/r/548244 (https://phabricator.wikimedia.org/T236406) (owner: 10Jcrespo) [13:07:55] (03PS3) 10Jcrespo: bacula: Fix hardcoded thresholds based on configured schedules [puppet] - 10https://gerrit.wikimedia.org/r/548244 (https://phabricator.wikimedia.org/T236406) [13:08:51] (03CR) 10Ema: [C: 03+2] cache: reimage cp5012 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/548249 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [13:08:58] (03PS1) 10Ladsgroup: Add mnwwiki to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548251 (https://phabricator.wikimedia.org/T235739) [13:09:15] (03CR) 10Ladsgroup: [C: 03+2] Add mnwwiki to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548251 (https://phabricator.wikimedia.org/T235739) (owner: 10Ladsgroup) [13:10:01] (03Merged) 10jenkins-bot: Add mnwwiki to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548251 (https://phabricator.wikimedia.org/T235739) (owner: 10Ladsgroup) [13:11:54] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10Volans) One thing to take into 
account: we're using certificates signed by the Puppet CA in many places: - the puppet client certificate exposed via puppet code, se... [13:12:00] !log ladsgroup@deploy1001 rebuilt and synchronized wikiversions files: T235739 [13:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:04] T235739: Create Mon Wikipedia - https://phabricator.wikimedia.org/T235739 [13:12:59] Is anyone else deploying on puppetmaster? I may have locked the pipeline not on purpose [13:13:02] !log ladsgroup@deploy1001 Synchronized multiversion/MWMultiVersion.php: T235739 (duration: 00m 52s) [13:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:15] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T235739 (duration: 00m 53s) [13:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:25] (03CR) 10Volans: [C: 04-1] "Commented on task as I think we need to find an agreement on the plan of actions there before merging this patch IMHO." [puppet] - 10https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond) [13:15:28] !log ladsgroup@deploy1001 Synchronized static/images/project-logos/: T235739 (duration: 00m 53s) [13:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:42] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10jcrespo) > - some services don't have a way to reload them and need restart. Notably MySQL has added this capability only in version 8 and Mariadb in version 10.4... [13:16:32] jynus: ok to puppet-merge your change? [13:16:39] I am trying [13:16:41] !log ladsgroup@deploy1001 Synchronized langlist: T235739 (duration: 00m 52s) [13:16:43] but I got locked [13:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:51] can you? 
[13:16:53] (03CR) 10CDanis: [C: 03+2] cumin: aliases: cache::text_ats is a thing now [puppet] - 10https://gerrit.wikimedia.org/r/547800 (https://phabricator.wikimedia.org/T227432) (owner: 10CDanis) [13:17:07] ema: I get a "E: failed to lock, another puppet-merge running on this host?" [13:17:14] jynus: did the original puppet-merge error out? [13:17:16] jynus, cdanis: puppet-merged your changes too [13:17:21] ty ema [13:17:21] (03PS1) 10Ladsgroup: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548253 [13:17:25] (03CR) 10Ladsgroup: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548253 (owner: 10Ladsgroup) [13:17:45] cdanis: same error message [13:17:59] the rebase-if-necc change to the puppet repo is nice [13:18:34] it works now [13:18:36] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548253 (owner: 10Ladsgroup) [13:18:47] so maybe ema was just idling on yes/no :-D [13:18:49] (03CR) 10Elukey: Update Bacula configs for analytics-in filters (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/548228 (https://phabricator.wikimedia.org/T237016) (owner: 10Elukey) [13:19:16] (or someone else) [13:19:21] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp5012.eqsin.wmnet'] ` The log can be found in `/var/log/wm... 
[13:19:39] !log ladsgroup@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 39s) [13:19:40] (03CR) 10Muehlenhoff: [C: 03+2] Enable new adduser base class on spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/547738 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [13:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] backup::host: use fqdn_rand_string for password generation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547569 (https://phabricator.wikimedia.org/T221083) (owner: 10Jbond) [13:20:13] !log Creating Mon Wikipedia is done T235739 [13:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:18] T235739: Create Mon Wikipedia - https://phabricator.wikimedia.org/T235739 [13:20:25] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10jbond) >>! In T236277#5631150, @jbond wrote: > We also need to consider `/usr/local/share/ca-certificates/Puppet_Internal_CA.crt` which is linked to `/etc/ssl/certs... [13:20:55] (03PS2) 10Elukey: Update Bacula configs for analytics-in filters [homer/public] - 10https://gerrit.wikimedia.org/r/548228 (https://phabricator.wikimedia.org/T237016) [13:23:24] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor comment, otherwise +1." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547569 (https://phabricator.wikimedia.org/T221083) (owner: 10Jbond) [13:25:02] (03CR) 10Ema: [C: 03+2] 8.0.5-1wm10: fix #4635 with upstream patch [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/547736 (owner: 10Ema) [13:25:13] (03PS3) 10Jbond: backup::host: refactor [puppet] - 10https://gerrit.wikimedia.org/r/547568 (https://phabricator.wikimedia.org/T221083) [13:25:41] (03PS4) 10Jbond: backup::host: use fqdn_rand_string for password generation [puppet] - 10https://gerrit.wikimedia.org/r/547569 (https://phabricator.wikimedia.org/T221083) [13:25:55] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10jcrespo) > The major pain point here is likely MySQL I am relatively sure that I didn't enabled strict cert checking because I knew this day would arrive (requires... [13:26:10] (03CR) 10Jbond: "@Alex I saw a -1 for this in #w-ops but don't see it here, am i missing something?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547569 (https://phabricator.wikimedia.org/T221083) (owner: 10Jbond) [13:26:34] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for mnwwiki - https://phabricator.wikimedia.org/T235743 (10Urbanecm) >>! In T235743#5583420, @Marostegui wrote: > Let us know when the database is created so we can sanitize its tables and hand over to WMC... 
[13:26:59] (03CR) 10Jbond: [C: 03+2] puppetdb6: update config to use the new puppetdb6 servers [puppet] - 10https://gerrit.wikimedia.org/r/547218 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [13:28:02] (03PS1) 10Jbond: Revert "puppetdb6: update config to use the new puppetdb6 servers" [puppet] - 10https://gerrit.wikimedia.org/r/548256 [13:28:45] !log update production puppetmasters to use new puppetdb servers [13:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:02] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10akosiaris) **IMPORTANT**: The puppet CA cert (and correspondingly key), is used as a "master" (a failsafe in case the actual host key is not around) key for bacula... [13:29:24] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/546195 (owner: 10Jbond) [13:31:08] (03CR) 10Alexandros Kosiaris: [C: 03+1] Update Bacula configs for analytics-in filters [homer/public] - 10https://gerrit.wikimedia.org/r/548228 (https://phabricator.wikimedia.org/T237016) (owner: 10Elukey) [13:31:19] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for mnwwiki - https://phabricator.wikimedia.org/T235743 (10jcrespo) a:03jcrespo Thanks, will apply the filtering and then assign to cloud for exposing it on wikireplicas. [13:31:47] (03PS1) 10Arturo Borrero Gonzalez: wmnet: cleanup unused labsdb1002 entries [dns] - 10https://gerrit.wikimedia.org/r/548257 (https://phabricator.wikimedia.org/T146455) [13:32:59] (03CR) 10Arturo Borrero Gonzalez: "It's not clear to me what to do with mgmt and asset entries." 
[dns] - 10https://gerrit.wikimedia.org/r/548257 (https://phabricator.wikimedia.org/T146455) (owner: 10Arturo Borrero Gonzalez) [13:33:22] (03PS1) 10Ema: ATS: double log_buffer_size and max_line_size [puppet] - 10https://gerrit.wikimedia.org/r/548258 [13:33:55] (03CR) 10Jbond: [C: 03+2] puppetmaster1003: remove config for canary puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/547237 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [13:34:28] 10Operations, 10LDAP-Access-Requests: Remove user "jeroendedauw" from wmde LDAP group - https://phabricator.wikimedia.org/T237254 (10WMDE-leszek) [13:37:04] (03PS1) 10Kosta Harlan: GrowthExperiments: Add suggested edits links and remote config title [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548260 (https://phabricator.wikimedia.org/T235042) [13:38:00] !log update bacula terms on analytics-in{4,6} filters on cr{1,2}-eqiad - T237016 [13:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:04] T237016: Update router ACLs for newer bacula hosts - https://phabricator.wikimedia.org/T237016 [13:38:33] 10Operations, 10netops, 10Patch-For-Review: Update router ACLs for newer bacula hosts - https://phabricator.wikimedia.org/T237016 (10elukey) To keep archives happy: cr1-eqiad: ` elukey@re0.cr1-eqiad# show | compare [edit firewall family inet filter analytics-in4 term bacula from destination-address]... [13:38:49] (03CR) 10Elukey: [V: 03+2 C: 03+2] Update Bacula configs for analytics-in filters [homer/public] - 10https://gerrit.wikimedia.org/r/548228 (https://phabricator.wikimedia.org/T237016) (owner: 10Elukey) [13:40:06] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [13:40:07] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) The matomo false alert is now correctly gone, only the 6 issues due to the 3 tickets above (T236406#5630631) left: ` All fail... [13:40:21] 10Operations, 10serviceops: Kubernetes workers frequent oom-killer in action - https://phabricator.wikimedia.org/T237198 (10akosiaris) As @Joe said, that's expected. It's how misbehaving services are killed in order to recover. Here's also a breakdown in case anyone is interested ` kubectl get pods --all-name... [13:41:17] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10elukey) Nice thanks! Just pushed the new rules to the routers, so in theory an-master1002 and analytics1029 should go away now! Let me... [13:41:25] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10jbond) >>! In T236277#5631271, @akosiaris wrote: > **IMPORTANT**: The puppet CA cert (and correspondingly key), is used as a "master" (a failsafe in case the actual... 
[13:42:41] (03Abandoned) 10Volans: Cumin: add cache generator for emergencies [puppet] - 10https://gerrit.wikimedia.org/r/362377 (https://phabricator.wikimedia.org/T169304) (owner: 10Volans) [13:43:19] (03PS1) 10Arturo Borrero Gonzalez: templates/: replace labs with cloud for some vlans [dns] - 10https://gerrit.wikimedia.org/r/548261 [13:43:48] 10Operations, 10netops, 10Patch-For-Review: Update router ACLs for newer bacula hosts - https://phabricator.wikimedia.org/T237016 (10akosiaris) 05Open→03Resolved a:03akosiaris ` akosiaris@an-master1002:~$ telnet -4 backup1001.eqiad.wmnet 9103 Trying 10.64.48.36... Connected to backup1001.eqiad.wmnet. E... [13:43:52] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10akosiaris) [13:46:06] 10Operations, 10netops, 10Patch-For-Review: Update router ACLs for newer bacula hosts - https://phabricator.wikimedia.org/T237016 (10jcrespo) For the record, the other host affected: ` analytics1029:~$ telnet -4 backup1001.eqiad.wmnet 9103 Trying 10.64.48.36... Connected to backup1001.eqiad.wmnet ` [13:47:01] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10akosiaris) >>! In T236277#5631321, @jbond wrote: >>>! In T236277#5631271, @akosiaris wrote: >> **IMPORTANT**: The puppet CA cert (and correspondingly key), is used... 
[13:47:15] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [13:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:20] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:16] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) Forcing a manual run on the 2 above for validation. [13:53:23] !log upload trafficserver 8.0.5-1wm10 to stretch-wikimedia [13:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:11] (03CR) 10Jbond: "> Patch Set 4:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547569 (https://phabricator.wikimedia.org/T221083) (owner: 10Jbond) [13:56:41] (03CR) 10Jbond: [C: 03+2] check_puppetrun: alert critical after 24 hours [puppet] - 10https://gerrit.wikimedia.org/r/546195 (owner: 10Jbond) [14:02:09] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Andrew) Huh, it must've been something that caught up overnight. Thanks! [14:02:21] (03PS1) 10Jbond: check_puppetrun: correct typo is_a? vs is_a [puppet] - 10https://gerrit.wikimedia.org/r/548263 [14:04:07] 10Operations, 10SRE-tools, 10Goal: Expand Spicerack library and SRE Cookbooks - Q2 2018-19 Goal - https://phabricator.wikimedia.org/T205867 (10Volans) All the wmf-* scripts but the reimage ones were migrated to cookbooks. Most of the modules and functionalities in the library have been added to spicerack. I'... [14:04:35] (03CR) 10Jbond: [C: 03+2] check_puppetrun: correct typo is_a? 
vs is_a [puppet] - 10https://gerrit.wikimedia.org/r/548263 (owner: 10Jbond) [14:05:08] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Papaul) yes DNS update on ns0 [14:05:16] (03PS12) 10MarcoAurelio: Initial configuration for ge.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) [14:06:42] (03PS1) 10Elukey: role::prometheus::beta: add memcached metrics [puppet] - 10https://gerrit.wikimedia.org/r/548264 (https://phabricator.wikimedia.org/T213089) [14:07:06] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp5012.eqsin.wmnet'] ` The log can be found in `/var/log/wm... [14:07:24] (03PS2) 10MarcoAurelio: Enable DNS blacklist for es.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547863 (https://phabricator.wikimedia.org/T237151) [14:08:11] moritzm: hi, who should I ask to have an amendment made to an LDAP membership today? [14:11:35] 10Operations, 10LDAP-Access-Requests, 10Security-Team: Offboard Raz Shuty from various Wikimedia systems - https://phabricator.wikimedia.org/T237118 (10MarcoAurelio) @Aklapper The account @RazShuty is disabled but still holds Phabricator Administrator permissions. Shall we have those removed as well? Thanks. 
[14:14:29] hauskater: best to open a task and tag it "LDAP-Access-Requests" [14:14:43] moritzm: aha, so done :) [14:14:56] i guess you'll get to it when the time comes then [14:15:03] thanks moritzm [14:15:17] ack, I'll look into it in a bit [14:18:26] (03PS1) 10Muehlenhoff: Bump system user UID range in enforce-users-groups.sh [puppet] - 10https://gerrit.wikimedia.org/r/548269 (https://phabricator.wikimedia.org/T235162) [14:23:10] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10CDanis) In terms of identifying services that use keys issued by the puppet CA -- is it wrong to think that the following would be a complete list? * keys created u... [14:24:43] (03PS1) 10Muehlenhoff: Removed jeroendedauw from wmde group [puppet] - 10https://gerrit.wikimedia.org/r/548270 (https://phabricator.wikimedia.org/T237254) [14:25:56] (03CR) 10Jhedden: ceph: add k8s manifests for ceph deployment using rook (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) (owner: 10Jhedden) [14:27:42] (03CR) 10Muehlenhoff: [C: 03+2] Removed jeroendedauw from wmde group [puppet] - 10https://gerrit.wikimedia.org/r/548270 (https://phabricator.wikimedia.org/T237254) (owner: 10Muehlenhoff) [14:28:49] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Remove user "jeroendedauw" from wmde LDAP group - https://phabricator.wikimedia.org/T237254 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff I've removed jeroendedauw from the wmde LDAP group. There were no NDA-sensitive Phab groups... [14:29:02] hauskater: done [14:29:29] :) [14:29:37] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Remove user "jeroendedauw" from wmde LDAP group - https://phabricator.wikimedia.org/T237254 (10WMDE-leszek) Thanks @MoritzMuehlenhoff ! 
[14:29:46] moritzm: I thought it was just a sudo ldapmodify [14:30:17] but if jeroen got LDAP access as part of WMDE then yeah it makes sense to remove them altogether [14:31:48] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) :-) ` All failures: 4 (bromine, ...), Fresh: 90 jobs ` Unsubbing elukey and Otto to prevent unwanted spam (feel free to resub... [14:32:57] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:33:18] (03CR) 10Ottomata: [C: 03+1] eventlogging: allow sanitization script to run on all db records [puppet] - 10https://gerrit.wikimedia.org/r/548142 (https://phabricator.wikimedia.org/T236818) (owner: 10Elukey) [14:34:06] hauskater: it's an LDAP change only, but we also need to adjust some meta data for our account tracking, hence the Gerrit commit in addition [14:34:30] moritzm: ack - learning :) [14:34:45] 10Operations, 10LDAP-Access-Requests: Add bawolff to either NDA or WMF ldap group - https://phabricator.wikimedia.org/T236636 (10Ottomata) Ah sorry, I just logged in this morning to have an un-submitted post here. I had written 'Ok, `wmf` it is! @Bawolff I've added you (again) into the `wmf` LDAP group. Ple... [14:36:02] 10Operations, 10Dumps-Generation, 10hardware-requests: Get a third dumpsdata server - https://phabricator.wikimedia.org/T219768 (10ArielGlenn) 05Resolved→03Open Re-opening until the base install is done.
[14:36:10] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [14:36:10] !log ema@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [14:36:12] 10Operations, 10Puppet, 10DBA, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10jbond) [14:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:29] 10Operations, 10Puppet, 10DBA, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10jbond) p:05Triage→03Normal [14:36:32] 10Operations: Add docker-engine to buster-wikimedia distribution - https://phabricator.wikimedia.org/T236947 (10Ottomata) 05Open→03Declined Huh, ok will try that next time. Assuming that works I'll close this, will reopen if I have troubles. [14:39:59] (03CR) 10Herron: netops: add host monitoring for scs systems (serial console servers) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547752 (https://phabricator.wikimedia.org/T233318) (owner: 10Herron) [14:40:08] (03PS4) 10Herron: netops: add host monitoring for scs systems (serial console servers) [puppet] - 10https://gerrit.wikimedia.org/r/547752 (https://phabricator.wikimedia.org/T233318) [14:42:45] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10jbond) >>! In T236277#5631496, @CDanis wrote: > In terms of identifying services that use keys issued by the puppet CA -- is it wrong to think that the following wo... 
[14:44:01] 10Operations, 10Puppet, 10DBA, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10jbond) base::expose_puppet_certs - This can be ignored as it relates to the client key pairs and not the CA public certificate which is the file which is changing [14:47:26] (03CR) 10Herron: [C: 03+2] netops: add host monitoring for scs systems (serial console servers) [puppet] - 10https://gerrit.wikimedia.org/r/547752 (https://phabricator.wikimedia.org/T233318) (owner: 10Herron) [14:51:19] (03PS1) 10Andrew Bogott: labweb/cloudweb: fix path for backup jobs [puppet] - 10https://gerrit.wikimedia.org/r/548273 (https://phabricator.wikimedia.org/T237237) [14:51:22] PROBLEM - Host cp5012 is DOWN: PING CRITICAL - Packet loss = 100% [14:51:43] (03PS1) 10Joal: Bump hiera aqs mediawiki_history_reduced [puppet] - 10https://gerrit.wikimedia.org/r/548274 [14:51:48] (03CR) 10Ottomata: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/547519 (https://phabricator.wikimedia.org/T215904) (owner: 10Filippo Giunchedi) [14:51:49] elukey: --^ ? [14:52:23] (03CR) 10Elukey: [C: 03+2] Bump hiera aqs mediawiki_history_reduced [puppet] - 10https://gerrit.wikimedia.org/r/548274 (owner: 10Joal) [14:53:13] if any of ops have time for rotating my ssh key pretty please https://gerrit.wikimedia.org/r/c/operations/puppet/+/548242 [14:56:10] (03PS4) 10Muehlenhoff: Enable ldap-corp1001/2001 as additional replicas [puppet] - 10https://gerrit.wikimedia.org/r/539150 [14:59:16] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:59:46] 10Operations, 10Icinga, 10observability, 10Patch-For-Review: scs monitoring missing in Icinga - https://phabricator.wikimedia.org/T233318 (10herron) Host monitoring for SCS systems has been added to icinga ` Host Status Last Check Duration Attempt Status Information scs-a1... 
[15:01:12] !log joal@deploy1001 Started restart [analytics/aqs/deploy@59a97fa]: (no justification provided) [15:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:23] jynus: Okay, I'm going to retry the thing with the deadlock now, I monitor graphs but heads up for you [15:01:57] ok, thanks for the heads up [15:02:09] did you manage to deploy the cron? [15:03:48] (03PS1) 10Filippo Giunchedi: mtail: add logstash program [puppet] - 10https://gerrit.wikimedia.org/r/548280 (https://phabricator.wikimedia.org/T236343) [15:03:53] (03PS16) 10Jhedden: ceph: add k8s manifests for ceph deployment using rook [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) [15:03:55] (03PS1) 10Filippo Giunchedi: profile: add mtail to logstash [puppet] - 10https://gerrit.wikimedia.org/r/548281 (https://phabricator.wikimedia.org/T236343) [15:05:39] (03CR) 10Jhedden: ceph: add k8s manifests for ceph deployment using rook (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) (owner: 10Jhedden) [15:06:03] (03PS3) 10Jcrespo: check_private_data: ignore comments on private.dblist [puppet] - 10https://gerrit.wikimedia.org/r/547283 (https://phabricator.wikimedia.org/T223602) [15:07:45] jynus: not unfortunately :( [15:08:25] (03CR) 10Jcrespo: "Please re-vote, I will merge and we can test it live also with the T235743 change, before and after filtering." 
[puppet] - 10https://gerrit.wikimedia.org/r/547283 (https://phabricator.wikimedia.org/T223602) (owner: 10Jcrespo) [15:08:38] let me know if I can help after the deploy [15:08:49] Sure [15:09:11] Thanks [15:11:13] (03CR) 10Jcrespo: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545411 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [15:14:17] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10Volans) @CDanis the problem is that all of those identify clients, while for the CA validation we're mostly interested in the server side. So while that surely woul... [15:15:04] (03CR) 10Jcrespo: [C: 04-1] "Almost there- surprisingly, /srv/backup fileset doesn't exist, so it has to be added to modules/profile/manifests/backup/director.pp. It i" [puppet] - 10https://gerrit.wikimedia.org/r/548273 (https://phabricator.wikimedia.org/T237237) (owner: 10Andrew Bogott) [15:17:56] (03CR) 10Jcrespo: [C: 04-1] "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/548273 (https://phabricator.wikimedia.org/T237237) (owner: 10Andrew Bogott) [15:21:27] (03CR) 10Jforrester: [C: 03+1] check_private_data: ignore comments on private.dblist [puppet] - 10https://gerrit.wikimedia.org/r/547283 (https://phabricator.wikimedia.org/T223602) (owner: 10Jcrespo) [15:22:59] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM too" [software/httpbb] - 10https://gerrit.wikimedia.org/r/545689 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [15:23:58] (03CR) 10BryanDavis: [C: 03+1] check_private_data: ignore comments on private.dblist [puppet] - 10https://gerrit.wikimedia.org/r/547283 (https://phabricator.wikimedia.org/T223602) (owner: 10Jcrespo) [15:26:52] (03PS4) 10Jcrespo: check_private_data: ignore comments on private.dblist [puppet] - 10https://gerrit.wikimedia.org/r/547283 (https://phabricator.wikimedia.org/T223602) [15:27:31] (03CR) 
10Jforrester: [C: 03+1] check_private_data: ignore comments on private.dblist [puppet] - 10https://gerrit.wikimedia.org/r/547283 (https://phabricator.wikimedia.org/T223602) (owner: 10Jcrespo) [15:28:12] (03Merged) 10jenkins-bot: Initial version of httpbb, the HTTP black box testing tool. [software/httpbb] - 10https://gerrit.wikimedia.org/r/545689 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [15:28:20] (03CR) 10Jcrespo: [C: 03+2] check_private_data: ignore comments on private.dblist [puppet] - 10https://gerrit.wikimedia.org/r/547283 (https://phabricator.wikimedia.org/T223602) (owner: 10Jcrespo) [15:29:38] (03PS2) 10Andrew Bogott: labweb/cloudweb: fix path for backup jobs [puppet] - 10https://gerrit.wikimedia.org/r/548273 (https://phabricator.wikimedia.org/T237237) [15:30:41] (03PS1) 10Mholloway: WikimediaEditorTasks: Enable revert counts on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548291 [15:31:14] The backport on mwdebug1001 looks good, tested, going live [15:32:11] (03PS3) 10Andrew Bogott: labweb/cloudweb: fix path for backup jobs [puppet] - 10https://gerrit.wikimedia.org/r/548273 (https://phabricator.wikimedia.org/T237237) [15:33:35] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.4/extensions/Wikibase: Wikibase deadlock reduction, [[gerrit:548279|Stop locking and use DISTINCT when finding used terms to delete]] (T236466) (duration: 00m 59s) [15:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:40] T236466: PHP Warning: [data-update-failed]: A data update callback triggered an exception (Wikimedia\Rdbms\Database::makeList: empty input for field wbxl_text_id) [Called from Wikibase\Repo\Content\DataUpdateAdapter::doUpdate in /extensions/Wikibase/repo/includes/Content/DataUpdateAdapter.php at line 62] - https://phabricator.wikimedia.org/T236466 [15:34:36] (03PS1) 10Filippo Giunchedi: monitoring: add dashboard links to grafana_alert [puppet] - 10https://gerrit.wikimedia.org/r/548292 
[15:35:23] (03PS1) 10Elukey: Add analytics users (without ssh keys) to all Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/548293 (https://phabricator.wikimedia.org/T237269) [15:36:36] !log running failing check_private_data report on labsdb1009 T235743 [15:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:40] T235743: Prepare and check storage layer for mnwwiki - https://phabricator.wikimedia.org/T235743 [15:39:31] Amir1: all done? I'm hoping to deploy a quick beta config change [15:40:06] mdholloway: I'm basically staring at the logs but I think you can go ahead [15:40:14] Cool, thanks [15:40:31] (03CR) 10Mholloway: [C: 03+2] WikimediaEditorTasks: Enable revert counts on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548291 (owner: 10Mholloway) [15:41:00] (03PS1) 10Elukey: Add druid/analytics/search system users to all Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/548294 (https://phabricator.wikimedia.org/T237269) [15:41:14] (03Merged) 10jenkins-bot: WikimediaEditorTasks: Enable revert counts on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548291 (owner: 10Mholloway) [15:41:20] jynus: https://logstash.wikimedia.org/goto/2222a6f35f345e7106fc9818dbe04245 + https://logstash.wikimedia.org/goto/537848ac4ad5cc864725c990e30d890f + https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s8&var-role=All&from=now-15m&to=now + [15:41:34] good news or bad news? 
[15:41:36] * Amir1 opens a bottle of champagne [15:41:57] All good, errors are gone, no issue in rows read or response time [15:42:08] actually rows read decreased [15:42:18] well, I don't want to be too pessimistic, but last time I was deceived by fewer errors (due to less traffic) [15:42:41] let's be super careful before declaring victory :-P [15:42:54] I'm monitoring traffic, edit rate, etc :D [15:43:52] there is a latency spike, but on an s3 server, so that should be unrelated [15:43:55] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Enable revert counts on beta (T234955) (duration: 00m 53s) [15:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:59] T234955: API for reverts for SE v3 - https://phabricator.wikimedia.org/T234955 [15:45:09] Amir1: deploy time 15:32-15:33 approx., according to SAL? [15:45:17] yup [15:45:21] cool [15:45:51] I am going to monitor s3, as something is up there [15:47:52] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/548280 (https://phabricator.wikimedia.org/T236343) (owner: 10Filippo Giunchedi) [15:49:38] Amir1: looking very good [15:49:45] apis affected latency is ok too?
[15:50:03] it looks good to me [15:50:09] https://logstash.wikimedia.org/goto/eb15109a166f255f62b959a03d596f2a <- This is beautiful [15:50:37] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5012.eqsin.wmnet'] ` Of which those **FAILED**: ` ['cp5012.eqsin.wmnet'] ` [15:50:39] This looks okay to me https://grafana.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts?orgId=1&from=now-1h&to=now [15:51:56] (03PS2) 10Filippo Giunchedi: monitoring: add dashboard links to grafana_alert [puppet] - 10https://gerrit.wikimedia.org/r/548292 [15:53:46] 10Operations: Add docker-engine to buster-wikimedia distribution - https://phabricator.wikimedia.org/T236947 (10Ottomata) [15:55:19] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10Nuria) @BBlack: we can take a stab at modifying code on VCL if you can CR since that needs to happen before the varnishkafka changes [15:58:18] 10Operations, 10Puppet, 10DBA, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10jbond) [15:58:21] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1003/19233/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/548292 (owner: 10Filippo Giunchedi) [15:59:30] (03CR) 10Ottomata: Add analytics users (without ssh keys) to all Hadoop worker nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/548293 (https://phabricator.wikimedia.org/T237269) (owner: 10Elukey) [15:59:59] (03CR) 10Herron: "Looks good! Please see question inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/548280 (https://phabricator.wikimedia.org/T236343) (owner: 10Filippo Giunchedi) [16:00:02] (03CR) 10Ottomata: "Oh, nm, saw next patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/548293 (https://phabricator.wikimedia.org/T237269) (owner: 10Elukey) [16:00:48] Amir1 looking good https://grafana.wikimedia.org/d/MR93RkVWk/wikibase-api?refresh=5m&orgId=1 [16:05:12] 10Operations, 10serviceops: Upgrade to PHP 7.2.24 - https://phabricator.wikimedia.org/T237239 (10MoritzMuehlenhoff) p:05Triage→03Normal [16:05:32] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10Ottomata) Hm, another idea... > I don't think sub-objects or arrays are supported by varnishkafka. We'll have to set each one as a... [16:05:43] 10Operations, 10Traffic, 10netops: Network unreachable after network-online.target is brought up - https://phabricator.wikimedia.org/T237243 (10MoritzMuehlenhoff) p:05Triage→03Normal [16:05:53] \o/ [16:05:53] 10Operations, 10Packaging, 10serviceops: Build and upload envoy 1.12.0 package. - https://phabricator.wikimedia.org/T237235 (10MoritzMuehlenhoff) p:05Triage→03Normal [16:08:09] Amir1: if you have the time to double check the data is also inserted as expected? [16:08:31] jynus: sure [16:08:38] I know some of the non-canonical data is not that important [16:09:08] and may take some time to check there, so if you can check some of the insertions on the codepath affected for an example item on rcs [16:09:53] I am not sure I expressed myself well, but I hope you understood what I meant [16:10:30] Tested on a random item and I'm pretty sure it works: https://www.wikidata.org/w/index.php?title=Q60749818&diff=1044822302&oldid=850566434 [16:10:37] cool [16:10:47] thanks [16:10:56] that is a surprising amount of errors fone [16:10:58] *gone [16:11:07] which probably has impact on performance too [16:11:18] and for what I can say probably not an easy patch [16:11:21] so thank you!
[16:11:27] *can see [16:11:43] (03PS1) 10Dmaza: Enable SpecialMute page on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548301 (https://phabricator.wikimedia.org/T231577) [16:11:50] I should have done it sooner, sorry (I was in another project for frontend) [16:19:42] 10Operations, 10ops-eqiad, 10DC-Ops: b7-eqiad pdu refresh (Tuesday 11/5 @12pm UTC) - https://phabricator.wikimedia.org/T227542 (10wiki_willy) [16:24:52] (03CR) 10Krinkle: [C: 03+1] "LGTM. Does this change the email or IRC output?" [puppet] - 10https://gerrit.wikimedia.org/r/548292 (owner: 10Filippo Giunchedi) [16:26:47] (03PS1) 10Bstorm: toolforge-k8s: rotate docker logs [puppet] - 10https://gerrit.wikimedia.org/r/548304 (https://phabricator.wikimedia.org/T237270) [16:30:20] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10Ottomata) > - Changes the serialization code path we've been using to produce webrequest for years We discussed this in Analytics s... [16:35:28] 10Operations, 10serviceops: Upgrade to PHP 7.2.24 - https://phabricator.wikimedia.org/T237239 (10Jdforrester-WMF) [16:35:34] 10Operations, 10serviceops: Upgrade to PHP 7.2.24 - https://phabricator.wikimedia.org/T237239 (10Jdforrester-WMF) [16:35:44] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10BBlack) Agreed, let's not go down that road right here (because we have a burning need for this data pronto), but side note to keep... [16:41:49] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [16:52:22] (03CR) 10Jforrester: "recheck" [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/546922 (owner: 10Hashar) [16:56:09] (03CR) 10Ayounsi: [C: 03+1] templates/: replace labs with cloud for some vlans [dns] - 10https://gerrit.wikimedia.org/r/548261 (owner: 10Arturo Borrero Gonzalez) [17:06:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] templates/: replace labs with cloud for some vlans [dns] - 10https://gerrit.wikimedia.org/r/548261 (owner: 10Arturo Borrero Gonzalez) [17:09:45] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [17:10:42] (03Abandoned) 10Arturo Borrero Gonzalez: puppetmaster: base_repo: ensure owner of repo tree [puppet] - 10https://gerrit.wikimedia.org/r/419395 (https://phabricator.wikimedia.org/T184259) (owner: 10Arturo Borrero Gonzalez) [17:18:19] 10Operations, 10LDAP-Access-Requests, 10Security-Team: Offboard Raz Shuty from various Wikimedia systems - https://phabricator.wikimedia.org/T237118 (10sbassett) >>! In T237118#5630251, @MoritzMuehlenhoff wrote: > * "Tag `{{former staff}}` on any relevant project user profile pages" isn't done as part of the... [17:20:47] 10Operations, 10LDAP-Access-Requests, 10Security-Team: Offboard Raz Shuty from various Wikimedia systems - https://phabricator.wikimedia.org/T237118 (10WMDE-leszek) WMDE part of offboarding has happened, yes! [17:22:25] (03CR) 10Nuria: [C: 03+1] eventlogging: allow sanitization script to run on all db records [puppet] - 10https://gerrit.wikimedia.org/r/548142 (https://phabricator.wikimedia.org/T236818) (owner: 10Elukey) [17:23:07] 10Operations, 10LDAP-Access-Requests, 10Security-Team: Offboard Raz Shuty from various Wikimedia systems - https://phabricator.wikimedia.org/T237118 (10Aklapper) @MarcoAurelio: Thanks for catching that! 
Removed [17:26:50] 10Operations, 10ops-eqiad, 10DC-Ops: b7-eqiad pdu refresh (Tuesday 11/5 @12pm UTC) - https://phabricator.wikimedia.org/T227542 (10RobH) [17:27:40] 10Operations, 10ops-eqiad, 10DC-Ops: b7-eqiad pdu refresh (Tuesday 11/5 @12pm UTC) - https://phabricator.wikimedia.org/T227542 (10Joe) I don't want to conflict-edit the task description, but as far as the MW* and WTP* servers no action is needed. [17:28:19] (03PS9) 10Jforrester: Variant configuration: Allow for YAML-based inheritance of configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538129 (https://phabricator.wikimedia.org/T223602) [17:28:31] 10Operations, 10ops-eqiad, 10DC-Ops: b7-eqiad pdu refresh (Tuesday 11/5 @12pm UTC) - https://phabricator.wikimedia.org/T227542 (10jcrespo) [17:30:02] I'm going to be live-testing on mwdebug1001 for a bit. [17:31:10] 10Operations, 10LDAP-Access-Requests, 10Security-Team: Offboard Raz Shuty from various Wikimedia systems - https://phabricator.wikimedia.org/T237118 (10sbassett) @WMDE-leszek Excellent, thanks. [17:32:32] (03CR) 10Jforrester: [C: 03+2] Variant configuration: Allow for YAML-based inheritance of configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538129 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [17:33:20] (03Merged) 10jenkins-bot: Variant configuration: Allow for YAML-based inheritance of configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538129 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [17:34:18] (03CR) 10Mforns: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/548142 (https://phabricator.wikimedia.org/T236818) (owner: 10Elukey) [17:34:25] 10Operations: vega, bromine (bugzilla-static) is trying to backup a directory that doesn't exist - https://phabricator.wikimedia.org/T237233 (10Dzahn) p:05Triage→03High [17:36:01] (03PS2) 10Elukey: eventlogging: allow sanitization script to run on all db records [puppet] - 10https://gerrit.wikimedia.org/r/548142 (https://phabricator.wikimedia.org/T236818) [17:38:03] (03CR) 10Elukey: [C: 03+2] eventlogging: allow sanitization script to run on all db records [puppet] - 10https://gerrit.wikimedia.org/r/548142 (https://phabricator.wikimedia.org/T236818) (owner: 10Elukey) [17:39:23] !log jforrester@deploy1001 Synchronized wmf-config/config/: Sync out YAML config files (duration: 00m 56s) [17:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:06] (03Abandoned) 10Jbond: d-i: add contrib component to d-i configuration [puppet] - 10https://gerrit.wikimedia.org/r/548230 (https://phabricator.wikimedia.org/T158562) (owner: 10Jbond) [17:41:13] (03Abandoned) 10Jbond: Revert "check_puppetrun: don't alert for disabled puppet agents for 1 day" [puppet] - 10https://gerrit.wikimedia.org/r/547195 (owner: 10Jbond) [17:41:54] !log jforrester@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: Update for YAML-reading (offline) (duration: 00m 52s) [17:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:43] (03Abandoned) 10Jbond: puppetdb6: update cumin to use the new puppetdb instance [puppet] - 10https://gerrit.wikimedia.org/r/547209 (owner: 10Jbond) [17:44:56] (03PS1) 10Elukey: eventlogging: run sanitization script on all the db records [puppet] - 10https://gerrit.wikimedia.org/r/548318 (https://phabricator.wikimedia.org/T236818) [17:45:56] (03CR) 10Mforns: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/548318 (https://phabricator.wikimedia.org/T236818) (owner: 10Elukey) [17:48:18] (03PS3) 10Jforrester: Variant configuration: Generate dblists from YAML [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545411 (https://phabricator.wikimedia.org/T223602) [17:52:36] (03CR) 10Jforrester: [C: 03+1] "I think this is good to go. Will do some more exploratory testing." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545411 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [17:52:39] (03PS2) 10Jhedden: bootstrapvz: remove ed25519 ssh host keys after build [puppet] - 10https://gerrit.wikimedia.org/r/547333 [17:53:42] (03PS21) 10Jforrester: Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) [17:53:52] (03CR) 10Jhedden: [C: 03+2] bootstrapvz: remove ed25519 ssh host keys after build [puppet] - 10https://gerrit.wikimedia.org/r/547333 (owner: 10Jhedden) [17:54:41] (03CR) 10jerkins-bot: [V: 04-1] Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [17:57:27] (03PS3) 10Jforrester: Variant configuration: Move some all-wiki configuration from CS to all.yaml [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539436 [17:58:34] (03CR) 10jerkins-bot: [V: 04-1] Variant configuration: Move some all-wiki configuration from CS to all.yaml [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539436 (owner: 10Jforrester) [17:58:52] 10Operations, 10Dumps-Generation, 10hardware-requests: Get a third dumpsdata server - https://phabricator.wikimedia.org/T219768 (10ArielGlenn) [18:00:04] gehel and onimisionipe: #bothumor Q:How do functions break up? A:They stop calling each other. 
Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191104T1800). [18:00:26] jouncebot: here here [18:03:21] (03CR) 10Jcrespo: "Looks good, thanks for the quick patch. I will give it a proper review (check) tomorrow, and deploy if it is ok." [puppet] - 10https://gerrit.wikimedia.org/r/548273 (https://phabricator.wikimedia.org/T237237) (owner: 10Andrew Bogott) [18:05:01] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@2cb2dde]: Event logging via Event Gate and Absolute classpath for munge and runUpdate scripts [18:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:14] jouncebot: next [18:05:15] In 0 hour(s) and 54 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191104T1900) [18:05:26] hope I can make it [18:05:30] cc Urbanecm [18:09:34] !log ppchelko@deploy1001 Started deploy [restbase/deploy@20c710d]: Bump Parsoid-PHP mirroring to 100% T235902 [18:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:39] T235902: Tracking: Shadow Parsoid/PHP deployment to production cluster to handle mirrored reparse traffic - https://phabricator.wikimedia.org/T235902 [18:11:13] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:13:31] (03CR) 10Herron: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/541830 (https://phabricator.wikimedia.org/T235077) (owner: 10Jbond) [18:17:08] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@2cb2dde]: Event logging via Event Gate and Absolute classpath for munge and runUpdate scripts (duration: 12m 07s) [18:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:11] (03CR) 10BryanDavis: newk8s: adjust things to be compatible with migration to the new cluster (035 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/547676 (https://phabricator.wikimedia.org/T236202) (owner: 10Bstorm) [18:21:43] hauskater: you frightened me that the SWAT has already started 🙂 [18:21:55] * Urbanecm is usually out of sync when time changes [18:22:25] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:22:37] Urbanecm: no, I mean, I think I'm having a call on 20:00 my time [18:22:44] in about 30 minutes or so [18:22:53] so I'm not sure I'll be around [18:23:11] aha 🙂. [18:24:03] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@20c710d]: Bump Parsoid-PHP mirroring to 100% T235902 (duration: 14m 30s) [18:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:08] T235902: Tracking: Shadow Parsoid/PHP deployment to production cluster to handle mirrored reparse traffic - https://phabricator.wikimedia.org/T235902 [18:26:44] !log andrew@deploy1001 Started deploy [horizon/deploy@1ac26da]: add new user-selected puppet edit mode [18:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:31] 10Operations, 10ops-eqiad, 10DC-Ops: b7-eqiad pdu refresh (Tuesday 11/5 @12pm UTC) - https://phabricator.wikimedia.org/T227542 (10wiki_willy) a:03Cmjohnson [18:29:13] (03CR) 10Ayounsi: [C: 03+2] "Looks good!" 
[software/homer] - 10https://gerrit.wikimedia.org/r/547639 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [18:30:10] !log andrew@deploy1001 Finished deploy [horizon/deploy@1ac26da]: add new user-selected puppet edit mode (duration: 03m 27s) [18:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:37] 10Operations, 10Puppet, 10DBA, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10Krenair) Acme-chief nginx config probably? [18:30:55] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=codfw [18:32:05] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=eqiad [18:33:35] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:39:57] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589 (10Andrew) I've added an alternative editor that loads much faster. To switch modes, scroll to the bottom of the puppet tab and cl... 
[18:41:37] (PS1) Dzahn: puppetmaster: remove jessie support [puppet] - https://gerrit.wikimedia.org/r/548439
[18:41:39] (PS1) Dzahn: backups: remove bugzilla backup directory [puppet] - https://gerrit.wikimedia.org/r/548440 (https://phabricator.wikimedia.org/T237233)
[18:42:35] (CR) Dzahn: "..unless we still need to support puppetmasters in cloud on jessie?" [puppet] - https://gerrit.wikimedia.org/r/548439 (owner: Dzahn)
[18:43:13] (CR) Ayounsi: Icinga: add parents to mgmt devices (4 comments) [puppet] - https://gerrit.wikimedia.org/r/547767 (owner: Ayounsi)
[18:44:47] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[18:45:09] Operations, Wikimedia-Bugzilla, Patch-For-Review: vega, bromine (bugzilla-static) is trying to backup a directory that doesn't exist - https://phabricator.wikimedia.org/T237233 (Dzahn)
[18:45:52] Operations, Wikimedia-Bugzilla, Patch-For-Review: vega, bromine (bugzilla-static) is trying to backup a directory that doesn't exist - https://phabricator.wikimedia.org/T237233 (Dzahn) Yea, this directory does not exist anymore on the servers. (I am not sure who deleted it when though)..but: There i...
[18:46:28] (CR) Dzahn: [C: +2] backups: remove bugzilla backup directory [puppet] - https://gerrit.wikimedia.org/r/548440 (https://phabricator.wikimedia.org/T237233) (owner: Dzahn)
[18:46:37] (PS2) Dzahn: backups: remove bugzilla backup directory [puppet] - https://gerrit.wikimedia.org/r/548440 (https://phabricator.wikimedia.org/T237233)
[18:49:24] (PS1) BBlack: TLS Analytics: send to VCL log in condensed form [puppet] - https://gerrit.wikimedia.org/r/548443 (https://phabricator.wikimedia.org/T233661)
[18:49:28] (PS1) BBlack: TLS Analytics: pick up VCL Log in webrequest [puppet] - https://gerrit.wikimedia.org/r/548444 (https://phabricator.wikimedia.org/T233661)
[18:49:50] (CR) jerkins-bot: [V: -1] TLS Analytics: send to VCL log in condensed form [puppet] - https://gerrit.wikimedia.org/r/548443 (https://phabricator.wikimedia.org/T233661) (owner: BBlack)
[18:53:10] Operations, Wikimedia-Bugzilla, Patch-For-Review: vega, bromine (bugzilla-static) is trying to backup a directory that doesn't exist - https://phabricator.wikimedia.org/T237233 (Dzahn) Open→Resolved Removed. On backup1001: ` Notice: /Stage[main]/Bacula::Director/File[/etc/bacula/conf.d/fil...
[18:53:20] Operations, DBA, serviceops, Goal, Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (Dzahn)
[18:54:20] (CR) BBlack: "recheck" [puppet] - https://gerrit.wikimedia.org/r/548443 (https://phabricator.wikimedia.org/T233661) (owner: BBlack)
[18:54:57] Operations, Analytics, Analytics-Kanban, Traffic, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (BBlack) Patches above look sane? I went ahead and shortened the key names down to the minimum to prevent bloat at these layers. We can...
[18:55:21] Urbanecm: have to leave; would you be able to deploy both patches for me?
[18:55:39] hauskater: sure
[18:55:51] DNS cannot be tested, the other is just checking Special:GlobalGroupMembership at metawiki and see if 'autoreview' is there
[18:55:58] thanks much Urbanecm
[18:56:07] I really have to go, otherwise I'd be present
[18:56:20] (PS3) Ayounsi: Initial forwarding-options templating [homer/public] - https://gerrit.wikimedia.org/r/547586
[18:56:26] (DNS can be tested, it logs to logstash when blocked)
[18:56:32] anyway, yw hauskater
[18:56:35] :)
[19:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Morning SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191104T1900).
[19:00:04] hauskater and dmaza: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[19:00:17] I can SWAT today
[19:00:35] (CR) Urbanecm: [C: +2] Allow FlaggedRevs' 'autoreview' permission to be assigned globally [mediawiki-config] - https://gerrit.wikimedia.org/r/547957 (owner: MarcoAurelio)
[19:00:40] (CR) Urbanecm: [C: +2] Enable DNS blacklist for es.wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/547863 (https://phabricator.wikimedia.org/T237151) (owner: MarcoAurelio)
[19:01:27] dmaza: around?
[19:01:44] (Merged) jenkins-bot: Allow FlaggedRevs' 'autoreview' permission to be assigned globally [mediawiki-config] - https://gerrit.wikimedia.org/r/547957 (owner: MarcoAurelio)
[19:01:46] Urbanecm: yes
[19:01:48] (Merged) jenkins-bot: Enable DNS blacklist for es.wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/547863 (https://phabricator.wikimedia.org/T237151) (owner: MarcoAurelio)
[19:02:38] (PS2) Urbanecm: Enable SpecialMute page on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/548301 (https://phabricator.wikimedia.org/T231577) (owner: Dmaza)
[19:02:44] (CR) Urbanecm: [C: +2] Enable SpecialMute page on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/548301 (https://phabricator.wikimedia.org/T231577) (owner: Dmaza)
[19:02:47] (CR) Dzahn: "is it ok now?" [puppet] - https://gerrit.wikimedia.org/r/522567 (owner: Dzahn)
[19:03:30] (Merged) jenkins-bot: Enable SpecialMute page on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/548301 (https://phabricator.wikimedia.org/T231577) (owner: Dmaza)
[19:05:32] (CR) Volans: "one very minor nit" (2 comments) [software/homer] - https://gerrit.wikimedia.org/r/547523 (owner: Ayounsi)
[19:05:45] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: 0fc3909: Allow FlaggedRevs autoreview permission to be assigned globally (duration: 00m 54s)
[19:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:32] (CR) Dzahn: [C: +1] "> Patch Set 2: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/548292 (owner: Filippo Giunchedi)
[19:07:32] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 9204768: Enable DNS blacklist for es.wikinews (T237151) (duration: 00m 53s)
[19:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:37] T237151: Enable DNS blacklist for es.wikinews - https://phabricator.wikimedia.org/T237151
[19:07:53] dmaza: please test your patch at mwdebug1001 and let me know
[19:08:04] thank you. Testing now
[19:10:49] (CR) Ottomata: TLS Analytics: pick up VCL Log in webrequest (1 comment) [puppet] - https://gerrit.wikimedia.org/r/548444 (https://phabricator.wikimedia.org/T233661) (owner: BBlack)
[19:11:04] (CR) Ayounsi: Add the ability to ignore some or all Junos warnings (2 comments) [software/homer] - https://gerrit.wikimedia.org/r/547523 (owner: Ayounsi)
[19:11:21] (PS4) Ayounsi: Add the ability to ignore some or all Junos warnings [software/homer] - https://gerrit.wikimedia.org/r/547523
[19:11:39] Urbanecm: there seems to be a problem in my tests.. can we roll it back?
[19:11:48] dmaza: certainly
[19:11:57] Thank you
[19:12:23] (CR) Volans: wmf_auto_reimage: Adjust message about waiting for puppet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/522567 (owner: Dzahn)
[19:12:59] (CR) Volans: [C: +1] "LGTM" [software/homer] - https://gerrit.wikimedia.org/r/547523 (owner: Ayounsi)
[19:13:19] Operations, Puppet, Cloud-Services, Patch-For-Review, cloud-services-team (Kanban): Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589 (Krenair) >>! In T149589#5632880, @Andrew wrote: > I've added an alternative editor that loads much faster. To switch modes, scr...
[19:13:26] (PS1) Urbanecm: Revert "Enable SpecialMute page on all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/548452
[19:13:37] (CR) Urbanecm: [V: +2 C: +2] Revert "Enable SpecialMute page on all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/548452 (owner: Urbanecm)
[19:13:44] dmaza: done
[19:13:59] (CR) Ayounsi: [C: +2] Add the ability to ignore some or all Junos warnings [software/homer] - https://gerrit.wikimedia.org/r/547523 (owner: Ayounsi)
[19:14:16] (PS1) Urbanecm: Add throttle rule for bard college editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/548453 (https://phabricator.wikimedia.org/T236955)
[19:14:33] (CR) Urbanecm: [C: +2] Add throttle rule for bard college editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/548453 (https://phabricator.wikimedia.org/T236955) (owner: Urbanecm)
[19:15:02] Urbanecm: thank you very much
[19:15:09] you're welcome
[19:15:18] (Merged) jenkins-bot: Add throttle rule for bard college editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/548453 (https://phabricator.wikimedia.org/T236955) (owner: Urbanecm)
[19:15:41] (CR) Ottomata: TLS Analytics: send to VCL log in condensed form (1 comment) [puppet] - https://gerrit.wikimedia.org/r/548443 (https://phabricator.wikimedia.org/T233661) (owner: BBlack)
[19:16:18] Operations, Analytics, Analytics-Kanban, Traffic, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (Ottomata) Hm, would be ok with me, but likely whatever we choose we'll be stuck with forever. I tend to prefer descriptive names in gene...
[19:16:20] (Merged) jenkins-bot: Add the ability to ignore some or all Junos warnings [software/homer] - https://gerrit.wikimedia.org/r/547523 (owner: Ayounsi)
[19:16:59] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: SWAT: 6a4b966: Add throttle rule for bard college editathon (T236955) (duration: 00m 54s)
[19:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:04] T236955: Lift IP cap for edit-a-thon at Bard College Nov. 11, 2019 - https://phabricator.wikimedia.org/T236955
[19:17:09] !log cobalt - stopping services, removing apache2
[19:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:21:41] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=eqiad
[19:22:09] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=codfw
[19:23:31] !log Morning SWAT done
[19:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:59] (CR) Bstorm: newk8s: adjust things to be compatible with migration to the new cluster (2 comments) [software/tools-webservice] - https://gerrit.wikimedia.org/r/547676 (https://phabricator.wikimedia.org/T236202) (owner: Bstorm)
[19:25:45] Operations, Analytics, Analytics-Kanban, Traffic, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (BBlack) >>! In T233661#5633090, @Ottomata wrote: > Hm, would be ok with me, but likely whatever we choose we'll be stuck with forever. I...
[19:26:46] Operations, Analytics, Analytics-Kanban, Traffic, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (BBlack) Nevermind, I see it in the gerrit comments
[19:27:59] (PS5) Dzahn: wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - https://gerrit.wikimedia.org/r/522567
[19:29:40] (CR) BBlack: TLS Analytics: send to VCL log in condensed form (1 comment) [puppet] - https://gerrit.wikimedia.org/r/548443 (https://phabricator.wikimedia.org/T233661) (owner: BBlack)
[19:30:18] (CR) BBlack: TLS Analytics: pick up VCL Log in webrequest (1 comment) [puppet] - https://gerrit.wikimedia.org/r/548444 (https://phabricator.wikimedia.org/T233661) (owner: BBlack)
[19:30:29] (PS2) BBlack: TLS Analytics: send to VCL log in condensed form [puppet] - https://gerrit.wikimedia.org/r/548443 (https://phabricator.wikimedia.org/T233661)
[19:30:31] (PS2) BBlack: TLS Analytics: pick up VCL Log in webrequest [puppet] - https://gerrit.wikimedia.org/r/548444 (https://phabricator.wikimedia.org/T233661)
[19:30:34] (CR) jerkins-bot: [V: -1] wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - https://gerrit.wikimedia.org/r/522567 (owner: Dzahn)
[19:30:59] Operations, Analytics, Analytics-Kanban, Traffic, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (Ottomata) > I'm not sure if there's limitations on overall length of the varnishkafka inputs/outputs. Shouldn't be from varnishkafka, but...
[19:31:01] (PS1) RLazarus: httpbb: Create a new Puppet module for httpbb. [puppet] - https://gerrit.wikimedia.org/r/548461 (https://phabricator.wikimedia.org/T236699)
[19:31:58] (PS1) Urbanecm: Set namespace alias for Index: (NS 102/103) for elwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/548463 (https://phabricator.wikimedia.org/T237253)
[19:33:36] (PS3) BBlack: TLS Analytics: send to VCL log in condensed form [puppet] - https://gerrit.wikimedia.org/r/548443 (https://phabricator.wikimedia.org/T233661)
[19:33:38] (PS3) BBlack: TLS Analytics: pick up VCL Log in webrequest [puppet] - https://gerrit.wikimedia.org/r/548444 (https://phabricator.wikimedia.org/T233661)
[19:34:40] (CR) BBlack: "PS3 has a potential compromise on the subkey lengths, they're all slightly-descriptive and 4 characters long (but still don't include the " [puppet] - https://gerrit.wikimedia.org/r/548443 (https://phabricator.wikimedia.org/T233661) (owner: BBlack)
[19:38:40] (PS1) Ottomata: Include hdfs_cleaner an an-coord node to clean HDFS /tmp dir [puppet] - https://gerrit.wikimedia.org/r/548468 (https://phabricator.wikimedia.org/T235200)
[19:39:29] (CR) Ottomata: [C: +1] "+1 from me, joal?" [puppet] - https://gerrit.wikimedia.org/r/548443 (https://phabricator.wikimedia.org/T233661) (owner: BBlack)
[19:39:47] (CR) Catrope: [C: -1] "The intro links also need the 'image' ones, which I've added in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/547831" [mediawiki-config] - https://gerrit.wikimedia.org/r/548260 (https://phabricator.wikimedia.org/T235042) (owner: Kosta Harlan)
[19:39:56] (CR) Ottomata: [C: +1] TLS Analytics: pick up VCL Log in webrequest [puppet] - https://gerrit.wikimedia.org/r/548444 (https://phabricator.wikimedia.org/T233661) (owner: BBlack)
[19:40:42] (CR) jerkins-bot: [V: -1] Include hdfs_cleaner an an-coord node to clean HDFS /tmp dir [puppet] - https://gerrit.wikimedia.org/r/548468 (https://phabricator.wikimedia.org/T235200) (owner: Ottomata)
[19:41:36] (PS2) Ottomata: Include hdfs_cleaner an an-coord node to clean HDFS /tmp dir [puppet] - https://gerrit.wikimedia.org/r/548468 (https://phabricator.wikimedia.org/T235200)
[19:44:28] (CR) Nuria: [C: +1] "Looks good, +1 to descriptive names" [puppet] - https://gerrit.wikimedia.org/r/548443 (https://phabricator.wikimedia.org/T233661) (owner: BBlack)
[19:44:40] (CR) Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/19235/an-coord1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/548468 (https://phabricator.wikimedia.org/T235200) (owner: Ottomata)
[19:46:15] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[19:46:44] (CR) Joal: [C: +1] "Names look good to me as well - Thanks a lot for the effort in keeping them small AND understandable." [puppet] - https://gerrit.wikimedia.org/r/548443 (https://phabricator.wikimedia.org/T233661) (owner: BBlack)
[19:48:20] (CR) Nuria: "Adding @elukey" [puppet] - https://gerrit.wikimedia.org/r/548444 (https://phabricator.wikimedia.org/T233661) (owner: BBlack)
[19:49:43] (CR) Ottomata: [C: +2] Spark 2.4.4 release [debs/spark2] (debian) - https://gerrit.wikimedia.org/r/542225 (https://phabricator.wikimedia.org/T222253) (owner: Ottomata)
[19:49:55] (CR) Jforrester: Set namespace alias for Index: (NS 102/103) for elwikisource (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/548463 (https://phabricator.wikimedia.org/T237253) (owner: Urbanecm)
[19:51:03] Operations, Analytics, Analytics-Kanban, Traffic, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (JAllemandou) I checked for message length in one day of webrequest, and we top at 4916 bytes. I think Kafka will be fine as per message-s...
[19:51:11] (PS2) Urbanecm: Set namespace alias for Index: (NS 102/103) for elwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/548463 (https://phabricator.wikimedia.org/T237253)
[19:51:26] (CR) Urbanecm: Set namespace alias for Index: (NS 102/103) for elwikisource (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/548463 (https://phabricator.wikimedia.org/T237253) (owner: Urbanecm)
[19:51:58] (CR) jerkins-bot: [V: -1] Set namespace alias for Index: (NS 102/103) for elwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/548463 (https://phabricator.wikimedia.org/T237253) (owner: Urbanecm)
[19:53:06] (PS3) Urbanecm: Set namespace alias for Index: (NS 102/103) for elwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/548463 (https://phabricator.wikimedia.org/T237253)
[19:56:26] (PS4) BBlack: TLS Analytics: send to VCL log in condensed form [puppet] - https://gerrit.wikimedia.org/r/548443 (https://phabricator.wikimedia.org/T233661)
[19:56:28] (PS4) BBlack: TLS Analytics: pick up VCL Log in webrequest [puppet] - https://gerrit.wikimedia.org/r/548444 (https://phabricator.wikimedia.org/T233661)
[19:57:46] (CR) BBlack: [C: +2] TLS Analytics: send to VCL log in condensed form [puppet] - https://gerrit.wikimedia.org/r/548443 (https://phabricator.wikimedia.org/T233661) (owner: BBlack)
[19:59:17] (PS1) CDanis: puppet-merge: show who is holding the lock [puppet] - https://gerrit.wikimedia.org/r/548478
[20:01:40] (Abandoned) Kosta Harlan: GrowthExperiments: Add suggested edits links and remote config title [mediawiki-config] - https://gerrit.wikimedia.org/r/548260 (https://phabricator.wikimedia.org/T235042) (owner: Kosta Harlan)
[20:02:35] (CR) Kosta Harlan: [C: +1] GrowthExperiments: Configure intro links for suggested edits [mediawiki-config] - https://gerrit.wikimedia.org/r/547831 (https://phabricator.wikimedia.org/T235723) (owner: Catrope)
[20:16:57] (PS1) Ottomata: Bump refinery-job versions to 0.0.105 for Spark 2.4.4 upgrade [puppet] - https://gerrit.wikimedia.org/r/548488 (https://phabricator.wikimedia.org/T222253)
[20:24:10] (PS2) CDanis: puppet-merge: show who is holding the lock [puppet] - https://gerrit.wikimedia.org/r/548478
[20:25:45] Operations, Puppet, Cloud-Services, Patch-For-Review, cloud-services-team (Kanban): Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589 (bd808) >>! In T149589#5632880, @Andrew wrote: > I've added an alternative editor that loads much faster. To switch modes, scrol...
[20:34:44] (PS2) Bstorm: toolforge-k8s: rotate docker logs [puppet] - https://gerrit.wikimedia.org/r/548304 (https://phabricator.wikimedia.org/T237270)
[20:37:18] (PS9) Bstorm: new k8s: adjust things to be compatible with migration to the new cluster [software/tools-webservice] - https://gerrit.wikimedia.org/r/547676 (https://phabricator.wikimedia.org/T236202)
[20:49:21] Operations, Commons, MediaWiki-File-management, Multimedia, and 8 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (Gilles)
[20:50:47] (PS10) Bstorm: new k8s: adjust things to be compatible with migration to the new cluster [software/tools-webservice] - https://gerrit.wikimedia.org/r/547676 (https://phabricator.wikimedia.org/T236202)
[20:51:44] !log Testing twitter feed following account confirmation
[20:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:00] its alive :)
[20:52:57] (PS6) Dzahn: wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - https://gerrit.wikimedia.org/r/522567
[20:53:09] ah, we hadn't been tweeting for a while?
[20:54:05] (PS11) Bstorm: new k8s: adjust things to be compatible with migration to the new cluster [software/tools-webservice] - https://gerrit.wikimedia.org/r/547676 (https://phabricator.wikimedia.org/T236202)
[20:55:17] (CR) jerkins-bot: [V: -1] wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - https://gerrit.wikimedia.org/r/522567 (owner: Dzahn)
[20:56:10] (CR) Bstorm: new k8s: adjust things to be compatible with migration to the new cluster (3 comments) [software/tools-webservice] - https://gerrit.wikimedia.org/r/547676 (https://phabricator.wikimedia.org/T236202) (owner: Bstorm)
[20:56:43] mutante: yeah, Twitter soft locked the account until I just "proved" that it is not a bot (which of course is a huge lie but whatever)
[20:58:28] (PS6) Ottomata: Add eventgate-logging-external instance [deployment-charts] - https://gerrit.wikimedia.org/r/547307 (https://phabricator.wikimedia.org/T236386)
[21:00:04] cscott, arlolra, subbu, bearND, halfak, accraze, and mdholloway: Dear deployers, time to do the Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191104T2100).
[21:00:08] bd808: gotcha, thanks!
[21:00:28] no parsoid deploy today
[21:02:25] (CR) Bstorm: "On the ingress, I can play with it a bit. Maybe with both on the ingress, it will be a decent set up. The number of ingresses is more like" [software/tools-webservice] - https://gerrit.wikimedia.org/r/547676 (https://phabricator.wikimedia.org/T236202) (owner: Bstorm)
[21:02:27] (PS4) Ayounsi: Initial forwarding-options templating [homer/public] - https://gerrit.wikimedia.org/r/547586
[21:03:02] (CR) Ayounsi: "This change is ready for review." (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/547586 (owner: Ayounsi)
[21:03:18] (CR) Bstorm: new k8s: adjust things to be compatible with migration to the new cluster (3 comments) [software/tools-webservice] - https://gerrit.wikimedia.org/r/547676 (https://phabricator.wikimedia.org/T236202) (owner: Bstorm)
[21:09:30] Operations, LDAP-Access-Requests: Add bawolff to either NDA or WMF ldap group - https://phabricator.wikimedia.org/T236636 (Bawolff) >>! In T236636#5631543, @Ottomata wrote: > Ah sorry, I just logged in this morning to have an un-submitted post here. I had written 'Ok, `wmf` it is! @Bawolff I've added y...
[21:09:59] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:17:48] (PS7) Dzahn: wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - https://gerrit.wikimedia.org/r/522567
[21:17:49] PROBLEM - Host cp3057 is DOWN: PING CRITICAL - Packet loss = 100%
[21:20:06] (CR) jerkins-bot: [V: -1] wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - https://gerrit.wikimedia.org/r/522567 (owner: Dzahn)
[21:21:34] Operations, DBA, serviceops, Goal, Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (jcrespo) Down to 2: ` All failures: 2 (cloudweb2001-dev, ...), Fresh: 90 jobs ` Which should be fixed when cloud patch is reviewed an...
[21:22:18] (CR) Nikerabbit: Variant configuration: Allow for YAML-based inheritance of configuration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/538129 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[21:22:39] (CR) Bstorm: ceph: add k8s manifests for ceph deployment using rook (1 comment) [puppet] - https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) (owner: Jhedden)
[21:25:05] Operations, serviceops: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (Dzahn)
[21:28:01] (PS3) Bstorm: toolforge-k8s: rotate docker logs [puppet] - https://gerrit.wikimedia.org/r/548304 (https://phabricator.wikimedia.org/T237270)
[21:28:04] (CR) Jhedden: ceph: add k8s manifests for ceph deployment using rook (1 comment) [puppet] - https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) (owner: Jhedden)
[21:32:01] (CR) Dzahn: "ploticus is missing for https://phabricator.wikimedia.org/T237304" [puppet] - https://gerrit.wikimedia.org/r/540154 (https://phabricator.wikimedia.org/T195847) (owner: Giuseppe Lavagetto)
[21:35:52] (PS1) Dzahn: mediawiki::packages: re-add ploticus for EasyTimeline extension [puppet] - https://gerrit.wikimedia.org/r/548533 (https://phabricator.wikimedia.org/T237304)
[21:37:17] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:38:07] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[21:42:02] (PS2) Dzahn: mediawiki::packages: re-add ploticus for EasyTimeline extension [puppet] - https://gerrit.wikimedia.org/r/548533 (https://phabricator.wikimedia.org/T237304)
[21:42:29] (CR) Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/548533" [puppet] - https://gerrit.wikimedia.org/r/540154 (https://phabricator.wikimedia.org/T195847) (owner: Giuseppe Lavagetto)
[21:51:34] (CR) Subramanya Sastry: "I don't fully understand the original changes (removal of ploticus and math packages), so I'll let Reedy & Joe review this." [puppet] - https://gerrit.wikimedia.org/r/548533 (https://phabricator.wikimedia.org/T237304) (owner: Dzahn)
[21:54:41] (PS7) Ottomata: Add eventgate-logging-external instance [deployment-charts] - https://gerrit.wikimedia.org/r/547307 (https://phabricator.wikimedia.org/T236386)
[21:56:19] (CR) Cwhite: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/548281 (https://phabricator.wikimedia.org/T236343) (owner: Filippo Giunchedi)
[21:56:53] (PS14) CDanis: rsync: add option to TLS-wrap communications [puppet] - https://gerrit.wikimedia.org/r/547527
[21:57:07] (CR) CDanis: rsync: add option to TLS-wrap communications (2 comments) [puppet] - https://gerrit.wikimedia.org/r/547527 (owner: CDanis)
[21:57:09] (CR) Reedy: [C: +1] "I guess it was wrongly categorised at some point as (only) being used for Math rendering. Which while the Math extension et al might've us" [puppet] - https://gerrit.wikimedia.org/r/548533 (https://phabricator.wikimedia.org/T237304) (owner: Dzahn)
[21:57:11] (CR) Bstorm: ceph: add k8s manifests for ceph deployment using rook (1 comment) [puppet] - https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) (owner: Jhedden)
[21:57:47] (CR) Jforrester: Variant configuration: Allow for YAML-based inheritance of configuration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/538129 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[21:58:37] Operations, Analytics, Analytics-Kanban, Traffic, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (Nuria) @BBlack: once we deploy the VCL/varnish-kafka chnages we need to change our refine pipeline to read these values, when we deploy t...
[22:00:04] Reedy and sbassett: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191104T2200).
[22:00:29] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[22:01:36] Operations, ops-eqiad, Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (Jclark-ctr) received replacement drive @elukey @Gehel.
[22:05:49] Operations, Parsoid-PHP, serviceops: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (ssastry) Later next month, we are also going to relax some resource limit constraints which we imported from Parsoid/JS (and which we have wanted to relax for a while now but never got ar...
[22:06:14] (PS3) Cwhite: hiera: update ores to pass statsd through statsd_exporter [puppet] - https://gerrit.wikimedia.org/r/538976 (https://phabricator.wikimedia.org/T205870)
[22:06:44] (PS4) Cwhite: hiera: update ores to pass statsd through statsd_exporter [puppet] - https://gerrit.wikimedia.org/r/538976 (https://phabricator.wikimedia.org/T205870)
[22:08:05] !log The Wikimedia SAL Twitter feed is now @wikimedia_sal (https://twitter.com/wikimedia_sal) T237322
[22:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:10] T237322: Change twitter handle of Wikimedia Tech SAL - https://phabricator.wikimedia.org/T237322
[22:09:19] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:11:41] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[22:13:39] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:17:13] Operations, ops-eqiad, Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (Jclark-ctr) Replaced drive sda . @Gehel
[22:18:31] (CR) BBlack: [C: +2] TLS Analytics: pick up VCL Log in webrequest [puppet] - https://gerrit.wikimedia.org/r/548444 (https://phabricator.wikimedia.org/T233661) (owner: BBlack)
[22:20:23] (PS3) Thcipriani: logging: add logspam utilities [puppet] - https://gerrit.wikimedia.org/r/547777 (owner: Brennen Bearnes)
[22:22:05] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:22:10] Operations, ops-eqiad, Cloud-Services, cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (JHedden) The management interface for cloudcephosd1001.mgmt is currently unavailable, could we get someone take a look at it please? `bast...
[22:22:49] thcipriani hi, could you add your +1 to https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545367/ please? :)
[22:22:57] Operations, ops-eqiad, Analytics, Analytics-Cluster: analytics1062 lost one of its power supplies - https://phabricator.wikimedia.org/T237133 (Jclark-ctr) unfortunately not a loose power cord Submitted Tech direct ticket for replacement psu Service Request 1001998096
[22:23:44] (CR) Thcipriani: [C: +1] "Should be safe to turn back on." [puppet] - https://gerrit.wikimedia.org/r/545367 (owner: Paladox)
[22:23:50] paladox: thanks for the reminder.
[22:24:04] thcipriani thanks!!
[22:24:15] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:27:53] (PS1) Dzahn: cumin: add missing alias to cover labtestpuppetmaster2001 [puppet] - https://gerrit.wikimedia.org/r/548539 [22:34:31] (PS3) Catrope: GrowthExperiments: Require opt-in for suggested edits on cawiki beta [mediawiki-config] - https://gerrit.wikimedia.org/r/547855 (https://phabricator.wikimedia.org/T236968) [22:34:48] (CR) Catrope: [C: +2] GrowthExperiments: Require opt-in for suggested edits on cawiki beta [mediawiki-config] - https://gerrit.wikimedia.org/r/547855 (https://phabricator.wikimedia.org/T236968) (owner: Catrope) [22:35:23] (CR) Bstorm: "The more I think about this, the more I really want anything to do with the new name on a separate patch. Just getting this working is eno" [software/tools-webservice] - https://gerrit.wikimedia.org/r/547676 (https://phabricator.wikimedia.org/T236202) (owner: Bstorm) [22:35:34] (Merged) jenkins-bot: GrowthExperiments: Require opt-in for suggested edits on cawiki beta [mediawiki-config] - https://gerrit.wikimedia.org/r/547855 (https://phabricator.wikimedia.org/T236968) (owner: Catrope) [22:38:26] (PS2) Dzahn: cumin: add missing alias to cover labtestpuppetmaster2001 [puppet] - https://gerrit.wikimedia.org/r/548539 [22:41:39] (PS2) Dzahn: Revert "Revert "gerrit: enable jgit gc"" [puppet] - https://gerrit.wikimedia.org/r/545367 (owner: Paladox) [22:42:38] (PS4) Bstorm: toolforge-k8s: rotate docker logs [puppet] - https://gerrit.wikimedia.org/r/548304 (https://phabricator.wikimedia.org/T237270) [22:43:33] paladox: ^ well, except the comment says "Renable after 2.15.12" and we are already on 2.15.14 [22:43:46] should we just remove the "homedir" file entirely?
[22:43:59] since it's commented out anyways but might be confusing [22:44:14] mutante nah, we will be using it for git v2 [22:44:25] and it'll make it easy to disable jgit gc in the future if we have to [22:44:45] (CR) Bstorm: [C: +2] toolforge-k8s: rotate docker logs [puppet] - https://gerrit.wikimedia.org/r/548304 (https://phabricator.wikimedia.org/T237270) (owner: Bstorm) [22:48:15] (PS3) Paladox: Gerrit: enable jgit gc [puppet] - https://gerrit.wikimedia.org/r/545367 [22:49:11] Operations, ops-eqiad, Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts: ` ['elastic1046.eqiad.wmnet'] ` The log can be found in `/var/log/wmf... [22:53:40] (CR) Dzahn: [C: +2] Gerrit: enable jgit gc [puppet] - https://gerrit.wikimedia.org/r/545367 (owner: Paladox) [22:54:25] (PS24) Cwhite: lvs, prometheus, profile: add blackbox job helper and enable openapi endpoint scrapes [puppet] - https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [22:55:47] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:56:29] (CR) jerkins-bot: [V: -1] lvs, prometheus, profile: add blackbox job helper and enable openapi endpoint scrapes [puppet] - https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite) [22:57:41] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:03:02] !log gehel@cumin1001 START - Cookbook sre.hosts.downtime [23:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:05] !log gehel@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:42] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:10:50] (PS25) Cwhite: lvs, prometheus, profile: add blackbox job helper and enable openapi scrapes [puppet] - https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [23:11:31] !log milimetric@deploy1001 Started deploy [analytics/refinery@99f1535]: Fix for geoeditors jobs [23:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:00] Operations, ops-eqiad, Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic1046.eqiad.wmnet'] ` and were **ALL** successful.
[23:15:07] (PS1) Dzahn: gerrit: link /etc/gerrit to /var/lib/gerrit2/review_site/etc [puppet] - https://gerrit.wikimedia.org/r/548545 [23:15:41] (CR) jerkins-bot: [V: -1] gerrit: link /etc/gerrit to /var/lib/gerrit2/review_site/etc [puppet] - https://gerrit.wikimedia.org/r/548545 (owner: Dzahn) [23:17:09] (CR) Bstorm: new k8s: adjust things to be compatible with migration to the new cluster (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/547676 (https://phabricator.wikimedia.org/T236202) (owner: Bstorm) [23:18:00] (PS2) Dzahn: gerrit: link /etc/gerrit to /var/lib/gerrit2/review_site/etc [puppet] - https://gerrit.wikimedia.org/r/548545 [23:18:51] !log milimetric@deploy1001 Finished deploy [analytics/refinery@99f1535]: Fix for geoeditors jobs (duration: 07m 20s) [23:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:25] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:23:53] (PS1) Dzahn: gerrit: remove pre-buster support [puppet] - https://gerrit.wikimedia.org/r/548547 [23:29:50] (PS1) Gehel: elasticsearch: ensure python prometheus client in installed [puppet] - https://gerrit.wikimedia.org/r/548548 (https://phabricator.wikimedia.org/T228606) [23:31:48] (PS1) Paladox: Add mysql-connector-java [software/gerrit] (deploy/wmf/stable-2.15) - https://gerrit.wikimedia.org/r/548549 [23:32:41] (PS1) Dzahn: gerrit: ensure tmp dir under review_site exists [puppet] - https://gerrit.wikimedia.org/r/548550 (https://phabricator.wikimedia.org/T176774) [23:33:16] (CR) Paladox: [C: +1] gerrit: ensure tmp dir under review_site exists [puppet] - https://gerrit.wikimedia.org/r/548550 (https://phabricator.wikimedia.org/T176774) (owner: Dzahn) [23:33:54] (CR) Alexandros Kosiaris: "This is arguably going to break in some case (i.e.
when the logs have already been rotated but no new ones are around) kubectl logs comman" [puppet] - https://gerrit.wikimedia.org/r/548304 (https://phabricator.wikimedia.org/T237270) (owner: Bstorm) [23:34:00] (CR) Dzahn: [C: +1] elasticsearch: ensure python prometheus client in installed [puppet] - https://gerrit.wikimedia.org/r/548548 (https://phabricator.wikimedia.org/T228606) (owner: Gehel) [23:34:39] (CR) Gehel: [C: +2] elasticsearch: ensure python prometheus client in installed [puppet] - https://gerrit.wikimedia.org/r/548548 (https://phabricator.wikimedia.org/T228606) (owner: Gehel) [23:35:50] (PS26) Cwhite: lvs, prometheus, profile: add blackbox job helper and enable openapi scrapes [puppet] - https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [23:38:16] (CR) Dzahn: [C: +1] Add mysql-connector-java [software/gerrit] (deploy/wmf/stable-2.15) - https://gerrit.wikimedia.org/r/548549 (owner: Paladox) [23:38:54] (CR) Bstorm: "> Patch Set 4:" [puppet] - https://gerrit.wikimedia.org/r/548304 (https://phabricator.wikimedia.org/T237270) (owner: Bstorm) [23:42:09] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:42:19] (CR) Cwhite: "PCC checks out: https://puppet-compiler.wmflabs.org/compiler1003/19237/" [puppet] - https://gerrit.wikimedia.org/r/538976 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite) [23:42:35] (PS1) Paladox: Gerrit: Symnlink lib/mysql-connector to gerrit deployment repo [puppet] - https://gerrit.wikimedia.org/r/548552 [23:42:55] (CR) Alexandros Kosiaris: [C: +1] labweb/cloudweb: fix path for backup jobs [puppet] - https://gerrit.wikimedia.org/r/548273 (https://phabricator.wikimedia.org/T237237) (owner: Andrew Bogott) [23:43:15] (PS2) Paladox: Gerrit: Symnlink lib/mysql-connector to gerrit deployment repo [puppet] - https://gerrit.wikimedia.org/r/548552 [23:45:43] (CR) Dzahn: "looks good.. just deploying it needs some manual step and avoid downtime" [puppet] - https://gerrit.wikimedia.org/r/548552 (owner: Paladox) [23:52:26] (CR) Paladox: "So the steps will be:" [puppet] - https://gerrit.wikimedia.org/r/548552 (owner: Paladox) [23:56:32] (CR) Alexandros Kosiaris: [C: +2] "Met Amir in person and asked about this, LGTM" [puppet] - https://gerrit.wikimedia.org/r/548242 (owner: Ladsgroup) [23:56:39] (PS2) Alexandros Kosiaris: Rotate Amir's SSH key [puppet] - https://gerrit.wikimedia.org/r/548242 (owner: Ladsgroup) [23:57:57] (PS2) Dzahn: gerrit: ensure tmp dir under review_site exists [puppet] - https://gerrit.wikimedia.org/r/548550 (https://phabricator.wikimedia.org/T176774) [23:57:59] (PS1) Dzahn: gerrit: refactor, move java setup to separate class [puppet] - https://gerrit.wikimedia.org/r/548554 [23:58:59] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:59:10] (CR) jerkins-bot: [V: -1] gerrit: refactor, move java setup to separate class [puppet] - https://gerrit.wikimedia.org/r/548554 (owner: Dzahn) [23:59:37] (CR) Dzahn: "Everything is in the jetty class which isn't even related to the webserver setup. Start to refactor things and move some stuff into their " [puppet] - https://gerrit.wikimedia.org/r/548554 (owner: Dzahn)