[00:48:17] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 59.61 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:51:45] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 79.62 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [01:08:01] 10Operations, 10Traffic, 10observability: Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey - https://phabricator.wikimedia.org/T239039 (10CDanis) [01:09:22] 10Operations, 10Traffic, 10observability: Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey - https://phabricator.wikimedia.org/T239039 (10CDanis) [01:49:06] 10Operations, 10Traffic: ATS logs aren't being rotated - https://phabricator.wikimedia.org/T238724 (10Vgutierrez) Even though moving (mv) the file manually works as expected without a reload, the rotation triggered by logrotate isn't forcing ATS to open a new file: ` -rw-r--r-- 1 trafficserver trafficserver... 
[01:52:09] (03PS1) 10Herron: install_server: add logstash 7 vms [puppet] - 10https://gerrit.wikimedia.org/r/552676 [02:07:08] (03PS1) 10Herron: icinga: disable notifications on logstash 7 hosts [puppet] - 10https://gerrit.wikimedia.org/r/552677 [02:09:35] (03CR) 10Herron: [C: 03+2] icinga: disable notifications on logstash 7 hosts [puppet] - 10https://gerrit.wikimedia.org/r/552677 (owner: 10Herron) [02:10:06] (03PS2) 10Herron: icinga: disable notifications on logstash 7 hosts [puppet] - 10https://gerrit.wikimedia.org/r/552677 [02:14:22] (03CR) 10Herron: [C: 03+2] install_server: add logstash 7 vms [puppet] - 10https://gerrit.wikimedia.org/r/552676 (owner: 10Herron) [02:18:30] 10Operations, 10Traffic: ATS logs aren't being rotated - https://phabricator.wikimedia.org/T238724 (10Vgutierrez) upon manual removal of the empty error.log file, trafficserver creates a new one (without issuing a reload): ` -rw-r--r-- 1 trafficserver trafficserver 2.8K Nov 25 02:17 error.log -rw-r--r-- 1 tr... [02:23:36] (03PS1) 10Vgutierrez: ATS: Prevent logrotate from creating empty log files [puppet] - 10https://gerrit.wikimedia.org/r/552678 (https://phabricator.wikimedia.org/T238724) [02:35:15] (03PS2) 10Vgutierrez: ATS: Prevent logrotate from creating empty log files [puppet] - 10https://gerrit.wikimedia.org/r/552678 (https://phabricator.wikimedia.org/T238724) [02:37:55] (03CR) 10Vgutierrez: "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1001/19573/" [puppet] - 10https://gerrit.wikimedia.org/r/552678 (https://phabricator.wikimedia.org/T238724) (owner: 10Vgutierrez) [02:47:42] 10Operations, 10ops-esams, 10Traffic: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (10Vgutierrez) [02:59:24] !log depooling & power-cycling cp3053 - T239041 [02:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:59:29] T239041: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 [03:00:06] !log vgutierrez@puppetmaster1001 conftool 
action : set/pooled=no; selector: name=cp3053.esams.wmnet [03:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:02:26] RECOVERY - MariaDB Slave Lag: s8 on db2083 is OK: OK slave_sql_lag Replication lag: 0.28 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [03:03:24] 10Operations, 10ops-esams, 10Traffic: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (10Vgutierrez) p:05Triage→03Normal [03:05:59] RECOVERY - Host cp3053 is UP: PING OK - Packet loss = 0%, RTA = 83.43 ms [03:09:09] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Vgutierrez) [03:09:30] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Vgutierrez) [03:09:33] 10Operations, 10ops-esams, 10Traffic: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (10Vgutierrez) [03:13:29] !log repooling cp3053 - T239041 [03:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:13:34] T239041: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 [03:14:30] 10Operations, 10ops-esams, 10Traffic: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez Nothing on the logs or on SEL [03:14:32] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Vgutierrez) [03:21:03] (03PS1) 10Vgutierrez: acme_chief: Revoke access from netmon boxes to netbox certificate [puppet] - 10https://gerrit.wikimedia.org/r/552680 (https://phabricator.wikimedia.org/T238919) [03:38:01] (03CR) 10Vgutierrez: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/552398 (https://phabricator.wikimedia.org/T238900) (owner: 10Vgutierrez) [04:11:52] PROBLEM - MegaRAID on analytics1057 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy 
WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:21:56] RECOVERY - MegaRAID on analytics1057 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:38:38] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.417 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [04:41:44] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [04:51:54] https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Slowdown_on_WP [04:52:07] Complaints of general slowness [04:53:32] PROBLEM - MegaRAID on analytics1057 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:04:14] RECOVERY - MegaRAID on analytics1057 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:20:32] 10Operations, 10DNS, 10Internet-Archive, 10Traffic, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216 (10Aklapper) > jcrespo changed the task status from Open to Stalled. What exactly is this task [stalled](https://w... 
[05:53:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P9725 and previous config saved to /var/cache/conftool/dbconfig/20191125-055305-marostegui.json [05:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:57:05] 10Operations, 10DBA: db2125 crashed - https://phabricator.wikimedia.org/T239042 (10Marostegui) p:05Triage→03Normal [05:58:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2125 - crashed T239042', diff saved to https://phabricator.wikimedia.org/P9726 and previous config saved to /var/cache/conftool/dbconfig/20191125-055813-marostegui.json [05:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:19] T239042: db2125 crashed - https://phabricator.wikimedia.org/T239042 [05:59:40] ACKNOWLEDGEMENT - SSH on db2125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Marostegui T239042 https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:00:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P9727 and previous config saved to /var/cache/conftool/dbconfig/20191125-060011-marostegui.json [06:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:24] 10Operations, 10DBA: db2125 crashed - https://phabricator.wikimedia.org/T239042 (10Marostegui) There are no hardware logs: ` /admin1-> racadm getsel Record: 1 Date/Time: 07/12/2019 21:38:11 Source: system Severity: Ok Description: Log cleared. ----------------------------------------------------... 
[06:07:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P9728 and previous config saved to /var/cache/conftool/dbconfig/20191125-060728-marostegui.json [06:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:24] PROBLEM - MegaRAID on analytics1057 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:13:45] 10Operations, 10DBA: db2125 crashed - https://phabricator.wikimedia.org/T239042 (10Marostegui) The first traces of crash are: ` Nov 23 08:25:35 db2125 mysqld[13682]: InnoDB: Warning: a long semaphore wait: Nov 23 08:25:35 db2125 mysqld[13682]: --Thread 139387736135424 has waited at row0purge.cc line 772 for 24... [06:15:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P9729 and previous config saved to /var/cache/conftool/dbconfig/20191125-061542-marostegui.json [06:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1104 - schema change', diff saved to https://phabricator.wikimedia.org/P9730 and previous config saved to /var/cache/conftool/dbconfig/20191125-061629-marostegui.json [06:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:16] 10Operations, 10DBA: db2125 crashed - https://phabricator.wikimedia.org/T239042 (10Marostegui) More logs from the console: ` [10086760.709402] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [kworker/u480:0:6] [10086764.636175] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [sshd:107965] [1008676... 
[06:18:35] !log racadm serveraction hardreset on db2125 T239042 [06:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:39] T239042: db2125 crashed - https://phabricator.wikimedia.org/T239042 [06:21:42] RECOVERY - SSH on db2125 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:22:15] !log Compress db2094:3318 [06:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:14] !log Compress db2082 [06:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:09] !log Compress db2080 [06:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:55] 10Operations, 10DBA: db2125 crashed - https://phabricator.wikimedia.org/T239042 (10Marostegui) Nothing apart from this on OS logs: ` Nov 23 08:17:09 db2125 systemd[1]: Started Time & Date Service. Nov 23 08:18:01 db2125 CRON[107127]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var... [06:31:51] 10Operations, 10DBA: db2125 crashed - https://phabricator.wikimedia.org/T239042 (10Marostegui) I have extracted the controller logs....nothing showing up there. [06:34:32] 10Operations, 10ops-codfw, 10DBA: db2125 crashed - https://phabricator.wikimedia.org/T239042 (10Marostegui) a:03Papaul @Papaul can we upgrade firmware and BIOS on this host? It is a very new host, and if this crash happens again we might need to contact Dell. [06:37:16] (03PS1) 10Marostegui: db2134: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/552690 (https://phabricator.wikimedia.org/T238183) [06:37:40] marostegui: may I ask how db2125 crashed? [06:37:59] vgutierrez: I don't know, I am guessing storage crashed, but there is nothing on logs [06:38:01] marostegui: maybe no net, nothing on the KVM console and nothing on the logs? 
[06:38:10] vgutierrez: nothing [06:38:21] vgutierrez: from the MySQL logs, I think the storage crashed somehow [06:38:25] hmmm R440? [06:38:38] vgutierrez: yes [06:38:53] marostegui: https://phabricator.wikimedia.org/T238305 [06:39:01] dunno if it's related [06:39:10] but we are seeing something similar in esams brand new cp servers [06:39:47] vgutierrez: This host is relatively new too (from july) [06:39:56] vgutierrez: Going to add the task as a subtask [06:40:13] yeah, cp1077 there is also not brand new but relatively new as well [06:40:17] also a R440 [06:40:28] RECOVERY - MegaRAID on analytics1057 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:40:31] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Marostegui) db2125 crashed too and it is a new R440: {T239042} [06:40:50] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Marostegui) [06:41:07] 10Operations, 10ops-codfw, 10DBA: db2125 crashed - https://phabricator.wikimedia.org/T239042 (10Marostegui) [06:41:09] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Marostegui) [06:41:34] 10Operations, 10ops-codfw, 10DBA: db2125 crashed - https://phabricator.wikimedia.org/T239042 (10Marostegui) This can be related: T238305 [06:42:00] (03CR) 10Marostegui: [C: 03+2] db2134: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/552690 (https://phabricator.wikimedia.org/T238183) (owner: 10Marostegui) [06:43:36] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 53.92 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [06:52:08] RECOVERY - Varnish traffic drop between 30min ago and now at 
codfw on icinga1001 is OK: (C)60 le (W)70 le 71.29 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [06:57:57] (03PS1) 10Marostegui: db2125: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/552695 (https://phabricator.wikimedia.org/T239042) [07:00:06] (03CR) 10Marostegui: [C: 03+2] db2125: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/552695 (https://phabricator.wikimedia.org/T239042) (owner: 10Marostegui) [07:04:02] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 58.62 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [07:04:53] !log Upgrade db2134 [07:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:37] (03PS1) 10Marostegui: mariadb: Promote db2134 to m3 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/552699 (https://phabricator.wikimedia.org/T238183) [07:07:25] (03PS8) 10DannyS712: abusefilter.php: Remove settings that duplicate defaults, and clean up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552610 (https://phabricator.wikimedia.org/T238965) [07:07:28] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 74.34 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [07:11:01] 10Operations, 10ops-eqiad, 10Analytics: analytics1057's BBU is faulty - https://phabricator.wikimedia.org/T239045 (10elukey) [07:13:31] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2134 to m3 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/552699 (https://phabricator.wikimedia.org/T238183) (owner: 10Marostegui) [07:22:15] !log Compress 
db2090 [07:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:31] 10Operations, 10DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [07:28:43] 10Operations, 10DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [07:50:04] (03CR) 10Marostegui: [C: 03+1] racktables: use codfw database when in codfw [puppet] - 10https://gerrit.wikimedia.org/r/552553 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [07:51:13] (03CR) 10Marostegui: [C: 03+1] iegreview app: use codfw database when in codfw [puppet] - 10https://gerrit.wikimedia.org/r/552552 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [07:51:37] (03CR) 10Marostegui: [C: 03+1] wikimania_scholarships app: use codfw database when in codfw [puppet] - 10https://gerrit.wikimedia.org/r/552551 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [07:57:30] (03PS1) 10ArielGlenn: properly handle mtime lookup for dumps log exception checker [puppet] - 10https://gerrit.wikimedia.org/r/552707 [07:59:03] (03PS1) 10Marostegui: mariadb: Remove db1067 [puppet] - 10https://gerrit.wikimedia.org/r/552708 (https://phabricator.wikimedia.org/T238297) [08:00:04] (03PS1) 10Marostegui: wmnet: Remove production DNS for db1067 [dns] - 10https://gerrit.wikimedia.org/r/552709 (https://phabricator.wikimedia.org/T238297) [08:04:06] (03PS6) 10Muehlenhoff: Enable ldap-corp1001/2001 as additional LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/539150 [08:10:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [08:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [08:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:06] (03CR) 10Marostegui: [C: 03+2] mariadb: Remove db1067 [puppet] - 
10https://gerrit.wikimedia.org/r/552708 (https://phabricator.wikimedia.org/T238297) (owner: 10Marostegui) [08:11:46] (03CR) 10Marostegui: [C: 03+2] wmnet: Remove production DNS for db1067 [dns] - 10https://gerrit.wikimedia.org/r/552709 (https://phabricator.wikimedia.org/T238297) (owner: 10Marostegui) [08:12:47] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1067.eqiad.wmnet - https://phabricator.wikimedia.org/T238297 (10Marostegui) a:05Marostegui→03Jclark-ctr [08:13:00] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1067.eqiad.wmnet - https://phabricator.wikimedia.org/T238297 (10Marostegui) Host ready for #dc-ops steps [08:13:15] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [08:14:42] (03CR) 10Effie Mouzeli: "> We could create a generic profile which installs the perftools by" [puppet] - 10https://gerrit.wikimedia.org/r/550833 (https://phabricator.wikimedia.org/T236048) (owner: 10Effie Mouzeli) [08:17:51] 10Operations, 10DNS, 10Internet-Archive, 10Traffic, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216 (10jcrespo) @Aklapper an answer to T99216#2057570 [08:19:35] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/552506 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [08:25:45] PROBLEM - MegaRAID on analytics1057 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:27:50] (03PS1) 10Giuseppe Lavagetto: Add server_name, override settings [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/552763 [08:27:57] (03PS1) 10Marostegui: db2065: Disable 
notifications [puppet] - 10https://gerrit.wikimedia.org/r/552764 (https://phabricator.wikimedia.org/T239046) [08:29:57] (03CR) 10Marostegui: [C: 03+2] db2065: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/552764 (https://phabricator.wikimedia.org/T239046) (owner: 10Marostegui) [08:31:24] (03CR) 10Vgutierrez: [C: 03+1] Add server_name, override settings [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/552763 (owner: 10Giuseppe Lavagetto) [08:32:11] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add server_name, override settings [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/552763 (owner: 10Giuseppe Lavagetto) [08:33:29] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 54.67 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:39:00] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki) [08:39:18] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2125 crashed - https://phabricator.wikimedia.org/T239042 (10Marostegui) For what it is worth, this is the kernel this host is running at the moment (I have not upgraded it since the crash): ` root@db2125:~# uname -a Linux db2125 4.9.0-11-amd64 #1 SMP Deb... 
[08:39:39] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Marostegui) [08:40:07] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki) p:05Triage→03Normal [08:42:17] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 81.75 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:43:49] (03PS1) 10Filippo Giunchedi: prometheus: support configcluster and configcluster_stretch [puppet] - 10https://gerrit.wikimedia.org/r/552765 (https://phabricator.wikimedia.org/T238791) [08:47:40] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "a bit ugly, but should work for now. I'll fix it back when we've transitioned." [puppet] - 10https://gerrit.wikimedia.org/r/552765 (https://phabricator.wikimedia.org/T238791) (owner: 10Filippo Giunchedi) [08:48:39] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: support configcluster and configcluster_stretch [puppet] - 10https://gerrit.wikimedia.org/r/552765 (https://phabricator.wikimedia.org/T238791) (owner: 10Filippo Giunchedi) [08:53:58] <_joe_> !log rebuilding base docker images docker-registry.wikimedia.org/wikimedia-{jessie,stretch,buster} [08:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:09] 10Operations, 10serviceops: Appservers rising GET latency might have triggered LVS pages - https://phabricator.wikimedia.org/T238973 (10fgiunchedi) [08:54:12] 10Operations, 10Traffic, 10serviceops: Increased latency in appservers - 22 Nov 2019 - https://phabricator.wikimedia.org/T238939 (10fgiunchedi) [08:54:35] 10Operations, 10serviceops: Appservers rising GET latency might have triggered LVS pages - https://phabricator.wikimedia.org/T238973 (10fgiunchedi) The cause was indeed appservers latency, resolving in favor of 
T238939 [08:55:20] 10Operations, 10serviceops: Appservers rising GET latency might have triggered LVS pages - https://phabricator.wikimedia.org/T238973 (10Joe) I find it hard to believe this is the case. Text-lb checks request a cached url, so the backend latency should not matter. [08:56:59] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Marostegui) [09:03:53] (03PS1) 10Filippo Giunchedi: Fix invalid metric name for pdns_tcp4_queries [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/552767 [09:10:14] (03CR) 10Filippo Giunchedi: "The current version is causing this in Prometheus logs:" [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/552767 (owner: 10Filippo Giunchedi) [09:11:06] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/552767 (owner: 10Filippo Giunchedi) [09:13:17] !log installing python2.7 updates on buster [09:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:34] (03CR) 10Ema: [C: 03+2] Revert "vcl: move XWD pass logic to wm_common" [puppet] - 10https://gerrit.wikimedia.org/r/552507 (https://phabricator.wikimedia.org/T233768) (owner: 10Ema) [09:16:37] (03CR) 10Ema: [C: 03+2] cache: do not cache noc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/552508 (https://phabricator.wikimedia.org/T233768) (owner: 10Ema) [09:17:02] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@db43901]: T238822 [09:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:46] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Gilles) [09:20:51] (03CR) 10DCausse: [C: 03+1] wdqs: remove the ban of Guzzle user agent. 
[puppet] - 10https://gerrit.wikimedia.org/r/552540 (owner: 10Gehel) [09:23:26] RECOVERY - MegaRAID on analytics1057 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:25:25] (03PS2) 10Gehel: wdqs: remove the ban of Guzzle user agent. [puppet] - 10https://gerrit.wikimedia.org/r/552540 [09:26:11] (03CR) 10Filippo Giunchedi: [C: 03+2] Fix invalid metric name for pdns_tcp4_queries [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/552767 (owner: 10Filippo Giunchedi) [09:26:19] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Fix invalid metric name for pdns_tcp4_queries [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/552767 (owner: 10Filippo Giunchedi) [09:26:24] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/552680 (https://phabricator.wikimedia.org/T238919) (owner: 10Vgutierrez) [09:28:34] <_joe_> !log building and publishing updated images for envoy [09:28:37] (03CR) 10Gehel: [C: 03+2] wdqs: remove the ban of Guzzle user agent. 
[puppet] - 10https://gerrit.wikimedia.org/r/552540 (owner: 10Gehel) [09:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:10] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@db43901]: T238822 (duration: 13m 08s) [09:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1104 after schema change', diff saved to https://phabricator.wikimedia.org/P9731 and previous config saved to /var/cache/conftool/dbconfig/20191125-093038-marostegui.json [09:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:22] (03PS1) 10Giuseppe Lavagetto: Fix badly formatted changelog entry, typo [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/552769 [09:31:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1126 - schema change', diff saved to https://phabricator.wikimedia.org/P9732 and previous config saved to /var/cache/conftool/dbconfig/20191125-093157-marostegui.json [09:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:39] (03CR) 10Volans: [C: 03+1] "Thanks for the fix" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/552769 (owner: 10Giuseppe Lavagetto) [09:32:43] !log installing systemd security/bugfix updates on buster [09:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:07] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Fix badly formatted changelog entry, typo [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/552769 (owner: 10Giuseppe Lavagetto) [09:41:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] labmon: add compatibility in buster [puppet] - 10https://gerrit.wikimedia.org/r/552107 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [09:45:02] !log installing cron updates from buster point release [09:45:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:37] 10Operations, 10serviceops: Appservers rising GET latency might have triggered LVS pages - https://phabricator.wikimedia.org/T238973 (10fgiunchedi) 05duplicate→03Open >>! In T238973#5688257, @Joe wrote: > I find it hard to believe this is the case. Text-lb checks request a cached url, so the backend latenc... [09:45:55] (03PS2) 10Giuseppe Lavagetto: blubberoid: add telemetry collection support for envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/549837 (https://phabricator.wikimedia.org/T237234) [09:45:57] (03PS1) 10Giuseppe Lavagetto: Add private stub to eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/552771 [09:52:33] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add private stub to eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/552771 (owner: 10Giuseppe Lavagetto) [09:53:16] 10Operations: Integrate Buster 10.2 point update - https://phabricator.wikimedia.org/T238519 (10MoritzMuehlenhoff) [09:54:42] 10Operations: Integrate Buster 10.2 point update - https://phabricator.wikimedia.org/T238519 (10MoritzMuehlenhoff) [09:54:45] 10Operations, 10Puppet, 10User-jbond: Add method to admin module ci to detect removed users - https://phabricator.wikimedia.org/T239070 (10jbond) [09:55:30] PROBLEM - MegaRAID on analytics1057 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:56:27] (03CR) 10Jbond: [C: 04-1] admins: add Max Semenik as ldap_only_admin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) (owner: 10Dzahn) [09:56:53] downtimed for a week -^ [09:59:32] (03CR) 10Giuseppe Lavagetto: [C: 03+2] 
"https://puppet-compiler.wmflabs.org/compiler1003/19574/deploy1001.eqiad.wmnet/ it compiles, and gives the correct result. I'm merging the " [puppet] - 10https://gerrit.wikimedia.org/r/549872 (https://phabricator.wikimedia.org/T237234) (owner: 10Giuseppe Lavagetto) [09:59:44] (03CR) 10Giuseppe Lavagetto: [C: 03+2] prometheus: add scraping of k8s envoy sidecars [puppet] - 10https://gerrit.wikimedia.org/r/549871 (https://phabricator.wikimedia.org/T237234) (owner: 10Giuseppe Lavagetto) [10:00:09] (03PS5) 10Ayounsi: Add virtual-chassis support [software/homer] - 10https://gerrit.wikimedia.org/r/550367 [10:08:01] (03CR) 10Jbond: "mostly look fine however as its a new role it will need approval in Mondays meeting. would also be nice to have a pointer to the init sc" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn) [10:09:28] (03CR) 10Muehlenhoff: admins: add Max Semenik as ldap_only_admin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) (owner: 10Dzahn) [10:09:52] (03Abandoned) 10Muehlenhoff: Remove now obsolete openstack-jessie-bpo filter [puppet] - 10https://gerrit.wikimedia.org/r/549814 (owner: 10Muehlenhoff) [10:21:02] (03CR) 10Filippo Giunchedi: [C: 03+1] labmon: add compatibility in buster [puppet] - 10https://gerrit.wikimedia.org/r/552107 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [10:22:05] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Campsite, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10Ladsgroup) Given T238901#5687813 it seems it's fixed. [10:26:31] (03PS1) 10Giuseppe Lavagetto: prometheus::snmp_exporter: rationalize hiera calls [puppet] - 10https://gerrit.wikimedia.org/r/552774 [10:28:17] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/19575/ shows this is a noop." 
[puppet] - https://gerrit.wikimedia.org/r/552774 (owner: Giuseppe Lavagetto) [10:30:04] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191125T1030). [10:30:44] (CR) Nikerabbit: [C: -1] "Do we need the cxnonbeta list at all? I think we could just set" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318) (owner: KartikMistry) [10:31:32] (CR) Jbond: "Hi All, any objections to moving forward with this? can i get some +1's?" [puppet] - https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T237362) (owner: Jbond) [10:33:26] (CR) Filippo Giunchedi: [C: +1] prometheus::snmp_exporter: rationalize hiera calls [puppet] - https://gerrit.wikimedia.org/r/552774 (owner: Giuseppe Lavagetto) [10:38:14] RECOVERY - MegaRAID on analytics1057 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:41:17] Operations, observability: Write ulogd logs to a dedicated logfile - https://phabricator.wikimedia.org/T238414 (fgiunchedi) FWIW I'm ok with doing whichever is easiest, IIRC we can ship to kafka first and then add rules to log to a separate file. 
[10:54:35] (PS7) KartikMistry: Enable CX out of beta for newly created WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/548730 (https://phabricator.wikimedia.org/T234318) [10:57:00] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 51 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:58:33] (CR) Giuseppe Lavagetto: [C: +2] prometheus::snmp_exporter: rationalize hiera calls [puppet] - https://gerrit.wikimedia.org/r/552774 (owner: Giuseppe Lavagetto) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191125T1100). [11:00:04] daimona and Tpt: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:13] o/ [11:00:20] o/ [11:00:22] I can SWAT today! [11:00:29] Noice [11:00:39] Hi! [11:01:08] (CR) Urbanecm: [C: +2] Properly configures the Wikisource extension on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/551254 (https://phabricator.wikimedia.org/T239034) (owner: Tpt) [11:01:23] Tpt[m]: +2'ed your patch, as soon as it is merged, it will be automatically deployed [11:01:38] Urbanecm: Thank you! 
[11:01:52] (Merged) jenkins-bot: Properly configures the Wikisource extension on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/551254 (https://phabricator.wikimedia.org/T239034) (owner: Tpt) [11:02:10] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 76.76 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [11:03:24] (CR) Urbanecm: [C: +2] Allow enwikiversity interface admins to remove their own interface administratorship [mediawiki-config] - https://gerrit.wikimedia.org/r/552615 (https://phabricator.wikimedia.org/T238967) (owner: DannyS712) [11:03:44] Operations, Wikimedia-Mailing-lists: Create OpenGLAM mailing list - https://phabricator.wikimedia.org/T238759 (SandraF_WMF) Thank you! Much appreciated 😀 [11:04:12] (CR) Volans: [C: +1] "No blocker for me, I'd like to see more buy-in by other stakeholders too." 
[puppet] - https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T237362) (owner: Jbond) [11:04:15] (Merged) jenkins-bot: Allow enwikiversity interface admins to remove their own interface administratorship [mediawiki-config] - https://gerrit.wikimedia.org/r/552615 (https://phabricator.wikimedia.org/T238967) (owner: DannyS712) [11:05:17] (PS1) Elukey: profile::mariadb::misc::eventlogging::database: set db to read only [puppet] - https://gerrit.wikimedia.org/r/552776 (https://phabricator.wikimedia.org/T234826) [11:06:05] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 9394f1f: Allow enwikiversity interface admins to remove their own interface administratorship (T238967) (duration: 00m 57s) [11:06:06] Urbanecm: please ping me when my patch is ready, I'm dealing with like 4 bugs at the same time [11:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:11] T238967: Allow enwikiversity interface admins to remove their own interface administratorship - https://phabricator.wikimedia.org/T238967 [11:06:22] Daimona: ack, I'm waiting for CI now [11:06:29] ty [11:09:27] (CR) Urbanecm: [C: +2] Add throttle rule for WMCL Editathon 2019-12-07 [mediawiki-config] - https://gerrit.wikimedia.org/r/552640 (https://phabricator.wikimedia.org/T238986) (owner: Zoranzoki21) [11:10:13] (Merged) jenkins-bot: Add throttle rule for WMCL Editathon 2019-12-07 [mediawiki-config] - https://gerrit.wikimedia.org/r/552640 (https://phabricator.wikimedia.org/T238986) (owner: Zoranzoki21) [11:11:16] (PS1) Giuseppe Lavagetto: deployment_server::helmfile: use the correct prometheus nodes [puppet] - https://gerrit.wikimedia.org/r/552777 [11:11:57] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: SWAT: 4670d1d: Add throttle rule for WMCL Editathon 2019-12-07 (T238986) (duration: 00m 53s) [11:12:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:02] T238986: Lift IP limit - WMCL Editathon 2019-12-07 - https://phabricator.wikimedia.org/T238986 [11:12:08] Daimona: can you test it within a minute? [11:12:37] Urbanecm: sure [11:12:42] In like 10 seconds actually ahah [11:13:04] Daimona: scap pull just finished, mwdebug1001 [11:13:40] Yay, works [11:13:42] Thanks [11:14:19] (CR) jerkins-bot: [V: -1] deployment_server::helmfile: use the correct prometheus nodes [puppet] - https://gerrit.wikimedia.org/r/552777 (owner: Giuseppe Lavagetto) [11:14:31] Daimona: thx! [11:15:15] (PS2) Giuseppe Lavagetto: deployment_server::helmfile: use the correct prometheus nodes [puppet] - https://gerrit.wikimedia.org/r/552777 [11:15:23] Daimona: syncing [11:16:24] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/AbuseFilter/extension.json: SWAT: 29a16bd: Restrict viewing Special:Log/AbuseFilter, and remove from recent changes (T34959) (duration: 01m 04s) [11:16:31] Daimona: synced! [11:16:33] !log EU SWAT done [11:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:39] T34959: Private filters should not be visible in recent changes - https://phabricator.wikimedia.org/T34959 [11:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:01] Confirmed working, thanks again [11:17:12] happy to help! 
[11:19:22] PROBLEM - Apache HTTP on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:19:36] PROBLEM - PHP7 rendering on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:19:38] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:20:02] PROBLEM - PHP7 rendering on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:20:16] ^ checking [11:20:34] PROBLEM - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [11:20:40] PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:20:44] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [11:20:50] lovely [11:20:54] RECOVERY - Apache HTTP on mw1341 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:20:58] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: 
/{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:21:00] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:21:00] (PS7) Muehlenhoff: Enable ldap-corp1001/2001 as additional LDAP replicas [puppet] - https://gerrit.wikimedia.org/r/539150 [11:21:06] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:21:10] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [11:21:20] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:21:44] RECOVERY - PHP7 rendering on mw1341 is OK: HTTP OK: HTTP/1.1 200 OK - 79036 bytes in 7.813 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:22:12] RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:22:16] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:22:16] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:22:44] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:22:49] effie: need help? 
[11:22:54] RECOVERY - PHP7 rendering on mw1314 is OK: HTTP OK: HTTP/1.1 200 OK - 79034 bytes in 0.972 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:22:56] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:23:24] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:23:34] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:23:54] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:24:04] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [11:24:08] (CR) Marostegui: [C: +1] "Reminder: you have to either restart mysql or enable read-only directly on mysql with: set global read_only=1;" [puppet] - https://gerrit.wikimedia.org/r/552776 (https://phabricator.wikimedia.org/T234826) (owner: Elukey) [11:24:20] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:24:36] PROBLEM - High average POST latency 
for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [11:24:36] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:24:48] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [11:24:48] received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:24:52] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [11:25:28] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:25:28] PROBLEM - Nginx local proxy to apache on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [11:25:36] PROBLEM - Nginx local proxy to apache on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:25:44] RECOVERY - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [11:25:44] PROBLEM - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [11:25:45] (CR) Jbond: [C: +2] "PCC https://puppet-compiler.wmflabs.org/compiler1001/19579/icinga1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/552512 (owner: Jbond) [11:25:58] PROBLEM - Nginx local proxy to apache on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:26:28] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:26:34] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [11:27:22] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:26] RECOVERY - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [11:27:36] RECOVERY - Nginx local proxy to apache on mw1313 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 4.381 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:27:54] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content 
for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:28:12] PROBLEM - Apache HTTP on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:28:18] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:29:08] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:29:36] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:29:47] (PS3) Giuseppe Lavagetto: deployment_server::helmfile: use the correct prometheus nodes [puppet] - https://gerrit.wikimedia.org/r/552777 [11:29:54] RECOVERY - Apache HTTP on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.603 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:29:56] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:30:16] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:30:18] PROBLEM - Restbase 
LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [11:30:30] RECOVERY - Nginx local proxy to apache on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 3.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:30:50] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [11:31:12] PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:31:14] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:31:40] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:31:44] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [11:31:51] !log restart php-fpm on mw1314 [11:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:56] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [11:31:56] RECOVERY - restbase endpoints health on restbase1021 is OK: All 
endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:32:14] PROBLEM - PHP7 rendering on mw1312 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1312 bytes in 2.133 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:32:16] PROBLEM - PHP7 rendering on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:32:17] (CR) Elukey: "hey John, so the original patch is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552304/." [puppet] - https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: Dzahn) [11:32:26] RECOVERY - Nginx local proxy to apache on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.256 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:32:46] RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.207 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:33:02] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [11:33:08] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:33:24] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [11:33:36] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response 
was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [11:33:56] RECOVERY - PHP7 rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 79034 bytes in 0.205 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:34:44] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:34:48] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:34:58] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [11:35:20] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [11:36:10] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [11:36:26] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:36:32] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [11:36:34] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:37:14] 
PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [11:37:30] RECOVERY - PHP7 rendering on mw1316 is OK: HTTP OK: HTTP/1.1 200 OK - 79035 bytes in 8.708 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:37:48] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:37:50] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [11:38:14] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:38:14] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:38:34] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:38:34] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:38:42] PROBLEM - Nginx local proxy to apache on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:38:54] RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [11:39:00] PROBLEM - Restbase LVS codfw on 
restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [11:39:08] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:39:28] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:39:58] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:39:58] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:40:02] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:40:06] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:40:16] RECOVERY - Nginx local proxy to apache on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:40:40] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/RESTBase [11:40:44] !log cumin -b 2 -s 10 restart php on API servers [11:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:48] PROBLEM - Apache HTTP on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:41:37] Operations, Performance-Team, Traffic, Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (Gilles) Noteworthy graph found by @ema: https://grafana.wikimedia.org/d/w4TRwaxZz/local-backend-hitrate-varnish-vs-ats?panelId=4&... [11:41:38] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:41:40] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:41:42] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:41:52] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.02917 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [11:41:58] PROBLEM - PHP7 rendering on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:42:26] RECOVERY - Apache HTTP on mw1346 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 4.612 
second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:42:28] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:42:46] PROBLEM - Nginx local proxy to apache on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:43:22] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:43:28] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out bef [11:43:28] s received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:43:32] PROBLEM - PHP7 rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:43:48] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:43:48] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:44:06] 
RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:44:18] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:44:18] RECOVERY - Nginx local proxy to apache on mw1313 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.463 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:44:38] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:45:04] RECOVERY - PHP7 rendering on mw1313 is OK: HTTP OK: HTTP/1.1 200 OK - 79034 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:45:14] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [11:45:18] RECOVERY - PHP7 rendering on mw1316 is OK: HTTP OK: HTTP/1.1 200 OK - 79036 bytes in 3.230 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:45:24] (03PS1) 10Arturo Borrero Gonzalez: protmeheus: haproxy: add support for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/552789 (https://phabricator.wikimedia.org/T237643) [11:46:21] (03PS2) 10Arturo Borrero Gonzalez: protmeheus: haproxy: add support for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/552789 (https://phabricator.wikimedia.org/T237643) [11:47:00] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.675 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [11:47:12] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:47:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1126 after schema change', diff saved to https://phabricator.wikimedia.org/P9734 and previous config saved to /var/cache/conftool/dbconfig/20191125-114733-marostegui.json [11:47:36] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase main traffic weight for db1126', diff saved to https://phabricator.wikimedia.org/P9735 and previous config saved to 
/var/cache/conftool/dbconfig/20191125-114821-marostegui.json [11:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:33] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:48:36] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:48:50] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:48:58] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:49:32] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:49:56] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [11:49:59] 10Operations, 10Prod-Kubernetes, 10Pybal, 10Traffic, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10akosiaris) [11:50:07] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) 
- https://phabricator.wikimedia.org/T224585 (10Phamhi) [11:50:53] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) - https://phabricator.wikimedia.org/T224585 (10Phamhi) The scope of this request has been extended to OS rename as well. New OS hostname will be det... [11:50:56] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:55:30] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.004167 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [11:58:20] (03PS2) 10Muehlenhoff: Add image tracking support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) [12:00:00] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, minor thing inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552789 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez) [12:10:38] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 58.59 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:10:43] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] protmeheus: haproxy: add support for Debian Buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552789 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez) [12:11:33] (03PS1) 10Ayounsi: Depool esams for esams/knams work [dns] - 10https://gerrit.wikimedia.org/r/552792 [12:14:06] (03CR) 10Muehlenhoff: "That would work, but given that the Buster package provides a systemd unit in the package it seems 
better to only use the ERB version for " [puppet] - 10https://gerrit.wikimedia.org/r/552789 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez) [12:14:48] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/19582/" [puppet] - 10https://gerrit.wikimedia.org/r/552777 (owner: 10Giuseppe Lavagetto) [12:15:22] 10Operations: Add Daimona to #mediawiki_security - https://phabricator.wikimedia.org/T239093 (10Daimona) [12:17:26] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 48.1 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:25:48] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Revoke access from netmon boxes to netbox certificate [puppet] - 10https://gerrit.wikimedia.org/r/552680 (https://phabricator.wikimedia.org/T238919) (owner: 10Vgutierrez) [12:26:00] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 99.06 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:27:29] !log disable BGP to knams transits - T237031 [12:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:34] T237031: Bundle esams-knams links back - https://phabricator.wikimedia.org/T237031 [12:28:18] 10Operations, 10Traffic, 10Inuka-Team (Kanban), 10Patch-For-Review, 10Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (10AMuigai) Makes sense to me on both fronts @Neil_P._Quinn_WMF [12:29:35] (03PS1) 10Arturo Borrero Gonzalez: toolforge: new k8s: haproxy: include prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/552794 (https://phabricator.wikimedia.org/T237643) [12:30:23] (03PS1) 10Jbond: cas-icinga: 
add redirect back / => /icinga/ [puppet] - 10https://gerrit.wikimedia.org/r/552795 [12:31:05] (03CR) 10Gilles: "Has this been deployed? If not, le me know when you would like to do it." [puppet] - 10https://gerrit.wikimedia.org/r/519374 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles) [12:31:14] (03CR) 10Gilles: "Has this been deployed? If not, le me know when you would like to do it." [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/531204 (https://phabricator.wikimedia.org/T226707) (owner: 10Gilles) [12:37:20] (03PS3) 10Giuseppe Lavagetto: blubberoid: add telemetry collection support for envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/549837 (https://phabricator.wikimedia.org/T237234) [12:40:52] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki) [12:41:11] (03CR) 10Jbond: [C: 03+2] cas-icinga: add redirect back / => /icinga/ [puppet] - 10https://gerrit.wikimedia.org/r/552795 (owner: 10Jbond) [12:41:23] (03PS2) 10Jbond: cas-icinga: add redirect back / => /icinga/ [puppet] - 10https://gerrit.wikimedia.org/r/552795 [12:42:13] (03PS2) 10Arturo Borrero Gonzalez: toolforge: new k8s: haproxy: enable prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/552794 (https://phabricator.wikimedia.org/T237643) [12:42:31] !log bundle esams-knams links on esams side - T237031 [12:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:37] T237031: Bundle esams-knams links back - https://phabricator.wikimedia.org/T237031 [12:44:35] (03PS4) 10Giuseppe Lavagetto: blubberoid: add telemetry collection support for envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/549837 (https://phabricator.wikimedia.org/T237234) [12:45:00] 10Operations, 10ops-eqiad, 10DC-Ops: Duplicate cable label in cr1-eqiad/cr2-eqiad - https://phabricator.wikimedia.org/T239098 (10faidon) p:05Triage→03High [12:48:02] !log bundle esams-knams links on knams side 
- T237031 [12:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:07] T237031: Bundle esams-knams links back - https://phabricator.wikimedia.org/T237031 [12:48:18] (03CR) 10Giuseppe Lavagetto: [C: 03+2] blubberoid: add telemetry collection support for envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/549837 (https://phabricator.wikimedia.org/T237234) (owner: 10Giuseppe Lavagetto) [12:49:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: new k8s: haproxy: enable prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/552794 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez) [12:49:51] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: add TLS support for smokeping.wikimedia.org - https://phabricator.wikimedia.org/T238900 (10Vgutierrez) @volans @ayounsi IMHO it doesn't make any sense to include smokeping.wm.o SNI on the librenms certificate, that would set a dependency between otherw... [12:54:27] (03CR) 10BBlack: [C: 03+1] "Having fewer certs (with more SNIs in them, for shared purposes) doesn't seem worth trying to optimize for when cert management is so auto" [puppet] - 10https://gerrit.wikimedia.org/r/552398 (https://phabricator.wikimedia.org/T238900) (owner: 10Vgutierrez) [12:57:25] (03PS1) 10Giuseppe Lavagetto: blubberoid: revert to correct selector for service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/552799 [12:57:56] (03CR) 10Giuseppe Lavagetto: [C: 03+2] blubberoid: revert to correct selector for service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/552799 (owner: 10Giuseppe Lavagetto) [12:59:24] !log oblivian@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' . 
[12:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:58] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 58.69 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:02:56] !log oblivian@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' . [13:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:36] !log cleanup config on cr2-esams - T237031 [13:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:41] T237031: Bundle esams-knams links back - https://phabricator.wikimedia.org/T237031 [13:07:06] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 84.01 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:09:32] (03PS1) 10Giuseppe Lavagetto: blubberoid: use string for port value [deployment-charts] - 10https://gerrit.wikimedia.org/r/552802 [13:10:41] (03CR) 10Giuseppe Lavagetto: [C: 03+2] blubberoid: use string for port value [deployment-charts] - 10https://gerrit.wikimedia.org/r/552802 (owner: 10Giuseppe Lavagetto) [13:11:24] !log oblivian@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' . [13:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:00] (03PS1) 10Giuseppe Lavagetto: blubberoid: all annotations are supposed to be strings. [deployment-charts] - 10https://gerrit.wikimedia.org/r/552803 [13:14:27] (03CR) 10Giuseppe Lavagetto: [C: 03+2] blubberoid: all annotations are supposed to be strings. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/552803 (owner: 10Giuseppe Lavagetto) [13:15:34] !log oblivian@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' . [13:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:55] !log cleanup config on cr3-esams - T237031 [13:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:00] T237031: Bundle esams-knams links back - https://phabricator.wikimedia.org/T237031 [13:20:34] (03PS7) 10Jbond: puppetboard: Add cas authentication [puppet] - 10https://gerrit.wikimedia.org/r/552513 (https://phabricator.wikimedia.org/T238924) [13:20:36] (03PS2) 10Jbond: cas-puppetboard.wikimedia.org: add new cas protected puppetboard site [puppet] - 10https://gerrit.wikimedia.org/r/552536 (https://phabricator.wikimedia.org/T238924) [13:23:03] (03CR) 10Elukey: [C: 03+2] profile::mariadb::misc::eventlogging::database: set db to read only [puppet] - 10https://gerrit.wikimedia.org/r/552776 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [13:24:59] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10Cmjohnson) mw1239 is well out of warranty and is over 5 years old. Historically we decom these host at this stage in their life. We also have a several new MW servers waiting to be racke... [13:25:57] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T236601 (10Cmjohnson) 05Open→03Resolved Resolving this task for the failed raid, @gehel you may want to create a new one for the re-image. [13:26:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10MoritzMuehlenhoff) @Cmjohnson mw1239 will be decommed soon via https://phabricator.wikimedia.org/T239054, we can close this task. 
[13:27:23] !log set global read_only=1 on db1108's log database - T159170 [13:27:25] 10Operations, 10SRE-tools, 10netbox: Netbox reports Icinga checks timeout - https://phabricator.wikimedia.org/T237803 (10faidon) What's the status of this task? [13:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:27] T159170: Sunset MySQL data store for eventlogging - https://phabricator.wikimedia.org/T159170 [13:28:29] (03CR) 10Vgutierrez: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/552515 (owner: 10Alexandros Kosiaris) [13:31:40] (03CR) 10Andrew Bogott: profile::url_downloader: Add missing labs neutron subnet, also link-local (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552631 (owner: 10Alex Monk) [13:34:43] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: add TLS support for smokeping.wikimedia.org - https://phabricator.wikimedia.org/T238900 (10Volans) No problem for me for 1 cert, it seems a reasonable approach. [13:36:56] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T237362) (owner: 10Jbond) [13:37:24] (03PS5) 10Elukey: Create airflow-search-admins admin group [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn) [13:37:29] (03PS3) 10Jbond: cas-puppetboard.wikimedia.org: add new cas protected puppetboard site [puppet] - 10https://gerrit.wikimedia.org/r/552536 (https://phabricator.wikimedia.org/T238924) [13:38:36] (03PS6) 10Elukey: Create airflow-search-admins admin group [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn) [13:39:41] (03CR) 10Elukey: "Daniel/John: reduced the scope of the change and added in the commit description https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/54" [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn) 
[13:40:00] jbond42: --^ let me know if you think it is ok or not :) [13:41:05] looking [13:41:27] 10Operations, 10observability: Tune HTTP availability alerts - https://phabricator.wikimedia.org/T236367 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Thresholds adjusted for global availability and I've updated "frontend traffic" dashboard [13:41:44] (03PS1) 10Giuseppe Lavagetto: envoy-tls-local-proxy: fix configuration [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/552807 [13:47:21] 10Operations, 10observability, 10serviceops, 10Patch-For-Review: dropped packets to conf1004/5/6 2379/tcp - https://phabricator.wikimedia.org/T238791 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Fixed! [13:47:53] 10Operations, 10observability: Logstash doesn't parse ulogd source and destination ports - https://phabricator.wikimedia.org/T238416 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Looks like this is all done, resolving [13:49:31] 10Operations, 10ops-eqiad, 10DC-Ops: Asset tag remaining cablemgmt in eqiad - https://phabricator.wikimedia.org/T239110 (10faidon) [13:51:31] (03CR) 10Jbond: "Thanks luca for the update. 
I think this is all fine as the service runs as the airflow user (we could lock down the systemd script furth" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn) [13:52:15] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Volans) If needed, full list of R440 available here: https://puppetboard.wikimedia.org/fact/productname/PowerEdge+R440 (intentionally not mentioning their count here) [13:55:28] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] envoy-tls-local-proxy: fix configuration [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/552807 (owner: 10Giuseppe Lavagetto) [13:59:04] RECOVERY - snapshot of s3 in eqiad on db1115 is OK: snapshot for s3 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2019-11-25 12:44:21 from db1095.eqiad.wmnet:3313 (860 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [13:59:11] (03PS8) 10Jbond: puppetboard: Add cas authentication [puppet] - 10https://gerrit.wikimedia.org/r/552513 (https://phabricator.wikimedia.org/T238924) [13:59:57] (03PS1) 10Filippo Giunchedi: prometheus: use max5m for node_ipvs gauges [puppet] - 10https://gerrit.wikimedia.org/r/552810 (https://phabricator.wikimedia.org/T236700) [14:04:07] !log oblivian@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' . 
[14:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:37] (03PS7) 10Elukey: Create airflow-search-admins admin group [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn) [14:06:55] (03CR) 10Elukey: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn) [14:09:12] (03CR) 10BBlack: [C: 04-1] "For the traffic caches, we've standardized on X-Client-IP (XCIP) as a way for the traffic layer to single the original client IP address t" [puppet] - 10https://gerrit.wikimedia.org/r/552515 (owner: 10Alexandros Kosiaris) [14:09:43] (03CR) 10Jbond: "lgtm assuming authorised in monday meeting" [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn) [14:11:43] (03CR) 10CDanis: [C: 03+1] prometheus: use max5m for node_ipvs gauges [puppet] - 10https://gerrit.wikimedia.org/r/552810 (https://phabricator.wikimedia.org/T236700) (owner: 10Filippo Giunchedi) [14:12:41] (03CR) 10CDanis: [C: 03+1] puppet_ca: update puppet ca with a new certificate valid for 10 years [puppet] - 10https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T237362) (owner: 10Jbond) [14:13:21] (03PS1) 10Cmjohnson: Adding dns entries new ganeti hosts [dns] - 10https://gerrit.wikimedia.org/r/552812 (https://phabricator.wikimedia.org/T228924) [14:15:34] (03PS6) 10BBlack: authdns: refactor role/profile/hieradata bits [puppet] - 10https://gerrit.wikimedia.org/r/552346 (https://phabricator.wikimedia.org/T98006) [14:15:36] (03PS3) 10BBlack: Unify and simplify DNS server ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/552506 (https://phabricator.wikimedia.org/T98006) [14:15:38] (03PS1) 10BBlack: Move DNS server profiles under profile::dns:: [puppet] - 10https://gerrit.wikimedia.org/r/552814 (https://phabricator.wikimedia.org/T98006) [14:15:40] (03PS1) 10BBlack: Move DNS 
roles together under role::dns:: [puppet] - 10https://gerrit.wikimedia.org/r/552815 (https://phabricator.wikimedia.org/T98006) [14:16:38] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 86, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:16:50] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:18:32] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 51, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:20:57] that's expected I think, cc XioNoX [14:21:24] godog: curious, is there maintenance or else? [14:21:48] there is maintenance @ esams (and soon knams) indeed, mark is on-site [14:22:00] ahhh [14:22:02] ah yes, the downtime expired I'll extend it [14:22:03] our maintenance, not by our vendor(s) [14:22:14] super [14:22:16] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 58.17 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:22:21] (03CR) 10Ema: [C: 03+1] cas-puppetboard.wikimedia.org: add new cas protected puppetboard site [puppet] - 10https://gerrit.wikimedia.org/r/552536 (https://phabricator.wikimedia.org/T238924) (owner: 10Jbond) [14:23:22] !log upgrading OpenJDK 11 on an-conf* [14:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:22] (03PS1) 10Effie Mouzeli: admin: add jiji to analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/552818 [14:29:06] RECOVERY - Varnish traffic drop between 30min ago 
and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 71.45 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:29:07] 10Operations, 10DNS, 10Internet-Archive, 10Traffic, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216 (10Aklapper) Ah, thanks. But who exactly is supposed to answer that question? [14:30:24] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 86, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:30:36] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 82, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:31:23] (03PS9) 10Jbond: puppetboard: Add cas authentication [puppet] - 10https://gerrit.wikimedia.org/r/552513 (https://phabricator.wikimedia.org/T238924) [14:31:24] (03PS1) 10Jbond: puppetboard: update puppet board public cert to add cas sni [puppet] - 10https://gerrit.wikimedia.org/r/552819 (https://phabricator.wikimedia.org/T238924) [14:31:33] (03CR) 10Elukey: [C: 03+1] "LGTM, let's wait for Nuria's formal +1 to follow rules :)" [puppet] - 10https://gerrit.wikimedia.org/r/552818 (owner: 10Effie Mouzeli) [14:35:22] 10Operations, 10ops-esams, 10netops: setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 (10ayounsi) [14:37:26] !log Deploy schema change on s1 codfw (this will generate lag on codfw) - T234066 T233135 [14:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:32] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 [14:37:32] T234066: Schema change to rename user_newtalk indexes - 
https://phabricator.wikimedia.org/T234066 [14:37:50] (03PS1) 10Muehlenhoff: Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821 [14:37:53] (03CR) 10Jbond: [C: 03+2] puppetboard: Add cas authentication [puppet] - 10https://gerrit.wikimedia.org/r/552513 (https://phabricator.wikimedia.org/T238924) (owner: 10Jbond) [14:38:10] !log Remove triggers from archive table on s1 codfw sanitarium T234704 [14:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:14] T234704: Remove ar_comment from sanitarium triggers - https://phabricator.wikimedia.org/T234704 [14:39:10] 10Operations, 10Traffic, 10observability: Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey - https://phabricator.wikimedia.org/T239039 (10fgiunchedi) Took a quick look at the expression and the idea LGTM, thanks @CDanis. Also cc @ayounsi as the original implementor of the alert [14:39:45] (03PS1) 10Ema: Revert "cache: reimage cp3064 as text_ats" [puppet] - 10https://gerrit.wikimedia.org/r/552825 (https://phabricator.wikimedia.org/T238494) [14:39:58] (03CR) 10jerkins-bot: [V: 04-1] Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821 (owner: 10Muehlenhoff) [14:42:43] (03PS2) 10Ema: Revert "cache: reimage cp3064 as text_ats" [puppet] - 10https://gerrit.wikimedia.org/r/552825 (https://phabricator.wikimedia.org/T238494) [14:44:04] !log elukey@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper [14:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:45] zookeeper analytics cluster --^ [14:45:58] !log depool cp3064 and reimage with varnish-be T227432 [14:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:03] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [14:47:18] (03CR) 10Ema: [C: 03+2] Revert "cache: reimage cp3064 as text_ats" 
[puppet] - 10https://gerrit.wikimedia.org/r/552825 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [14:48:17] (03CR) 10Jbond: [C: 03+2] puppetboard: update puppet board public cert to add cas sni [puppet] - 10https://gerrit.wikimedia.org/r/552819 (https://phabricator.wikimedia.org/T238924) (owner: 10Jbond) [14:48:41] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3064.esams.wmnet'] ` The... [14:49:25] (03PS2) 10Muehlenhoff: Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821 [14:50:12] !log enable cr3-esams:et-1/0/0 - T236767 [14:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:17] T236767: cr3-esams:et-1/0/0 flap - https://phabricator.wikimedia.org/T236767 [14:50:31] !log elukey@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) [14:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:27] (03CR) 10BBlack: [C: 03+2] authdns: refactor role/profile/hieradata bits [puppet] - 10https://gerrit.wikimedia.org/r/552346 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [14:51:29] this spicerack thing seems working [14:51:32] :P [14:51:47] (03CR) 10Jbond: [C: 03+2] cas-puppetboard.wikimedia.org: add new cas protected puppetboard site [puppet] - 10https://gerrit.wikimedia.org/r/552536 (https://phabricator.wikimedia.org/T238924) (owner: 10Jbond) [14:52:20] (03CR) 10jerkins-bot: [V: 04-1] Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821 (owner: 10Muehlenhoff) [14:54:37] elukey: lol :D [14:55:01] (03PS3) 10Muehlenhoff: Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821 [14:55:05] (03CR) 10Ema: 
[C: 03+2] ATS: enable reload for global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/552201 (https://phabricator.wikimedia.org/T233274) (owner: 10Ema) [14:56:09] (03PS1) 10DCausse: [wdqs] add async-import option [puppet] - 10https://gerrit.wikimedia.org/r/552835 (https://phabricator.wikimedia.org/T238045) [14:56:11] (03PS1) 10DCausse: [wdqs] enable asynchronous imports on wdqs1004 [puppet] - 10https://gerrit.wikimedia.org/r/552836 (https://phabricator.wikimedia.org/T238045) [14:58:05] (03CR) 10jerkins-bot: [V: 04-1] Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821 (owner: 10Muehlenhoff) [14:58:53] (03PS1) 10Herron: logstash: create elk7 ES role and assign to elk7 ES hw hosts [puppet] - 10https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854) [14:59:26] (03CR) 10Jbond: [C: 03+2] cas-puppetboard.wikimedia.org: add record [dns] - 10https://gerrit.wikimedia.org/r/552503 (owner: 10Jbond) [14:59:34] (03PS2) 10Jbond: cas-puppetboard.wikimedia.org: add record [dns] - 10https://gerrit.wikimedia.org/r/552503 [15:00:42] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1083:9536,cp1085:9536} site=eqiad tunnel={cp3064_v4,cp3064_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:00:46] !log bblack@cumin1001 conftool action : set/weight=100; selector: name=cp3056.esams.wmnet,service=ats-be [15:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:50] (03CR) 10jerkins-bot: [V: 04-1] logstash: create elk7 ES role and assign to elk7 ES hw hosts [puppet] - 10https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [15:02:01] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp3056.esams.wmnet [15:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:18] (03CR) 
10DCausse: "ge" [puppet] - 10https://gerrit.wikimedia.org/r/552836 (https://phabricator.wikimedia.org/T238045) (owner: 10DCausse) [15:02:29] (03PS2) 10Herron: logstash: create elk7 ES role and assign to elk7 ES hw hosts [puppet] - 10https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854) [15:03:03] !log cp1075: ats-backend-restart to enable lua reload T233274 [15:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:08] T233274: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 [15:03:18] PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance=cp2023:9536 site=codfw tunnel={cp3064_v4,cp3064_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:04:34] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance=cp2023:9536 site=codfw tunnel={cp3064_v4,cp3064_v6} Ema reimaging 3064 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:04:34] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1083:9536,cp1085:9536} site=eqiad tunnel={cp3064_v4,cp3064_v6} Ema reimaging 3064 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:05:25] (03CR) 10jerkins-bot: [V: 04-1] logstash: create elk7 ES role and assign to elk7 ES hw hosts [puppet] - 10https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [15:05:26] (03PS1) 10Mholloway: Update wikifeeds to 2019-11-25-144622-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/552838 (https://phabricator.wikimedia.org/T238942) [15:05:41] (03PS4) 10Muehlenhoff: Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821 [15:07:11] (03CR) 
10Mholloway: [C: 03+2] Update wikifeeds to 2019-11-25-144622-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/552838 (https://phabricator.wikimedia.org/T238942) (owner: 10Mholloway) [15:07:29] (03Merged) 10jenkins-bot: Update wikifeeds to 2019-11-25-144622-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/552838 (https://phabricator.wikimedia.org/T238942) (owner: 10Mholloway) [15:07:41] (03CR) 10BBlack: [C: 03+2] Unify and simplify DNS server ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/552506 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [15:07:49] (03PS4) 10BBlack: Unify and simplify DNS server ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/552506 (https://phabricator.wikimedia.org/T98006) [15:08:44] (03CR) 10jerkins-bot: [V: 04-1] Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821 (owner: 10Muehlenhoff) [15:09:33] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [15:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:13] !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' . 
[15:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:56] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:11:09] !log cp1075: ats-tls-restart to enable lua reload T233274 [15:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:14] T233274: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 [15:11:21] (03PS5) 10Muehlenhoff: Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821 [15:11:39] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:11:40] !log mholloway-shell@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [15:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:29] !log mholloway-shell@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [15:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:22] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:14:22] !log gehel@cumin1001 START - Cookbook sre.hosts.downtime [15:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:28] RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:18:05] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3064.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3064.esams.wmnet... [15:18:35] !log gehel@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:33] !log cp-ats: rolling ats-{tls,backend} restart to enable lua reload T233274 [15:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:39] T233274: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 [15:21:52] (03PS2) 10DCausse: [wdqs] enable asynchronous imports on wdqs1004 [puppet] - 10https://gerrit.wikimedia.org/r/552836 (https://phabricator.wikimedia.org/T238045) [15:22:08] (03PS3) 10Marostegui: mariadb: Promote db1086 to s7 primary master [puppet] - 10https://gerrit.wikimedia.org/r/552381 (https://phabricator.wikimedia.org/T238044) [15:22:17] (03PS3) 10Marostegui: wmnet: Update s7-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/552382 (https://phabricator.wikimedia.org/T238044) [15:22:58] !log cp3064 manual reboot after wmf-auto-reimage error: 'Unable 
to run wmf-auto-reimage-host: Failed to reboot_host' T238494 [15:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:03] T238494: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 [15:27:38] (03PS6) 10Muehlenhoff: Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821 [15:28:26] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1083:9536,cp1085:9536,cp1087:9536,cp1089:9536} site=eqiad tunnel={cp3064_v4,cp3064_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:29:06] PROBLEM - traffic_server backend process restarted on cp5010 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin+prometheus/ops&var-instance=cp5010&var-layer=backend [15:30:40] (03PS1) 10Mobrovac: Parsoid: Switch private wiki consumers (Flow, VE) to Parsoid/PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552845 (https://phabricator.wikimedia.org/T229015) [15:33:08] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 48.7 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:34:58] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:36:09] !log cp3064 create filesystem on /dev/nvme0n1p1 (see 
https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552547/) and reboot T238494 [15:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:14] T238494: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 [15:36:22] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:37:04] (03PS2) 10BBlack: Move DNS server profiles under profile::dns:: [puppet] - 10https://gerrit.wikimedia.org/r/552814 (https://phabricator.wikimedia.org/T98006) [15:37:06] (03PS2) 10BBlack: Move DNS roles together under role::dns:: [puppet] - 10https://gerrit.wikimedia.org/r/552815 (https://phabricator.wikimedia.org/T98006) [15:39:14] RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:39:28] RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:40:05] (03PS7) 10Muehlenhoff: Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821 [15:41:20] (03PS1) 10Jbond: idp: add puppetboard service [puppet] - 10https://gerrit.wikimedia.org/r/552850 (https://phabricator.wikimedia.org/T238924) [15:44:11] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/552850 (https://phabricator.wikimedia.org/T238924) (owner: 10Jbond) [15:44:39] (03CR) 10Jbond: [C: 03+2] idp: add puppetboard service [puppet] - 10https://gerrit.wikimedia.org/r/552850 (https://phabricator.wikimedia.org/T238924) (owner: 10Jbond) [15:45:12] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 79.18 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:46:04] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/19590/" [puppet] - 10https://gerrit.wikimedia.org/r/552821 (owner: 10Muehlenhoff) [15:46:24] (03PS1) 10Ayounsi: Remove old cr2-knams <-> cr2/3-esams links [dns] - 10https://gerrit.wikimedia.org/r/552851 (https://phabricator.wikimedia.org/T237031) [15:46:53] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ema) Among the things I've checked to rule out obvious mistakes porting VCL to Lua: - Cookie responses without "session" or "toke... 
[15:47:58] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:49:00] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:49:20] (03CR) 10BBlack: [C: 03+2] "looks good in compiler, bunch of rename-y things but no functional bits: https://puppet-compiler.wmflabs.org/compiler1003/19589/" [puppet] - 10https://gerrit.wikimedia.org/r/552815 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [15:49:25] (03CR) 10BBlack: [C: 03+2] Move DNS server profiles under profile::dns:: [puppet] - 10https://gerrit.wikimedia.org/r/552814 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [15:49:47] !log pool cp3064 with varnish-be T227432 [15:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:51] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [15:54:30] (03PS1) 10Ema: ATS: log Cache-Control as received from the origin [puppet] - 10https://gerrit.wikimedia.org/r/552853 (https://phabricator.wikimedia.org/T238494) [15:56:49] (03PS4) 10CRusnov: netbox: Expose automated DNS repository for web access [puppet] - 10https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183) [15:57:55] (03CR) 10CRusnov: "This change is ready for review." 
(7 comments) [puppet] - https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov) [15:58:28] (CR) Jbond: "looks good but see inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/552821 (owner: Muehlenhoff) [15:59:50] (CR) jerkins-bot: [V: -1] netbox: Expose automated DNS repository for web access [puppet] - https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov) [16:05:36] Operations, SRE-tools, netbox: Netbox reports Icinga checks timeout - https://phabricator.wikimedia.org/T237803 (crusnov) Open→Resolved I executed the plan that Riccardo outlined, removed the running ability in the check and switched to running from the management script, which has simplified... [16:06:06] (CR) Muehlenhoff: Setup rsync config for U2F device storage (2 comments) [puppet] - https://gerrit.wikimedia.org/r/552821 (owner: Muehlenhoff) [16:06:24] (PS8) Muehlenhoff: Setup rsync config for U2F device storage [puppet] - https://gerrit.wikimedia.org/r/552821 [16:07:13] Operations, MediaWiki-REST-API, serviceops, wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (WDoranWMF) Thanks @jijiki! 
[16:08:26] RECOVERY - traffic_server backend process restarted on cp5010 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin+prometheus/ops&var-instance=cp5010&var-layer=backend [16:08:33] 10Operations, 10observability: Monitor mailman outbound mail queue - https://phabricator.wikimedia.org/T236505 (10colewhite) [16:09:43] (03PS1) 10Volans: netbox: remove limit from API query [puppet] - 10https://gerrit.wikimedia.org/r/552854 [16:11:05] 10Operations, 10ops-esams, 10netops: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637 (10faidon) [16:11:39] (03CR) 10CRusnov: [C: 03+1] "LGTM as discussed" [puppet] - 10https://gerrit.wikimedia.org/r/552854 (owner: 10Volans) [16:13:38] 10Operations, 10ops-esams, 10netops: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637 (10faidon) The Anchor is now installed, connected to the SCS, and we see a getty on serial with the right hostname. It's also now responsive to IPv4 pings but not IPv6 (which matches our previous experie... [16:14:08] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:15:48] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:17:26] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [16:22:46] (03PS9) 10Muehlenhoff: Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821 [16:22:48] (03CR) 10Muehlenhoff: Setup rsync config for U2F device storage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552821 (owner: 10Muehlenhoff) [16:24:28] (03CR) 10Nuria: "Looks good, please take some time to read https://wikitech.wikimedia.org/wiki/Analytics/Data_Access_Guidelines" [puppet] - 10https://gerrit.wikimedia.org/r/552818 (owner: 10Effie Mouzeli) [16:32:16] 10Operations: ganeti netbox sync alerts are noisy - https://phabricator.wikimedia.org/T233624 (10crusnov) 05Open→03Resolved This should be resolved. [16:33:49] 10Operations: Puppet breakage in automation-framework VMs - https://phabricator.wikimedia.org/T234452 (10crusnov) 05Open→03Resolved This should be resolved, I've spot checked hosts in the af project and they have been running puppet normally. [16:34:02] (03PS1) 10Muehlenhoff: Setup systemd timer for rsync of U2F devices config [puppet] - 10https://gerrit.wikimedia.org/r/552857 [16:36:50] (03CR) 10jerkins-bot: [V: 04-1] Setup systemd timer for rsync of U2F devices config [puppet] - 10https://gerrit.wikimedia.org/r/552857 (owner: 10Muehlenhoff) [16:37:36] 10Operations, 10ops-esams: cr3-esams:et-1/0/0 flap - https://phabricator.wikimedia.org/T236767 (10faidon) @mark swapped the optic with a new one and the link is now reenabled. This is being monitored for another 24-36h and will be resolved then. 
[16:37:37] (CR) Ayounsi: [C: +2] Remove old cr2-knams <-> cr2/3-esams links [dns] - https://gerrit.wikimedia.org/r/552851 (https://phabricator.wikimedia.org/T237031) (owner: Ayounsi) [16:38:36] (PS1) Jbond: puppetboard: add proxied_as parameter [puppet] - https://gerrit.wikimedia.org/r/552859 (https://phabricator.wikimedia.org/T238924) [16:39:24] Operations, ops-esams, Patch-For-Review: Bundle esams-knams links back - https://phabricator.wikimedia.org/T237031 (ayounsi) Open→Resolved a:ayounsi All done. Not re-enabling knams transits as we're setting up the new MX204 right now. [16:39:54] Operations, Wikimedia-Mailing-lists, Patch-For-Review: Rename multimedia-team to structured-data-team - https://phabricator.wikimedia.org/T235550 (crusnov) a:crusnov→None [16:40:18] Operations, Wikimedia-Mailing-lists, Patch-For-Review: Rename multimedia-team to structured-data-team - https://phabricator.wikimedia.org/T235550 (crusnov) Passing to next clinic duty person. 
[16:40:38] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (jijiki) [16:41:08] (PS2) Muehlenhoff: Setup systemd timer for rsync of U2F devices config [puppet] - https://gerrit.wikimedia.org/r/552857 [16:44:12] (PS2) Jbond: puppetboard: add proxied_as parameter [puppet] - https://gerrit.wikimedia.org/r/552859 (https://phabricator.wikimedia.org/T238924) [16:44:31] (CR) jerkins-bot: [V: -1] Setup systemd timer for rsync of U2F devices config [puppet] - https://gerrit.wikimedia.org/r/552857 (owner: Muehlenhoff) [16:47:40] (PS3) Herron: logstash: create elk7 ES role and assign to elk7 ES hw hosts [puppet] - https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854) [16:47:52] (CR) Volans: [C: +2] netbox: remove limit from API query [puppet] - https://gerrit.wikimedia.org/r/552854 (owner: Volans) [16:47:54] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/552821 (owner: Muehlenhoff) [16:48:26] !log upgrading and restarting dbprov* hosts [16:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:10] (PS1) Jcrespo: check_mariadb.py: Update bacula logic to the latest Bacula class [puppet] - https://gerrit.wikimedia.org/r/552860 [16:50:26] (CR) jerkins-bot: [V: -1] logstash: create elk7 ES role and assign to elk7 ES hw hosts [puppet] - https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854) (owner: Herron) [16:51:08] (PS4) Herron: logstash: create elk7 ES role and assign to elk7 ES hw hosts [puppet] - https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854) [16:52:01] (CR) jerkins-bot: [V: -1] check_mariadb.py: Update bacula logic to the latest Bacula class [puppet] - https://gerrit.wikimedia.org/r/552860 (owner: Jcrespo) [16:52:15] Operations, Traffic, netops, observability: Network port utilization alerts 
should be paging - https://phabricator.wikimedia.org/T224888 (10CDanis) I've a proposal for doing this: - Add some special tag like `#NRPE` or `#page` to the names of any [[ https://librenms.wikimedia.org/alert-rules | L... [16:54:23] (03PS3) 10Muehlenhoff: Setup systemd timer for rsync of U2F devices config [puppet] - 10https://gerrit.wikimedia.org/r/552857 [16:56:17] (03CR) 10Ema: [C: 03+1] ATS: Prevent logrotate from creating empty log files [puppet] - 10https://gerrit.wikimedia.org/r/552678 (https://phabricator.wikimedia.org/T238724) (owner: 10Vgutierrez) [16:56:50] (03CR) 10Vgutierrez: [C: 03+2] ATS: Prevent logrotate from creating empty log files [puppet] - 10https://gerrit.wikimedia.org/r/552678 (https://phabricator.wikimedia.org/T238724) (owner: 10Vgutierrez) [16:57:30] (03PS5) 10CRusnov: netbox: Expose automated DNS repository for web access [puppet] - 10https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183) [16:57:33] (03CR) 10jerkins-bot: [V: 04-1] Setup systemd timer for rsync of U2F devices config [puppet] - 10https://gerrit.wikimedia.org/r/552857 (owner: 10Muehlenhoff) [16:58:27] (03PS4) 10Jbond: Setup systemd timer for rsync of U2F devices config [puppet] - 10https://gerrit.wikimedia.org/r/552857 (owner: 10Muehlenhoff) [17:00:04] gehel and onimisionipe: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191125T1700). 
[17:00:49] ack [17:01:10] Operations, ops-eqsin: duplicate cable IDs in eqsin - https://phabricator.wikimedia.org/T239125 (RobH) p:Triage→Normal [17:01:22] Operations, ops-eqsin: duplicate cable IDs in eqsin - https://phabricator.wikimedia.org/T239125 (RobH) [17:02:56] (PS3) Jbond: puppetboard: add proxied_as parameter [puppet] - https://gerrit.wikimedia.org/r/552859 (https://phabricator.wikimedia.org/T238924) [17:05:00] Operations, ops-eqiad, Analytics: analytics1057's BBU is faulty - https://phabricator.wikimedia.org/T239045 (mforns) p:Triage→High [17:05:20] !log arlolra@deploy1001 Started deploy [parsoid/deploy@e7faa19]: Updating Parsoid to a6bfdfa [17:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:24] Operations, MediaWiki-REST-API, serviceops, wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (WDoranWMF) @jijiki Do you know when the rollout will be complete to all prod? [17:12:59] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@4c5f503]: New Blazegraph Build and WDQS Updates [17:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:18] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@e7faa19]: Updating Parsoid to a6bfdfa (duration: 08m 58s) [17:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:55] (PS1) BBlack: Fix common/monitoring dnsbox cluster defs [puppet] - https://gerrit.wikimedia.org/r/552861 (https://phabricator.wikimedia.org/T98006) [17:15:24] Operations, MediaWiki-REST-API, serviceops, wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (jijiki) @WDoranWMF Today after the SRE meeting, I will roll out to production. We had some minor issues with our api servers this morn... 
[17:16:59] (CR) BBlack: [C: +2] Fix common/monitoring dnsbox cluster defs [puppet] - https://gerrit.wikimedia.org/r/552861 (https://phabricator.wikimedia.org/T98006) (owner: BBlack) [17:17:42] (PS2) Eevans: kask-echoseen: Do not report dupes [mediawiki-config] - https://gerrit.wikimedia.org/r/552257 (https://phabricator.wikimedia.org/T237143) (owner: Mobrovac) [17:19:42] !log power down cr2-knams - T237030 [17:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:47] T237030: setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 [17:20:02] Operations, ops-codfw, DBA, Patch-For-Review: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Papaul) a:Papaul→Marostegui complete Before BIOS Version 2.2.11 iDRAC Firmware Version 3.34.34.34 After BIOS Version 2.4.7 iDRAC Firmware Version 3.36.36.36 [17:21:13] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [17:21:27] Operations, ops-codfw, DBA, Patch-For-Review: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) Thank you Papaul. I will start MySQL and do a data consistency check. 
[17:23:42] (CR) Gehel: [C: +1] "LGTM (for what it's worth)" [puppet] - https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T237362) (owner: Jbond) [17:24:52] (PS3) Ema: ATS: explicitly skip the cache instead of hiding CC [puppet] - https://gerrit.wikimedia.org/r/552076 (https://phabricator.wikimedia.org/T238494) [17:24:54] (PS1) Ema: ATS: do not coalesce uncacheable requests [puppet] - https://gerrit.wikimedia.org/r/552862 (https://phabricator.wikimedia.org/T238494) [17:25:21] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@4c5f503]: New Blazegraph Build and WDQS Updates (duration: 12m 23s) [17:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:06] (CR) Vgutierrez: [C: +1] ATS: do not coalesce uncacheable requests [puppet] - https://gerrit.wikimedia.org/r/552862 (https://phabricator.wikimedia.org/T238494) (owner: Ema) [17:29:49] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (elukey) In hiera we have 4 codfw mw hosts acting as proxy for mcrouter: ` codfw: A: host: 10.192.0.61 # mw2235, A3 port: 11214 ssl: true B: host: 10.192.16.5... [17:31:45] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:45] PROBLEM - Check systemd state on wdqs1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:57] PROBLEM - Check systemd state on wdqs2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:09] PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:11] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:14] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10elukey) [17:32:19] onimisionipe: ^^known issue? [17:32:21] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:29] PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:35] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:35] PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:35] PROBLEM - Check systemd state on wdqs2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:59] hmmm [17:33:20] I'm checking [17:33:39] onimisionipe: looks like an error in the updater [17:34:10] onimisionipe: probably needs a rollback [17:34:19] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:22] ● wdqs-updater.service loaded failed failed Query Service Updater [17:35:08] yeah a java exception on startup [17:35:11] null pointer exception when syncing dates [17:35:57] onimisionipe: I'll open a phab task, scream if you need help! [17:36:21] !log Upgrade kernel on db2125 T239042 [17:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:29] T239042: db2125 crashed - https://phabricator.wikimedia.org/T239042 [17:36:36] rolling back! [17:36:50] (PS1) Mholloway: Update wikifeeds to 2019-11-25-173023-production [deployment-charts] - https://gerrit.wikimedia.org/r/552864 (https://phabricator.wikimedia.org/T235652) [17:37:49] RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:23] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:51] RECOVERY - Check systemd state on wdqs1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:40:19] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [17:40:35] (03PS5) 10Andrew Bogott: wmf_sink: remove instance-puppet git entries for deleted VMs [puppet] - 10https://gerrit.wikimedia.org/r/552583 (https://phabricator.wikimedia.org/T238708) [17:41:45] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [17:42:09] PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:42:10] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki) [17:42:15] RECOVERY - Check systemd state on wdqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:43:08] 10Operations, 10ops-esams, 10netops: setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 (10mark) [17:43:13] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10BBlack) It was observed earlier in the traffic meeting that we're fairly certain that none of our R440 hosts have had this problem more than once, so this may be a "once per server" phenomenon, in... 
[17:43:15] PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:43:32] (03PS2) 10Cmjohnson: Adding dns entries new ganeti hosts [dns] - 10https://gerrit.wikimedia.org/r/552812 (https://phabricator.wikimedia.org/T228924) [17:43:41] (03PS6) 10Andrew Bogott: wmf_sink: remove instance-puppet git entries for deleted VMs [puppet] - 10https://gerrit.wikimedia.org/r/552583 (https://phabricator.wikimedia.org/T238708) [17:44:00] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install fundraising db system - https://phabricator.wikimedia.org/T239139 (10RobH) p:05Triage→03Normal [17:44:08] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install fundraising db system - https://phabricator.wikimedia.org/T239139 (10RobH) [17:44:46] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2125 crashed - https://phabricator.wikimedia.org/T239042 (10Marostegui) Kernel upgraded and host rebooted: ` root@db2125:~# uname -a Linux db2125 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 GNU/Linux ` [17:45:00] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install fundraising db system - https://phabricator.wikimedia.org/T239139 (10RobH) @jgreen: Can you provide the hostname and network info for these? I think we want to have the network ports, one each, plugged into the fasw (so the single server... 
[17:45:12] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki) [17:45:14] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@4c5f503]: Revert New Blazegraph Build and WDQS Updates [17:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:56] (03CR) 10Andrew Bogott: [C: 03+2] wmf_sink: remove instance-puppet git entries for deleted VMs [puppet] - 10https://gerrit.wikimedia.org/r/552583 (https://phabricator.wikimedia.org/T238708) (owner: 10Andrew Bogott) [17:46:34] (03CR) 10Cmjohnson: [C: 03+2] Adding dns entries new ganeti hosts [dns] - 10https://gerrit.wikimedia.org/r/552812 (https://phabricator.wikimedia.org/T228924) (owner: 10Cmjohnson) [17:46:55] RECOVERY - Check systemd state on wdqs2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:46:55] PROBLEM - Check systemd state on wdqs2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:47:19] 10Operations, 10ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10Cmjohnson) production dns was added here https://gerrit.wikimedia.org/r/#/c/operations/dns/+/552812/ [17:47:22] (03CR) 10Mobrovac: [C: 03+2] Parsoid: Switch private wiki consumers (Flow, VE) to Parsoid/PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552845 (https://phabricator.wikimedia.org/T229015) (owner: 10Mobrovac) [17:47:31] 10Operations, 10ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10Cmjohnson) [17:48:01] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:09] (03Merged) 10jenkins-bot: Parsoid: Switch private wiki consumers (Flow, VE) to Parsoid/PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552845 (https://phabricator.wikimedia.org/T229015) (owner: 10Mobrovac) [17:48:24] (03CR) 10Vgutierrez: [C: 03+1] ATS: explicitly skip the cache instead of hiding CC [puppet] - 10https://gerrit.wikimedia.org/r/552076 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [17:48:33] RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:49:06] 10Operations, 10ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10Cmjohnson) [17:50:12] !log mobrovac@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Parsoid: Switch private wiki clients (Flow, VE) to Parsoid/PHP -- T229015 (duration: 00m 53s) [17:50:13] PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:17] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [17:51:06] 10Operations, 10Traffic, 10fixcopyright.wikimedia.org: Redirect all traffic for fixcopyright.wikimedia.org to https://policy.wikimedia.org/policy-landing/copyright/ - https://phabricator.wikimedia.org/T239141 (10Jdforrester-WMF) [17:51:26] 10Operations, 10Traffic, 10fixcopyright.wikimedia.org, 10Core Platform Team Workboards (Clinic Duty Team), and 3 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Jdforrester-WMF) 05Open→03Stalled [17:51:55] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:52:29] RECOVERY - Check systemd state on wdqs1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:53:03] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:53:35] PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:53:37] RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:54:31] RECOVERY - Check systemd state on wdqs2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:55:01] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:55:38] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@4c5f503]: Revert New Blazegraph Build and WDQS Updates (duration: 10m 24s) [17:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:49] !log restart wdqs-updater on all wdqs servers [17:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:55] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:56:29] RECOVERY - Check systemd state on wdqs1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:56:31] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:56:50] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install fundraising db system - https://phabricator.wikimedia.org/T239139 (10Jgreen) [17:56:59] RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:57:05] RECOVERY - Check systemd state on wdqs2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:57:05] 
RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:57:05] RECOVERY - Check systemd state on wdqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:57:25] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install fundraising db system - https://phabricator.wikimedia.org/T239139 (10Jgreen) [17:57:28] 04Critical Alert for device cr2-knams.wikimedia.org - Juniper alarm active [17:57:39] PROBLEM - Check systemd state on wdqs1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:59:19] RECOVERY - Check systemd state on wdqs1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191125T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. 
[18:00:11] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install fundraising db system - https://phabricator.wikimedia.org/T239139 (10Jgreen) [18:02:39] 10Operations, 10serviceops, 10HHVM, 10MW-1.35-notes (1.35.0-wmf.3; 2019-10-22), and 2 others: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki) [18:02:57] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki) [18:03:00] 10Operations, 10serviceops, 10HHVM, 10MW-1.35-notes (1.35.0-wmf.3; 2019-10-22), and 2 others: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki) [18:03:35] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install fundraising db system - https://phabricator.wikimedia.org/T239139 (10Jgreen) >>! In T239139#5690558, @RobH wrote: > @jgreen: Can you provide the hostname and network info for these? I think we want to have the network ports, one each, plu... [18:05:26] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki) [18:05:37] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (Need By: IMMEDIATE) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10RobH) [18:05:55] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (Need By: IMMEDIATE) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10RobH) p:05Normal→03High [18:06:49] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10RobH) [18:07:19] (03CR) 10Rush: "testing adding a group as a reviewer" [puppet] - 10https://gerrit.wikimedia.org/r/456690 (owner: 10Rush) [18:07:22] !log Upgrade php-wikidiff2 to 1.10.0 to all servers - T236963 [18:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:27] T236963: Deploy version 1.10.0 of 
wikidiff2 to production - https://phabricator.wikimedia.org/T236963 [18:10:16] (03PS1) 10Andrew Bogott: wmf_sink: forward newton changes to ocata [puppet] - 10https://gerrit.wikimedia.org/r/552867 (https://phabricator.wikimedia.org/T238708) [18:11:59] PROBLEM - Disk space on cp4028 is CRITICAL: DISK CRITICAL - free space: / 107 MB (1% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4028&var-datasource=ulsfo+prometheus/ops [18:12:21] (03CR) 10Andrew Bogott: [C: 03+2] wmf_sink: forward newton changes to ocata [puppet] - 10https://gerrit.wikimedia.org/r/552867 (https://phabricator.wikimedia.org/T238708) (owner: 10Andrew Bogott) [18:13:24] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10RobH) [18:13:44] is it the swat window now? [18:13:54] (03PS3) 10Elukey: admin: add analytics-privatedata system user [puppet] - 10https://gerrit.wikimedia.org/r/550814 (https://phabricator.wikimedia.org/T238306) [18:14:03] jouncebot: now [18:14:03] For the next 0 hour(s) and 45 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191125T1800) [18:14:11] 10Operations, 10ops-esams, 10netops: setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 (10mark) [18:14:21] looks like it :) [18:14:45] (03CR) 10Elukey: [C: 03+2] "Approved by today's SRE meeting. Just rebased." [puppet] - 10https://gerrit.wikimedia.org/r/550814 (https://phabricator.wikimedia.org/T238306) (owner: 10Elukey) [18:15:29] i would like https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TemplateData/+/552620 backported once that merges. 
[18:16:08] !log Restart php-fpm on mw* and wtp* servers in eqiad and codfw - T236963 [18:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:14] T236963: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 [18:17:48] !log cp4028: disk space exhausted, rm /var/log/daemon.log + restart rsyslog [18:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:27] PROBLEM - Disk space on cp4031 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4031&var-datasource=ulsfo+prometheus/ops [18:18:49] RECOVERY - Disk space on cp4028 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4028&var-datasource=ulsfo+prometheus/ops [18:19:29] (03CR) 10Elukey: [C: 04-1] "Erik I got the +1 to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552613/, can you please change this accordingly?" [puppet] - 10https://gerrit.wikimedia.org/r/552304 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson) [18:20:38] (03CR) 10Elukey: [C: 03+2] Create airflow-search-admins admin group [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn) [18:20:45] (03PS8) 10Elukey: Create airflow-search-admins admin group [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn) [18:20:57] (03CR) 10Elukey: [C: 03+2] "Approved by today's SRE meeting." 
[puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn) [18:24:25] PROBLEM - Disk space on cp4032 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4032&var-datasource=ulsfo+prometheus/ops [18:24:35] PROBLEM - Disk space on cp5007 is CRITICAL: DISK CRITICAL - free space: / 12 MB (0% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp5007&var-datasource=eqsin+prometheus/ops [18:25:22] jouncebot: refresh [18:25:22] I refreshed my knowledge about deployments. [18:25:24] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Check if a GPU fits in any of the remaining stat or notebook hosts - https://phabricator.wikimedia.org/T220698 (10RobH) So, I advise against any guesswork on this. If you want to know if these 5 servers will hold a GPU, each purchase group needs to be... [18:25:25] jouncebot: next [18:25:25] In 0 hour(s) and 34 minute(s): Grafana upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191125T1900) [18:26:07] PROBLEM - Disk space on cp4029 is CRITICAL: DISK CRITICAL - free space: / 112 MB (1% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4029&var-datasource=ulsfo+prometheus/ops [18:26:26] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Check if a GPU fits in any of the remaining stat or notebook hosts - https://phabricator.wikimedia.org/T220698 (10RobH) Please note we likely need to schedule downtime for each of those 4 hosts to shutdown and check them. @elukey: Can you advise how m... 
[18:26:34] !log cp[245]*: disk space exhausted, rm /var/log/daemon.log + restart rsyslog [18:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:57] RECOVERY - Disk space on cp4031 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4031&var-datasource=ulsfo+prometheus/ops [18:27:49] RECOVERY - Disk space on cp4032 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4032&var-datasource=ulsfo+prometheus/ops [18:27:59] RECOVERY - Disk space on cp5007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp5007&var-datasource=eqsin+prometheus/ops [18:29:13] (03PS1) 10RobH: frdb1003 mgmt entry [dns] - 10https://gerrit.wikimedia.org/r/552868 (https://phabricator.wikimedia.org/T239139) [18:29:33] RECOVERY - Disk space on cp4029 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4029&var-datasource=ulsfo+prometheus/ops [18:29:35] doh [18:29:36] (03CR) 10jerkins-bot: [V: 04-1] frdb1003 mgmt entry [dns] - 10https://gerrit.wikimedia.org/r/552868 (https://phabricator.wikimedia.org/T239139) (owner: 10RobH) [18:29:37] immediate typo [18:29:45] (03PS1) 10Vgutierrez: Revert "ATS: enable reload for global Lua script" [puppet] - 10https://gerrit.wikimedia.org/r/552869 [18:29:52] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Check if a GPU fits in any of the remaining stat or notebook hosts - https://phabricator.wikimedia.org/T220698 (10elukey) a:05Cmjohnson→03RobH @RobH if we could go one at the time I think that a day before the maintenance is sufficient, I'll take c... 
[18:30:02] (03PS2) 10RobH: frdb1003 mgmt entry [dns] - 10https://gerrit.wikimedia.org/r/552868 (https://phabricator.wikimedia.org/T239139) [18:30:57] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Check if a GPU fits in any of the remaining stat or notebook hosts - https://phabricator.wikimedia.org/T220698 (10RobH) a:05RobH→03Cmjohnson Please note that Chris will still be performing this, it needs to stay assigned to him. He will be coordin... [18:31:13] PROBLEM - Check systemd state on cp4031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:31:20] (03CR) 10RobH: [C: 03+2] frdb1003 mgmt entry [dns] - 10https://gerrit.wikimedia.org/r/552868 (https://phabricator.wikimedia.org/T239139) (owner: 10RobH) [18:31:51] mdholloway, is it normal for machinevision tests to take 20+ mins? ( https://integration.wikimedia.org/zuul/ ) [18:31:58] (03Abandoned) 10Lucas Werkmeister (WMDE): fatalmonitor: exec watch [puppet] - 10https://gerrit.wikimedia.org/r/499761 (owner: 10Lucas Werkmeister (WMDE)) [18:32:54] (03PS5) 10Elukey: airflow: Add upstream configuration [puppet] - 10https://gerrit.wikimedia.org/r/544996 (owner: 10EBernhardson) [18:34:10] subbu: unfortunately yes. we have to wait for a lot of unrelated Wikibase tests to run due to our dependency on WikibaseMediaInfo. [18:34:12] (03PS10) 10Elukey: airflow: Initial deployment for search platform [puppet] - 10https://gerrit.wikimedia.org/r/544989 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson) [18:34:21] mdholloway, ok. 
[18:34:37] (03PS1) 10Ammarpad: Enable Translate extension on sewikiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552870 (https://phabricator.wikimedia.org/T239091) [18:35:43] (03PS2) 10Ammarpad: Enable Translate extension on sewikiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552870 (https://phabricator.wikimedia.org/T239091) [18:36:02] alrighty .. looks like the templatedata patch merged. [18:36:11] (03CR) 10Elukey: [C: 03+2] airflow: Add upstream configuration [puppet] - 10https://gerrit.wikimedia.org/r/544996 (owner: 10EBernhardson) [18:37:27] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-knams.wikimedia.org recovered from Juniper alarm active [18:38:11] (03CR) 10Elukey: [C: 03+2] airflow: Initial deployment for search platform [puppet] - 10https://gerrit.wikimedia.org/r/544989 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson) [18:38:13] is there some swatter available to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TemplateData/+/552620 [18:38:19] 10Operations, 10SRE-Access-Requests: Requesting access to wiki pages, recentchanges, revesions and users info in DB for Matanel11111 and Nahum11 - https://phabricator.wikimedia.org/T239149 (10Matanel11111) [18:39:15] PROBLEM - Disk space on cp4028 is CRITICAL: DISK CRITICAL - free space: / 242 MB (2% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4028&var-datasource=ulsfo+prometheus/ops [18:40:17] MaxSem, Niharika RoanKattouw ? 
[18:41:25] RECOVERY - Check systemd state on cp4031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:43:10] I can do it [18:43:23] (I’m a swatter, just not usually for this slot :) ) [18:43:44] (03CR) 10Ema: [C: 03+2] Revert "ATS: enable reload for global Lua script" [puppet] - 10https://gerrit.wikimedia.org/r/552869 (owner: 10Vgutierrez) [18:44:55] (03PS1) 10Elukey: airflow: move hiera config under role and add missing params [puppet] - 10https://gerrit.wikimedia.org/r/552872 (https://phabricator.wikimedia.org/T236180) [18:45:18] Lucas_WMDE, ok ty ... I see Reedy cherry-picked it already @ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TemplateData/+/552871 [18:45:24] * Urbanecm is around, if needed [18:45:26] yup, waiting on CI at the moment [18:45:34] Urbanecm: you do enough SWATs as it is, let me have this one :P [18:45:47] Lucas_WMDE: feel free to :D [18:47:27] PROBLEM - Disk space on cp4031 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4031&var-datasource=ulsfo+prometheus/ops [18:47:49] the deploy should be a no-op since there isn't any deployed parsoid code that uses that new hook yet. but, that will change tomorrow. 
:) [18:48:06] (03PS2) 10Elukey: airflow: move hiera config under role and add missing params [puppet] - 10https://gerrit.wikimedia.org/r/552872 (https://phabricator.wikimedia.org/T236180) [18:48:17] PROBLEM - Disk space on cp4032 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4032&var-datasource=ulsfo+prometheus/ops [18:48:31] PROBLEM - Disk space on cp5007 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp5007&var-datasource=eqsin+prometheus/ops [18:48:47] !log mw1298 - pooling [18:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:41] 10Operations, 10SRE-Access-Requests: Requesting access to wiki pages, recentchanges, revesions and users info in DB for Matanel11111 and Nahum11 - https://phabricator.wikimedia.org/T239149 (10Masumrezarock100) a:05Matanel11111→03None [18:50:13] RECOVERY - Disk space on cp5007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp5007&var-datasource=eqsin+prometheus/ops [18:50:15] !log cp[245]*: wipe daemon.log and restart syslog, again [18:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:39] PROBLEM - Disk space on cp5010 is CRITICAL: DISK CRITICAL - free space: / 24 MB (0% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp5010&var-datasource=eqsin+prometheus/ops [18:50:40] it’s merged! 
[18:50:51] RECOVERY - Disk space on cp4031 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4031&var-datasource=ulsfo+prometheus/ops [18:51:18] 10Operations, 10SRE-Access-Requests: Requesting access to wiki pages, RecentChanges, revisions and users info in DB for Matanel11111 and Nahum11 - https://phabricator.wikimedia.org/T239149 (10Masumrezarock100) [18:51:20] (03PS1) 10CRusnov: coherence: Check device names for correct case [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552874 (https://phabricator.wikimedia.org/T237469) [18:51:22] !log cumin -b1 'A:cp-ats and A:eqiad' 'run-puppet-agent; ats-backend-restart & ats-tls-restart' [18:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:43] RECOVERY - Disk space on cp4032 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4032&var-datasource=ulsfo+prometheus/ops [18:52:09] (03PS2) 10CRusnov: coherence: Check device names for correct case [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552874 (https://phabricator.wikimedia.org/T237469) [18:52:09] 10Operations, 10SRE-Access-Requests: Requesting access to wiki pages, RecentChanges, revisions and users info in DB for Matanel11111 and Nahum11 - https://phabricator.wikimedia.org/T239149 (10Urbanecm) isn't that what toolforge allows you to have? See https://wikitech.wikimedia.org/wiki/Help:Toolforge and http... 
[18:52:23] RECOVERY - Disk space on cp5010 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp5010&var-datasource=eqsin+prometheus/ops [18:52:26] (03CR) 10Elukey: [C: 03+2] airflow: move hiera config under role and add missing params [puppet] - 10https://gerrit.wikimedia.org/r/552872 (https://phabricator.wikimedia.org/T236180) (owner: 10Elukey) [18:52:28] !log cumin -b1 'A:cp-ats and A:codfw' 'run-puppet-agent; ats-backend-restart & ats-tls-restart' [18:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:55] PROBLEM - Disk space on cp4028 is CRITICAL: DISK CRITICAL - free space: / 236 MB (2% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4028&var-datasource=ulsfo+prometheus/ops [18:53:21] !log cumin -b1 'A:cp-ats and A:ulsfo' 'run-puppet-agent; ats-backend-restart & ats-tls-restart' [18:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:35] subbu: the change is on mwdebug1001, is there any way to test it at all? [18:53:41] nope. it is a no-op. [18:53:44] ok [18:53:47] then I’ll just sync [18:53:49] thanks! [18:53:53] yup. ty. 
[18:53:54] !log cumin -b1 'A:cp-ats and A:eqsin' 'run-puppet-agent; ats-backend-restart & ats-tls-restart' [18:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:20] !log cumin -b1 'A:cp-ats and A:esams' 'run-puppet-agent; ats-backend-restart & ats-tls-restart' [18:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:24] !log cp[245]*: wipe daemon.log and syslog and restart syslog, again [18:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:29] (03PS5) 10Herron: logstash: create elk7 ES role and assign to elk7 ES hw hosts [puppet] - 10https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854) [18:55:40] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/TemplateData/: SWAT: [[gerrit:552871|Implement ParsoidFetchTemplateData hook for Parsoid/PHP (T238954)]] (duration: 00m 53s) [18:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:45] T238954: html2wt: Missing implementation of 'ParsoidFetchTemplateData' to fetch templatedata - https://phabricator.wikimedia.org/T238954 [18:56:51] !log Morning SWAT done [18:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:35] (03PS6) 10Herron: logstash: create elk7 ES role and assign to elk7 ES hw hosts [puppet] - 10https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854) [18:59:44] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10thcipriani) [19:00:04] cdanis: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Grafana upgrade. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191125T1900). 
[19:01:07] PROBLEM - Disk space on cp4031 is CRITICAL: DISK CRITICAL - free space: / 186 MB (2% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4031&var-datasource=ulsfo+prometheus/ops [19:01:17] 10Operations, 10SRE-Access-Requests: Requesting access to wiki pages, RecentChanges, revisions and users info in DB for Matanel11111 and Nahum11 - https://phabricator.wikimedia.org/T239149 (10Matanel11111) I have a tool, but how can i login to the tool with SSH? [19:01:39] (03PS13) 10EBernhardson: airflow: Run webserver and scheduler processes [puppet] - 10https://gerrit.wikimedia.org/r/544990 (https://phabricator.wikimedia.org/T236180) [19:01:59] PROBLEM - Disk space on cp4032 is CRITICAL: DISK CRITICAL - free space: / 114 MB (1% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4032&var-datasource=ulsfo+prometheus/ops [19:02:11] PROBLEM - Disk space on cp5007 is CRITICAL: DISK CRITICAL - free space: / 133 MB (1% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp5007&var-datasource=eqsin+prometheus/ops [19:02:37] (03PS1) 10RLazarus: poolcounter: Install and run poolcounter-prometheus-exporter alongside poolcounter. [puppet] - 10https://gerrit.wikimedia.org/r/552875 (https://phabricator.wikimedia.org/T237407) [19:03:06] (03PS1) 10CDanis: grafana1002: is just grafana.wm.o now [puppet] - 10https://gerrit.wikimedia.org/r/552876 (https://phabricator.wikimedia.org/T220838) [19:03:12] (03CR) 10Herron: "PCC https://puppet-compiler.wmflabs.org/compiler1002/19599/" [puppet] - 10https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [19:04:19] (03CR) 10jerkins-bot: [V: 04-1] poolcounter: Install and run poolcounter-prometheus-exporter alongside poolcounter. 
[puppet] - 10https://gerrit.wikimedia.org/r/552875 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus) [19:04:30] 10Operations: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (10Cmjohnson) [19:04:32] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10Paladox) a:03Dzahn [19:05:17] thank you jerkins-bot [19:05:22] 10Operations: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (10Cmjohnson) @fgiunchedi These are ready for you for implementation. I removed the ops-eqiad tag. if you have an issue please assign to me and add the ops-eqiad tag back [19:05:51] (03PS2) 10RLazarus: poolcounter: Install and run the prometheus exporter alongside poolcounter. [puppet] - 10https://gerrit.wikimedia.org/r/552875 (https://phabricator.wikimedia.org/T237407) [19:06:03] !log making grafana.wikimedia.org read-only (on grafana1001) ✔️ cdanis@grafana1001.eqiad.wmnet ~ 🕑☕ sudo chmod -w /var/lib/grafana/grafana.db [19:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:04] (03CR) 10CDanis: [C: 03+2] grafana1002: is just grafana.wm.o now [puppet] - 10https://gerrit.wikimedia.org/r/552876 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [19:07:07] PROBLEM - Disk space on cp4029 is CRITICAL: DISK CRITICAL - free space: / 246 MB (2% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4029&var-datasource=ulsfo+prometheus/ops [19:07:22] !log stopping grafana-next.wikimedia.org (on grafana1002) [19:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:32] (03CR) 10jerkins-bot: [V: 04-1] poolcounter: Install and run the prometheus exporter alongside poolcounter. 
[puppet] - 10https://gerrit.wikimedia.org/r/552875 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus) [19:09:00] (03PS2) 10EBernhardson: Allow analytics-search-users to manage search/airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/552304 (https://phabricator.wikimedia.org/T236180) [19:09:32] (03PS1) 10Elukey: profile::analytics::search::airflow: fix file resource and add deps [puppet] - 10https://gerrit.wikimedia.org/r/552878 (https://phabricator.wikimedia.org/T236180) [19:10:44] 10Operations, 10ops-esams, 10netops: Setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 (10faidon) [19:11:20] (03PS1) 10CDanis: grafana1002: is now the server for grafana.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/552879 (https://phabricator.wikimedia.org/T220838) [19:11:32] 10Operations, 10ops-eqiad, 10DC-Ops: Asset tag remaining cablemgmt in eqiad - https://phabricator.wikimedia.org/T239110 (10Cmjohnson) 05Open→03Resolved asset tags have been added to all and netbox updated [19:11:36] !log copied snapshot of database from grafana1001 to grafana1002 T220838 [19:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:41] T220838: Upgrade grafana to 6.4.4 - https://phabricator.wikimedia.org/T220838 [19:11:43] PROBLEM - HTTPS Unified RSA on cp5007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [19:11:49] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5007 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 0.464 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:11:59] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp5007 is CRITICAL: connect to address 10.132.0.107 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:12:05] PROBLEM - 
check_trafficserver_log_fifo_tls_tls on cp5007 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - fifo-log-demux not reading from pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:12:15] PROBLEM - Ensure traffic_server is running for instance tls on cp5007 is CRITICAL: PROCS CRITICAL: 0 processes with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:12:33] PROBLEM - HTTPS Unified ECDSA on cp5007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [19:12:41] PROBLEM - ats-tls HTTPS en.wikipedia.org RSA on cp5007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [19:12:41] PROBLEM - ats-tls HTTPS en.wikipedia.org ECDSA on cp5007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [19:13:03] PROBLEM - Ensure traffic_manager is running for instance tls on cp5007 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:13:35] PROBLEM - Juniper alarms on cr2-knams is CRITICAL: JNX_ALARMS CRITICAL - The requested table is empty or does not exist https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [19:13:38] (03CR) 10Elukey: [C: 03+2] profile::analytics::search::airflow: fix file resource and add deps [puppet] - 10https://gerrit.wikimedia.org/r/552878 (https://phabricator.wikimedia.org/T236180) (owner: 10Elukey) [19:13:41] !log restarted grafana-server on grafana1002 T220838 [19:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:21] PROBLEM - Check systemd state on cp4032 is CRITICAL: CRITICAL - degraded: The system is 
operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:14:22] cp5007 is a reinstall? [19:14:22] !log cp[245]*: wipe daemon.log and syslog and restart syslog, again [19:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:37] i see the host key changed [19:14:43] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10wiki_willy) @Jgreen - I gave @Jclark-ctr a heads up on this, so he'll starting working on it, when he gets in a bit lat... [19:14:45] RECOVERY - Ensure traffic_manager is running for instance tls on cp5007 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:15:11] RECOVERY - HTTPS Unified RSA on cp5007 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345572 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-11-22 07:59:59 +0000 (expires in 362 days) https://wikitech.wikimedia.org/wiki/HTTPS [19:15:15] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5007 is OK: HTTP OK: HTTP/1.0 200 OK - 19886 bytes in 0.705 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:15:27] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp5007 is OK: HTTP OK: HTTP/1.1 200 Ok - 30130 bytes in 1.166 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:15:31] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp5007 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:15:41] PROBLEM - HTTPS Unified RSA on cp4028 is 
CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [19:15:41] RECOVERY - Ensure traffic_server is running for instance tls on cp5007 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:15:47] RECOVERY - Disk space on cp4029 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4029&var-datasource=ulsfo+prometheus/ops [19:15:53] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp4028 is CRITICAL: connect to address 10.128.0.128 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:15:59] RECOVERY - HTTPS Unified ECDSA on cp5007 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345523 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-11-22 07:59:59 +0000 (expires in 362 days) https://wikitech.wikimedia.org/wiki/HTTPS [19:16:03] (03PS3) 10RLazarus: poolcounter: Install and run the prometheus exporter alongside poolcounter. 
[puppet] - 10https://gerrit.wikimedia.org/r/552875 (https://phabricator.wikimedia.org/T237407) [19:16:07] RECOVERY - ats-tls HTTPS en.wikipedia.org RSA on cp5007 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345516 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-11-22 07:59:59 +0000 (expires in 362 days) https://wikitech.wikimedia.org/wiki/HTTPS [19:16:07] RECOVERY - ats-tls HTTPS en.wikipedia.org ECDSA on cp5007 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345515 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-11-22 07:59:59 +0000 (expires in 362 days) https://wikitech.wikimedia.org/wiki/HTTPS [19:16:19] PROBLEM - ats-tls HTTPS en.wikipedia.org RSA on cp4028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [19:16:25] PROBLEM - Ensure traffic_server is running for instance tls on cp4028 is CRITICAL: PROCS CRITICAL: 0 processes with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:16:27] PROBLEM - HTTPS Unified ECDSA on cp4028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [19:16:29] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp4028 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 error generating metric output - 665 bytes in 0.151 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:17:38] PROBLEM - ats-tls HTTPS en.wikipedia.org ECDSA on cp4032 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [19:17:42] PROBLEM - HTTPS Unified RSA on cp4032 is CRITICAL: SSL CRITICAL - failed to 
connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [19:17:45] (03PS1) 10Herron: logstash: create elk7 logstash role and assign to elk7 collectors [puppet] - 10https://gerrit.wikimedia.org/r/552881 (https://phabricator.wikimedia.org/T234854) [19:20:33] 10Operations, 10SRE-Access-Requests: Requesting access to wiki pages, RecentChanges, revisions and users info in DB for Matanel11111 and Nahum11 - https://phabricator.wikimedia.org/T239149 (10Urbanecm) You need to first ssh to your personal account and then do `become ` [19:20:35] (03PS1) 10Elukey: profile::analytics::search::airflow: fix directory ensure [puppet] - 10https://gerrit.wikimedia.org/r/552882 (https://phabricator.wikimedia.org/T236180) [19:20:54] (03CR) 10jerkins-bot: [V: 04-1] logstash: create elk7 logstash role and assign to elk7 collectors [puppet] - 10https://gerrit.wikimedia.org/r/552881 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [19:21:36] (03PS2) 10Herron: logstash: create elk7 logstash role and assign to elk7 collectors [puppet] - 10https://gerrit.wikimedia.org/r/552881 (https://phabricator.wikimedia.org/T234854) [19:21:44] RECOVERY - ats-tls HTTPS en.wikipedia.org RSA on cp4028 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345596 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-11-22 07:59:59 +0000 (expires in 362 days) https://wikitech.wikimedia.org/wiki/HTTPS [19:21:50] RECOVERY - Ensure traffic_server is running for instance tls on cp4028 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:21:50] RECOVERY - HTTPS Unified ECDSA on cp4028 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345590 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org 
(ECDSA) valid until 2020-11-22 07:59:59 +0000 (expires in 362 days) https://wikitech.wikimedia.org/wiki/HTTPS [19:21:52] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp4028 is OK: HTTP OK: HTTP/1.0 200 OK - 19875 bytes in 0.230 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:22:12] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp4028 is OK: HTTP OK: HTTP/1.1 200 Ok - 30114 bytes in 0.379 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:22:24] RECOVERY - Disk space on cp5007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp5007&var-datasource=eqsin+prometheus/ops [19:22:50] RECOVERY - ats-tls HTTPS en.wikipedia.org ECDSA on cp4032 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345550 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-11-22 07:59:59 +0000 (expires in 362 days) https://wikitech.wikimedia.org/wiki/HTTPS [19:22:52] RECOVERY - HTTPS Unified RSA on cp4032 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345548 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-11-22 07:59:59 +0000 (expires in 362 days) https://wikitech.wikimedia.org/wiki/HTTPS [19:23:08] RECOVERY - Disk space on cp4031 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4031&var-datasource=ulsfo+prometheus/ops [19:23:10] RECOVERY - Disk space on cp4032 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4032&var-datasource=ulsfo+prometheus/ops [19:23:26] RECOVERY - HTTPS Unified RSA on cp4028 is OK: SSL OK - OCSP 
staple validity for en.wikipedia.org has 345494 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-11-22 07:59:59 +0000 (expires in 362 days) https://wikitech.wikimedia.org/wiki/HTTPS [19:23:51] (03CR) 10Elukey: [C: 03+2] profile::analytics::search::airflow: fix directory ensure [puppet] - 10https://gerrit.wikimedia.org/r/552882 (https://phabricator.wikimedia.org/T236180) (owner: 10Elukey) [19:24:38] (03PS1) 10Dzahn: conftool: un-comment mw1298, add back to pool [puppet] - 10https://gerrit.wikimedia.org/r/552884 (https://phabricator.wikimedia.org/T215332) [19:25:00] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jgreen) >>! In T239139#5690948, @wiki_willy wrote: > @Jgreen - I gave @Jclark-ctr a heads up on this, so he'll starting... [19:25:53] 10Operations, 10hardware-requests: Hardware request for Postgres database for censorship monitoring scripts - https://phabricator.wikimedia.org/T238652 (10RobH) Ok, this is an odd one and doesn't meet any of our current configuration requirements. The only reason this isn't a VM is due to storage requirements... [19:27:45] (03CR) 10Dzahn: [C: 03+2] "As Effie pointed out this server was not getting any traffic but was expected to be repooled again." 
[puppet] - 10https://gerrit.wikimedia.org/r/552884 (https://phabricator.wikimedia.org/T215332) (owner: 10Dzahn) [19:29:12] (03PS1) 10EBernhardson: Add dsh group for search platform airflow [puppet] - 10https://gerrit.wikimedia.org/r/552885 [19:29:50] 10Operations, 10hardware-requests: Hardware request for Postgres database for censorship monitoring scripts - https://phabricator.wikimedia.org/T238652 (10RobH) [19:30:06] !log ema@cumin1001 conftool action : set/pooled=yes; selector: name=cp4032.ulsfo.wmnet,service=nginx [19:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:48] RECOVERY - Check systemd state on cp4032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:31:16] (03CR) 10Elukey: [C: 03+2] Add dsh group for search platform airflow [puppet] - 10https://gerrit.wikimedia.org/r/552885 (owner: 10EBernhardson) [19:31:19] (03PS1) 10Faidon Liambotis: Add three new Sentry PDU expansion units [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552886 (https://phabricator.wikimedia.org/T227632) [19:31:56] (03CR) 10jerkins-bot: [V: 04-1] Add three new Sentry PDU expansion units [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552886 (https://phabricator.wikimedia.org/T227632) (owner: 10Faidon Liambotis) [19:32:16] RECOVERY - Disk space on cp4028 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp4028&var-datasource=ulsfo+prometheus/ops [19:32:26] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1298.eqiad.wmnet [19:32:29] (03PS2) 10Faidon Liambotis: Add three new Sentry PDU expansion units [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552886 (https://phabricator.wikimedia.org/T227632) [19:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:07] (03CR) 10Faidon Liambotis: [C: 
03+2] Add three new Sentry PDU expansion units [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552886 (https://phabricator.wikimedia.org/T227632) (owner: 10Faidon Liambotis) [19:34:45] (03CR) 10CDanis: [C: 03+2] grafana1002: is now the server for grafana.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/552879 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [19:35:14] !log mw1298 - scap pull [19:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:33] 10Operations, 10hardware-requests: Hardware request for Postgres database for censorship monitoring scripts - https://phabricator.wikimedia.org/T238652 (10RobH) Alternatively, we use existing spare pool system wmf5174 or wmf5175 which have dual SSDs, and then swap in dual 2TB SFF disks if they have any spare i... [19:36:41] 10Operations, 10ops-codfw, 10ops-eqiad, 10netbox, 10Patch-For-Review: Document PDU models - https://phabricator.wikimedia.org/T227632 (10faidon) 05Open→03Resolved I went digging in RT and fixed it for all of them except the old/unracked/offline sdtpa PDUs. [19:40:20] nice ghost recovery page there [19:42:06] PROBLEM - Check systemd state on logstash2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:42:24] RECOVERY - mediawiki-installation DSH group on mw1298 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:42:39] apergos: which one did you get? [19:43:02] icinga [19:43:22] the only recoveries i see are not paging? [19:43:35] recovery: icinga1001... see email for details. from 4 minutes ago [19:43:38] does anybody know what needs to be done on deploy1001 when the repo is added for the first time? Does it need to be cloned manually? 
[19:43:59] hm yes iirc, plus the scap config [19:44:22] I mean you will always pull manually anyways [19:44:25] and "keyholder arm" if a new deployment key is involved [19:45:06] shouldn't need to be cloned manually. You can add it to the hiera data for the deploy host (if you're talking about scap3 deploys) [19:45:14] ah right [19:45:30] oh has that been updated since... [19:45:35] well years ago, right. heh [19:45:40] nice! [19:45:52] > hieradata/role/common/deployment_server.yaml [19:46:05] thcipriani: o/ it is there, but I currently see [19:46:15] OSError: [Errno 2] No such file or directory: '/srv/deployment/search/airflow/.git/config-files' [19:46:18] 19:33:42 ERROR - deploy failed: [Errno 2] No such file or directory: '/srv/deployment/search/airflow/.git/config-files' [19:46:53] ok super weird [19:47:06] apergos: likely same issue as mentioned during the meeting today [19:47:08] if I rm -rf the directory of the repo (empty) and run puppet it works [19:47:10] and that volans emailed about :) [19:47:37] yep worked [19:48:11] cdanis: yeah I figured as much [19:48:42] PROBLEM - Check systemd state on logstash2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:48:46] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1298.eqiad.wmnet [19:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:01] so in the last month, we've had six icinga recoveries and two alerts -- that means we can ignore the next four alerts, right? [19:49:03] thcipriani: seems that there might be some puppet race condition with https://github.com/wikimedia/puppet/blob/production/modules/scap/lib/puppet/provider/scap_source/default.rb#L163 [19:49:32] PROBLEM - Check systemd state on logstash1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:50:37] rlazarus: eyyyy [19:52:35] the problem with forgiving close bracket matching is that new open brackets always count [19:55:21] (03PS1) 10CDanis: logstash collector: use proper ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/552887 [19:56:04] PROBLEM - Check systemd state on logstash1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:57:39] elukey: hrm, that's possible. Not sure if I understand the full OOO that would lead to the race-condition, though [19:57:58] 10Operations, 10hardware-requests: Hardware request for Postgres database for censorship monitoring scripts - https://phabricator.wikimedia.org/T238652 (10RobH) [19:58:25] (03CR) 10CDanis: [C: 03+2] logstash collector: use proper ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/552887 (owner: 10CDanis) [19:58:33] thcipriani: so the repo dir was created (I believe by puppet) but then the clone didn't happen since the dir was already created. I can try to create a task tomorrow if you want [19:59:50] RECOVERY - Check systemd state on logstash1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:05] cscott, arlolra, subbu, halfak, and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191125T2000). 
[20:00:40] RECOVERY - Check systemd state on logstash2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:43] (03PS1) 10Dzahn: conftool: move mw1298 to the jobrunner section [puppet] - 10https://gerrit.wikimedia.org/r/552888 (https://phabricator.wikimedia.org/T215332) [20:01:01] (03CR) 10Mholloway: [C: 03+2] Update wikifeeds to 2019-11-25-173023-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/552864 (https://phabricator.wikimedia.org/T235652) (owner: 10Mholloway) [20:01:10] RECOVERY - Check systemd state on logstash1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:14] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Upgrade grafana to 6.4.4 - https://phabricator.wikimedia.org/T220838 (10CDanis) 05Open→03Resolved Grafana 6.4.4 is now in use at https://grafana.wikimedia.org. [20:01:21] (03Merged) 10jenkins-bot: Update wikifeeds to 2019-11-25-173023-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/552864 (https://phabricator.wikimedia.org/T235652) (owner: 10Mholloway) [20:02:24] RECOVERY - Check systemd state on logstash2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:02:49] 10Operations, 10ops-eqiad, 10DC-Ops: Asset tag remaining cablemgmt in eqiad - https://phabricator.wikimedia.org/T239110 (10faidon) 05Resolved→03Open Some of these were not done - I suspect partially because my ranges were misparsed as individual items (should have made that clearer, apologies!). The follo... [20:04:12] elukey: yeah, if you could file a task that'd be perfect [20:04:50] !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' . 
[20:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:46] !log mholloway-shell@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [20:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:00] 10Operations, 10ops-esams: Update spare QFX labels - https://phabricator.wikimedia.org/T237014 (10faidon) [20:06:05] 10Operations, 10hardware-requests: Hardware request for Postgres database for censorship monitoring scripts - https://phabricator.wikimedia.org/T238652 (10RobH) a:05RobH→03faidon Please note this task is now pending the approval of @faidon in conjunction with associated HDD purchase task T238652. @faidon:... [20:07:04] !log mholloway-shell@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [20:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:19] (03CR) 10Dzahn: [C: 03+2] "per site.pp mw1298 is a jobrunner, so move it to the correct section in conftool" [puppet] - 10https://gerrit.wikimedia.org/r/552888 (https://phabricator.wikimedia.org/T215332) (owner: 10Dzahn) [20:15:18] 10Operations, 10SRE-Access-Requests: Requesting access to wiki pages, RecentChanges, revisions and users info in DB for Matanel11111 and Nahum11 - https://phabricator.wikimedia.org/T239149 (10Aklapper) 05Open→03Declined This is what Toolforge is for, hence I'm boldly declining this request as it's about pr... 
[20:15:28] (03PS2) 10Dzahn: conftool: move mw1298 to the jobrunner section [puppet] - 10https://gerrit.wikimedia.org/r/552888 (https://phabricator.wikimedia.org/T215332) [20:18:55] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10RobH) [20:20:50] (03PS3) 10Herron: logstash: create elk7 logstash role and assign to elk7 collectors [puppet] - 10https://gerrit.wikimedia.org/r/552881 (https://phabricator.wikimedia.org/T234854) [20:22:03] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1298.eqiad.wmnet [20:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:11] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1298.eqiad.wmnet [20:31:19] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1298.eqiad.wmnet [20:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:51] ^ don't know why this server has "weight: 0" [20:32:06] and not 10 [20:36:26] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 55.2 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:37:46] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw1298.eqiad.wmnet [20:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:54] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: (C)60 le (W)70 le 75.46 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:40:12] (03PS4) 10Herron: logstash: create elk7 logstash role and assign to elk7 
collectors [puppet] - 10https://gerrit.wikimedia.org/r/552881 (https://phabricator.wikimedia.org/T234854) [20:44:05] (03CR) 10Dzahn: admins: add Max Semenik as ldap_only_admin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) (owner: 10Dzahn) [20:44:19] (03CR) 10Herron: "PCC https://puppet-compiler.wmflabs.org/compiler1003/19607/" [puppet] - 10https://gerrit.wikimedia.org/r/552881 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [20:46:19] (03CR) 10Dzahn: "After the comments, not sure what is the right solution here. The user is the same user in LDAP (UID 1220) before and after. It needs to m" [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) (owner: 10Dzahn) [20:50:14] (03CR) 10Dzahn: "If the reason to keep it "absented" is to make sure keys are really removed.. we can manually check with cumin before we do. If the reaso" [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) (owner: 10Dzahn) [20:52:27] (03PS3) 10Dzahn: admins: add Max Semenik as ldap_only_admin [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) [20:52:46] (03CR) 10jerkins-bot: [V: 04-1] admins: add Max Semenik as ldap_only_admin [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) (owner: 10Dzahn) [20:53:52] (03PS4) 10Dzahn: admins: add Max Semenik as ldap_only_admin [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) [20:54:55] (03CR) 10CDanis: [C: 03+1] poolcounter: Install and run the prometheus exporter alongside poolcounter. 
[puppet] - 10https://gerrit.wikimedia.org/r/552875 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus) [20:55:11] (03PS2) 10Dzahn: wikimania_scholarships app: use codfw database when in codfw [puppet] - 10https://gerrit.wikimedia.org/r/552551 (https://phabricator.wikimedia.org/T224247) [20:55:16] (03PS1) 10Andrew Bogott: nova: add support for the 'placement' api service [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161) [20:56:43] (03CR) 10Dzahn: [C: 03+2] wikimania_scholarships app: use codfw database when in codfw [puppet] - 10https://gerrit.wikimedia.org/r/552551 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [20:58:19] (03CR) 10jerkins-bot: [V: 04-1] nova: add support for the 'placement' api service [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [20:59:06] 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (10jijiki) Version 1.10.0 is live, please mark this as resolved if everything works as expected. [20:59:21] 10Operations, 10ops-eqiad, 10DC-Ops: Asset tag remaining cablemgmt in eqiad - https://phabricator.wikimedia.org/T239110 (10Cmjohnson) Yes, I misread the task....I will update and resolve once completed [21:00:04] Reedy and sbassett: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191125T2100). 
[21:00:08] (03PS2) 10Andrew Bogott: nova: add support for the 'placement' api service [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161) [21:01:49] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki) [21:05:36] (03PS3) 10Zoranzoki21: Equalization of wgPopupsReferencePreviews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552643 [21:07:14] 10Operations, 10GLOW, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T238868 (10Iflorez) Thank you @crusnov, Below is the full list. I request access for the following sites and their mobile sites. ID.wikipedia, SU.wikipedia, JV.wikipedia,... [21:08:00] (03PS3) 10Andrew Bogott: nova: add support for the 'placement' api service [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161) [21:12:09] (03PS2) 10Dzahn: iegreview app: use codfw database when in codfw [puppet] - 10https://gerrit.wikimedia.org/r/552552 (https://phabricator.wikimedia.org/T224247) [21:14:30] 10Operations, 10ops-eqiad, 10Patch-For-Review: reimage WMF6937/mw1298 - https://phabricator.wikimedia.org/T215332 (10Dzahn) mw1298 is now back here, with weight 10 as a jobrunner https://config-master.wikimedia.org/pybal/eqiad/jobrunner [21:14:39] (03PS4) 10Andrew Bogott: nova: add support for the 'placement' api service [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161) [21:14:40] (03PS1) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) [21:16:18] 10Operations, 10ops-eqiad: frqueue1001 system battery needs replacement - https://phabricator.wikimedia.org/T237582 (10Cmjohnson) [21:17:32] (03PS5) 10Andrew Bogott: nova: add support for the 'placement' api service [puppet] - 10https://gerrit.wikimedia.org/r/552890 
(https://phabricator.wikimedia.org/T239161) [21:17:34] (03PS2) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) [21:18:16] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jclark-ctr) [21:18:56] (03CR) 10RLazarus: [C: 03+2] poolcounter: Install and run the prometheus exporter alongside poolcounter. [puppet] - 10https://gerrit.wikimedia.org/r/552875 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus) [21:21:22] 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10jijiki) Currently: | Rack |A 5| A6|A 7|B 6|B 7| C 6| D 4 | D 5 |mw servers|6|6|17|21|6| 30|6 (decom)|30 (decom) We will decommission 36 servers from... [21:24:06] PROBLEM - Host ms-be2056 is DOWN: PING CRITICAL - Packet loss = 100% [21:26:12] RECOVERY - Host ms-be2056 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms [21:27:23] (03PS1) 10RLazarus: poolcounter: Specify port 9106 for the exporter [puppet] - 10https://gerrit.wikimedia.org/r/552896 (https://phabricator.wikimedia.org/T237407) [21:28:15] (03CR) 10CDanis: [C: 03+1] poolcounter: Specify port 9106 for the exporter [puppet] - 10https://gerrit.wikimedia.org/r/552896 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus) [21:29:39] (03CR) 10RLazarus: [C: 03+2] poolcounter: Specify port 9106 for the exporter [puppet] - 10https://gerrit.wikimedia.org/r/552896 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus) [21:32:38] RECOVERY - Wikitech and wt-static content in sync on labweb1002 is OK: wikitech-static OK - wikitech and wikitech-static in sync (191590 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [21:32:38] RECOVERY - Wikitech and wt-static content in 
sync on labweb1001 is OK: wikitech-static OK - wikitech and wikitech-static in sync (191590 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [21:32:39] (03CR) 10Dzahn: [C: 03+2] iegreview app: use codfw database when in codfw [puppet] - 10https://gerrit.wikimedia.org/r/552552 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [21:37:44] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jclark-ctr) Racked Server Connected to port 26 on fasw-c1a & fasw-c2a Entered production idrac password @Jgreen... [21:39:14] (03PS1) 10RLazarus: poolcounter: Set restart => true for the exporter service. [puppet] - 10https://gerrit.wikimedia.org/r/552898 (https://phabricator.wikimedia.org/T237407) [21:40:36] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jclark-ctr) [21:41:37] (03CR) 10CDanis: [C: 03+1] poolcounter: Set restart => true for the exporter service. [puppet] - 10https://gerrit.wikimedia.org/r/552898 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus) [21:43:11] (03CR) 10RLazarus: [C: 03+2] poolcounter: Set restart => true for the exporter service. 
[puppet] - 10https://gerrit.wikimedia.org/r/552898 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus) [21:48:11] 10Operations, 10ops-eqiad, 10DC-Ops: Duplicate cable label in cr1-eqiad/cr2-eqiad - https://phabricator.wikimedia.org/T239098 (10wiki_willy) a:03Jclark-ctr [21:49:54] 10Operations, 10ops-eqiad, 10Analytics: analytics1057's BBU is faulty - https://phabricator.wikimedia.org/T239045 (10wiki_willy) a:03Jclark-ctr [21:52:30] (03CR) 10Jhedden: [C: 03+1] nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [21:57:43] (03CR) 10Jhedden: [C: 03+1] "LGTM overall, non-blocking comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [22:01:11] (03CR) 10Dzahn: [C: 03+2] racktables: use codfw database when in codfw [puppet] - 10https://gerrit.wikimedia.org/r/552553 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [22:05:53] jouncebot: next [22:05:53] In 0 hour(s) and 54 minute(s): Evening SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191125T2300) [22:07:03] (03PS1) 10Jgreen: add frdb1003.frack.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/552899 (https://phabricator.wikimedia.org/T239139) [22:08:07] 10Operations, 10ops-eqiad, 10Analytics: analytics1057's BBU is faulty - https://phabricator.wikimedia.org/T239045 (10Jclark-ctr) @elukey No spare bbu around [22:08:39] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10RobH) >>! In T239139#5691375, @Jclark-ctr wrote: > Racked Server > > Connected to port 26 on fasw-c1a & fasw-c2a >... 
[22:10:22] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10RobH) forgot to remove from disabled group: ` robh@fasw-c-eqiad# show | compare [edit interfaces interface-range dis... [22:12:12] (03CR) 10Jgreen: [C: 03+2] add frdb1003.frack.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/552899 (https://phabricator.wikimedia.org/T239139) (owner: 10Jgreen) [22:13:07] !log authdns update to deploy I21ddc1a3e [22:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:19] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Conversion to volunteer NDA for MaxSem - https://phabricator.wikimedia.org/T238960 (10RStallman-legalteam) Hi @MaxSem, Happy to prepare an NDA for you. I will need your mailing address as well as personal email to send you t... [22:13:40] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jgreen) [22:15:57] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10RobH) [22:20:51] (03PS1) 10Urbanecm: Add gewikimedia to special.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552901 (https://phabricator.wikimedia.org/T239173) [22:21:28] (03CR) 10jerkins-bot: [V: 04-1] Add gewikimedia to special.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552901 (https://phabricator.wikimedia.org/T239173) (owner: 10Urbanecm) [22:23:40] (03PS2) 10Urbanecm: Add gewikimedia to special.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552901 (https://phabricator.wikimedia.org/T239173) [22:24:20] (03CR) 10jerkins-bot: [V: 04-1] Add gewikimedia to special.dblist 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/552901 (https://phabricator.wikimedia.org/T239173) (owner: 10Urbanecm) [22:25:39] (03PS3) 10Urbanecm: Add gewikimedia to special.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552901 (https://phabricator.wikimedia.org/T239173) [22:28:53] 10Operations, 10ops-eqiad, 10DC-Ops: Duplicate cable label in cr1-eqiad/cr2-eqiad - https://phabricator.wikimedia.org/T239098 (10Jclark-ctr) cr2-eqiad:xe-3/3/3 cable incorrect in netbox . correct is 2649 [22:29:43] 10Operations, 10ops-eqiad, 10DC-Ops: Duplicate cable label in cr1-eqiad/cr2-eqiad - https://phabricator.wikimedia.org/T239098 (10Jclark-ctr) a:05Jclark-ctr→03ayounsi @ayounsi Can you update routers to reflect . Thanks!! [22:31:51] (03PS3) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) [22:32:02] (03PS1) 10RLazarus: poolcounter: Add ferm rule for the exporter; move it to profile::poolcounter. [puppet] - 10https://gerrit.wikimedia.org/r/552904 (https://phabricator.wikimedia.org/T237407) [22:34:53] (03CR) 10CDanis: [C: 03+1] poolcounter: Add ferm rule for the exporter; move it to profile::poolcounter. 
[puppet] - 10https://gerrit.wikimedia.org/r/552904 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus) [22:35:18] (03CR) 10Jhedden: [C: 03+1] nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [22:38:23] (03PS4) 10DannyS712: Remove `move-rootuserpages` from user on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552374 (https://phabricator.wikimedia.org/T238842) [22:39:31] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10RobH) a:05Jclark-ctr→03Jgreen [22:45:44] 10Operations, 10serviceops: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (10Dzahn) 05Open→03Resolved [22:45:46] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Dzahn) [22:46:26] (03PS1) 10RLazarus: poolcounter: Refactor: In role::poolcounter::server, use profile::poolcounter. [puppet] - 10https://gerrit.wikimedia.org/r/552910 [22:47:03] (03CR) 10CDanis: [C: 03+1] "LGTM assuming PCC is happy" [puppet] - 10https://gerrit.wikimedia.org/r/552910 (owner: 10RLazarus) [22:48:56] (03CR) 10RLazarus: [C: 03+2] "No changes: https://puppet-compiler.wmflabs.org/compiler1001/19614/poolcounter1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/552910 (owner: 10RLazarus) [22:53:29] (03PS2) 10RLazarus: poolcounter: Add ferm rule for the exporter; move it to profile::poolcounter. 
[puppet] - 10https://gerrit.wikimedia.org/r/552904 (https://phabricator.wikimedia.org/T237407) [22:59:23] 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Prod compare endpoint missing offset object (with from & to keys) on diff items - https://phabricator.wikimedia.org/T238846 (10Tsevener) 05Open→03Resolved [22:59:27] 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (10Tsevener) [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191125T2300). [23:00:05] urandom: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:10] 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (10Tsevener) 05Open→03Resolved [23:00:14] (03PS3) 10RLazarus: poolcounter: Add ferm rule for the exporter; move it to profile::poolcounter. [puppet] - 10https://gerrit.wikimedia.org/r/552904 (https://phabricator.wikimedia.org/T237407) [23:00:20] 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (10Tsevener) Looking good in Prod, thanks everyone! [23:00:42] o/ [23:00:54] urandom: I can SWAT today! [23:01:07] Urbanecm: great; thanks! 
[23:02:11] (03CR) 10Urbanecm: [C: 03+2] kask-echoseen: Do not report dupes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552257 (https://phabricator.wikimedia.org/T237143) (owner: 10Mobrovac) [23:02:44] (03CR) 10Urbanecm: [C: 03+2] Add gewikimedia to special.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552901 (https://phabricator.wikimedia.org/T239173) (owner: 10Urbanecm) [23:02:56] (03Merged) 10jenkins-bot: kask-echoseen: Do not report dupes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552257 (https://phabricator.wikimedia.org/T237143) (owner: 10Mobrovac) [23:03:23] urandom: can you test at mwdebug1001, please? [23:03:24] (03Merged) 10jenkins-bot: Add gewikimedia to special.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552901 (https://phabricator.wikimedia.org/T239173) (owner: 10Urbanecm) [23:03:32] will do! [23:04:15] thank you urandom [23:06:08] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:06:12] Urbanecm: LGTM [23:06:26] thanks urandom , syncing [23:06:56] (03CR) 10CDanis: [C: 03+1] poolcounter: Add ferm rule for the exporter; move it to profile::poolcounter. [puppet] - 10https://gerrit.wikimedia.org/r/552904 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus) [23:07:09] (03PS1) 10Faidon Liambotis: Fix some spelling issues [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552919 [23:07:34] (03CR) 10RLazarus: [C: 03+2] poolcounter: Add ferm rule for the exporter; move it to profile::poolcounter. [puppet] - 10https://gerrit.wikimedia.org/r/552904 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus) [23:07:44] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: d71b0ab: kask-echoseen: Do not report dupes (T237143) (duration: 00m 53s) [23:07:47] urandom: done! 
[23:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:49] T237143: Log warning: Duplicate get(): "officewiki:echo:seen:message:time:{n}" fetched 2 times - https://phabricator.wikimedia.org/T237143 [23:07:54] (03CR) 10jerkins-bot: [V: 04-1] Fix some spelling issues [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552919 (owner: 10Faidon Liambotis) [23:07:54] Urbanecm: thanks! [23:07:59] yw! [23:09:18] !log urbanecm@deploy1001 Synchronized dblists/: SWAT: aed2369: Add gewikimedia to special.dblist (T239173) (duration: 00m 52s) [23:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:23] T239173: gewikimedia's w interwiki links to (nonexistent) gewiki - https://phabricator.wikimedia.org/T239173 [23:09:34] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:10:03] !log urbanecm@deploy1001 update-interwiki-cache aborted: Update interwiki cache (duration: 00m 01s) [23:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:30] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552922 [23:10:32] (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552922 (owner: 10Urbanecm) [23:10:53] (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552922 (owner: 10Urbanecm) [23:11:17] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552922 (owner: 10Urbanecm) [23:12:32] !log urbanecm@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 14s) [23:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:14] !log Evening SWAT done [23:14:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:43] 10Operations, 10ChangeProp, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 4 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10holger.knust) a:03holger.knust [23:21:34] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:23:17] (03PS1) 10RobH: updating to remove 709-BBFK from required skus [software] - 10https://gerrit.wikimedia.org/r/552926 [23:24:18] (03CR) 10RobH: [C: 03+2] updating to remove 709-BBFK from required skus [software] - 10https://gerrit.wikimedia.org/r/552926 (owner: 10RobH) [23:24:45] (03Merged) 10jenkins-bot: updating to remove 709-BBFK from required skus [software] - 10https://gerrit.wikimedia.org/r/552926 (owner: 10RobH) [23:28:07] (03CR) 10Krinkle: "Remove the HHVMRequestInit.php.txt symlink itself as well. LGTM otherwise, good to go." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542184 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [23:31:05] 10Operations, 10DNS, 10Internet-Archive, 10Traffic, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216 (10Dzahn) >>! In T99216#5689785, @Aklapper wrote: > Ah, thanks. But who exactly is supposed to answer that question... [23:31:47] ACKNOWLEDGEMENT - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Mathew.onipe This server auto deploys wdqs and the last release has some issues with updater. I will revert wdqs release version later today. Server is currently not serving traffic so we should be fine - The acknowledgement expires at: 2019-11-26 10:28:35. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:36:44] PROBLEM - Wikitech and wt-static content in sync on labweb1002 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (200978s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [23:36:44] PROBLEM - Wikitech and wt-static content in sync on labweb1001 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (200978s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [23:39:24] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Conversion to volunteer NDA for MaxSem - https://phabricator.wikimedia.org/T238960 (10Dzahn) Hi @RStallman-legalteam The email address is maxsem.wiki@gmail.com (no worries he already made it public all this time on the wiki us... [23:40:26] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:40:26] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10Dzahn) 05Open→03Declined [23:42:47] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Dzahn) a:03jbond [23:47:18] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:47:19] (03CR) 10Volans: "A comment inline" (032 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552874 (https://phabricator.wikimedia.org/T237469) (owner: 10CRusnov) [23:54:38] (03CR) 10Volans: "A comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov)